I'm using StringTie to identify novel isoforms from mouse RNA-seq data (60M PE reads/sample, stranded, ribodepletion). The data have been mapped to the mouse genome using STAR. I'm finding that 57% of isoforms are novel. Someone mentioned to me that StringTie calls so many isoforms novel because a single-base difference between transcripts is sufficient to call them “different”, but that there are ways to relax this so that only truly novel exons are identified. I've looked in the manual for a way to relax this but couldn't find anything. Here is the code I used for STAR as well as StringTie:
Make Genome Index
STAR --runThreadN 40 --runMode genomeGenerate --sjdbOverhang 199 --genomeDir GENOME_data/star \
  --genomeFastaFiles GENOME_data/GRCm38.primary_assembly.genome.fa \
  --sjdbGTFfile GENOME_data/Mus_musculus.GRCm38.100.gtf
Map reads to Genome
STAR --genomeDir GENOME_data/star --readFilesIn 2_Forward.fq 2_Reverse.fq --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 16000000000 --outSAMunmapped Within --twopassMode Basic --outFilterMultimapNmax 1 --quantMode TranscriptomeSAM --outSAMstrandField intronMotif --runThreadN 16 --outFileNamePrefix "2_star/"
Use StringTie to Quantify Transcripts
stringtie-2.1.4/stringtie 2_star/Aligned.sortedByCoord.out.bam -G GENOME_data/Mus_musculus.GRCm38.100.gtf -A 2_gene_abund.tab -B -o 2.gtf
Well, you could always cluster the transcript sequences after the fact. Use MMseqs2:
mmseqs easy-cluster
and cluster at a high threshold (e.g., 97% identity and 97% bidirectional coverage); that would eliminate "redundant" isoforms.
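Something like the following is a rough sketch of what I mean, not a tested pipeline. It assumes gffread and MMseqs2 (the binary is called mmseqs) are on your PATH; the function name cluster_isoforms and the output prefix are made up for illustration. gffread pulls the spliced transcript sequences out of the StringTie GTF, then mmseqs clusters them at 97% identity, with --cov-mode 0 so the 97% coverage has to hold in both directions.

```shell
#!/usr/bin/env bash
# Sketch: collapse near-identical StringTie transcripts by sequence clustering.
# Assumes gffread and MMseqs2 (binary: mmseqs) are installed and on PATH.

cluster_isoforms() {
    local gtf="$1"      # StringTie output GTF, e.g. 2.gtf
    local genome="$2"   # genome FASTA that the reads were mapped to
    local prefix="$3"   # hypothetical output prefix, e.g. 2_clust

    # Write the spliced sequence of every transcript in the GTF to FASTA.
    gffread -w "${prefix}_transcripts.fa" -g "$genome" "$gtf" || return 1

    # Cluster at >= 97% sequence identity. With --cov-mode 0 the coverage
    # threshold (-c 0.97) must be met by both query and target.
    mmseqs easy-cluster "${prefix}_transcripts.fa" "$prefix" "${prefix}_tmp" \
        --min-seq-id 0.97 -c 0.97 --cov-mode 0
}

# Only run when the script is called with exactly three arguments, e.g.
#   bash cluster_isoforms.sh 2.gtf GENOME_data/GRCm38.primary_assembly.genome.fa 2_clust
if [ "$#" -eq 3 ]; then
    cluster_isoforms "$@"
fi
```

The representative sequences should end up in a file named after the prefix with _rep_seq.fasta appended, and the cluster memberships in the matching _cluster.tsv. Keep in mind that this compares sequences, not exon structures, so two long transcripts that differ only by a short exon can still fall into the same cluster at 97%.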