Hi,
I originally used featureCounts to assign reads to known transcripts of the mm10 genome. The percentage of fragments assigned was 70-75% for my samples. With this original GTF file there were 841916 features (exons) and 55421 meta-features (genes).
I have since come across StringTie and wanted to repeat the assignment of reads, this time using the StringTie output GTF, which should contain novel transcripts as well as the known ones from the original GTF.
However, when I ran featureCounts with this new GTF I got a much lower assignment rate (20-25%). Also, the number of features is much smaller (438272) whereas the number of meta-features is larger (80351), meaning fewer exons but more genes in the new GTF? The code I used to make the new GTF and then assign reads is below.
stringtie all_samples_sortedByCoord.out.bam -o all_samples_sortedByCoord.gtf -p 8 -G gencode.vM23.primary_assembly.annotation.gtf --fr -A all_samples_sortedByCoord.tab
featureCounts -T 4 -p -g gene_id -s 2 -a all_samples_sortedByCoord.gtf -o PE_samples_featureCounts_novel_gtf.txt BA*_sortedByCoord.out.bam
I'm not sure where I've gone wrong. Why are fewer reads being assigned with a GTF that should contain both known and novel transcripts than with the known transcripts alone? And why are there more meta-features (genes) but fewer features (exons) in my new GTF? Any help appreciated!
Thanks for the suggestion. I tried it, and with that output GTF featureCounts gave me the original assignment percentage. But I read in the manual that the -e option causes read bundles with no reference transcript to be skipped, so I think this is leaving out any novel transcripts?
Manual description: "Limits the processing of read alignments to only estimate and output the assembled transcripts matching the reference transcripts given with the -G option (requires -G, recommended for -B/-b). With this option, read bundles with no reference transcripts will be entirely skipped, which may provide a considerable speed boost when the given set of reference transcripts is limited to a set of target genes, for example."
Yes, that's weird indeed. I don't use StringTie at all, so this is new to me. I think the two operations need to be combined. If you follow their "Differential expression analysis" workflow you'll see a merge step that generates a merged GTF file: http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual
This should satisfy both of your requirements of having novel and known transcripts annotated.
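For reference, the merge-based workflow could be sketched roughly as below. This is only a sketch under assumptions, not something I have run on this data. The BAM names and reference GTF come from the commands in the question, while stringtie_merged.gtf, mergelist.txt, the output table name and the run / is_dry_run helpers are made-up names. It uses --rf rather than the --fr from the question, on the assumption that the library is reverse-stranded (which is what -s 2 tells featureCounts); if that is not true for your data, change STRAND. By default it only prints the commands; set DRY_RUN=0 to actually execute them.

```shell
#!/bin/sh
set -eu

REF=gencode.vM23.primary_assembly.annotation.gtf  # reference GTF from the question
MERGED=stringtie_merged.gtf                       # hypothetical output name
LIST=mergelist.txt                                # hypothetical list of per-sample GTFs
STRAND=--rf  # assumption: reverse-stranded library, matching featureCounts -s 2

# Dry run (print commands only) unless DRY_RUN=0 is set.
is_dry_run() { [ "${DRY_RUN:-1}" != "0" ]; }
run() { if is_dry_run; then printf '%s\n' "$*"; else "$@"; fi; }

novel_gtf_workflow() {
    # Start with an empty merge list (only when really running).
    is_dry_run || : > "$LIST"

    # 1. Assemble each sample separately, guided by the reference annotation.
    for bam in "$@"; do
        gtf="${bam%.bam}.gtf"
        run stringtie "$bam" -G "$REF" -o "$gtf" -p 8 "$STRAND"
        is_dry_run || printf '%s\n' "$gtf" >> "$LIST"
    done

    # 2. Merge the per-sample assemblies and the reference into one GTF.
    run stringtie --merge -G "$REF" -o "$MERGED" "$LIST"

    # 3. Count reads against the merged (known + novel) annotation.
    run featureCounts -T 4 -p -g gene_id -s 2 -a "$MERGED" \
        -o PE_samples_featureCounts_merged_gtf.txt "$@"
}

novel_gtf_workflow BA*_sortedByCoord.out.bam
```

As far as I know the merged GTF uses StringTie's own gene_id values (MSTRG.*), so the counts table would be keyed by those rather than by the reference gene IDs.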