Using StringTie output GTF with featureCounts to assign reads: low assigned percentage
5.1 years ago
sg197 ▴ 40

Hi,

I originally used featureCounts to assign reads to known transcripts of the mm10 genome. Between 70% and 75% of fragments were assigned across my samples. With this original GTF file there were 841916 features (exons) and 55421 meta-features (genes).

I have since come across StringTie and wanted to repeat the read assignment, this time using the StringTie output GTF, which should contain novel transcripts as well as the known ones from the original GTF.

However, when I run featureCounts with this new GTF, far fewer reads are assigned (20-25%). The number of features is also much smaller (438272), whereas the number of meta-features is larger (80351), meaning fewer exons but more genes in the new GTF? The commands I used to make the new GTF and then assign reads are below.

stringtie all_samples_sortedByCoord.out.bam -o all_samples_sortedByCoord.gtf -p 8 -G gencode.vM23.primary_assembly.annotation.gtf --fr -A all_samples_sortedByCoord.tab

featureCounts -T 4 -p -g gene_id -s 2 -a all_samples_sortedByCoord.gtf -o PE_samples_featureCounts_novel_gtf.txt BA*_sortedByCoord.out.bam

I'm not sure where I've gone wrong. Why are fewer reads assigned with a GTF that should contain both known and novel transcripts than with the known transcripts alone? And why does the new GTF have more meta-features (genes) but fewer features (exons)? Any help appreciated!
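
The feature and meta-feature numbers can be checked against each GTF directly, independently of featureCounts. The sketch below counts exon rows and the distinct `gene_id` values on those rows. It assumes a standard tab-separated GTF with a `gene_id "..."` attribute on exon lines; `count_gtf_features` is a hypothetical helper name, not part of either tool.

```shell
# Count exon rows ("features") and distinct gene_id values on those rows
# ("meta-features") in a GTF file. Assumes tab-separated GTF columns.
count_gtf_features() {
    awk -F'\t' '
        # Skip comment lines; only look at rows whose feature type is "exon".
        $0 !~ /^#/ && $3 == "exon" {
            exons++
            # Pull the gene_id "..." attribute out of column 9.
            if (match($9, /gene_id "[^"]+"/)) {
                genes[substr($9, RSTART, RLENGTH)] = 1
            }
        }
        END {
            n = 0
            for (g in genes) n++
            printf "exons=%d genes=%d\n", exons, n
        }
    ' "$1"
}

# Usage (file names taken from the commands above):
#   count_gtf_features gencode.vM23.primary_assembly.annotation.gtf
#   count_gtf_features all_samples_sortedByCoord.gtf
```

If the StringTie GTF turns out to use its own gene identifiers rather than the GENCODE ones, the gene counts of the two files are not directly comparable, which may account for part of the difference.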

RNA-Seq featurecounts stringtie • 2.1k views
5.1 years ago
Mark ★ 1.6k

The StringTie manual states:

Note that if option -e is not used the reference transcripts need to be fully covered by reads in order to be included in StringTie's output. In that case, other transcripts assembled from the data by StringTie and not present in the reference file will be printed as well.

Try the -e option in StringTie to see what effect it has.
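
To compare the effect between runs, the assigned percentage per sample can be read from the `.summary` file that featureCounts writes next to its output. The sketch below assumes the usual tab-separated layout (a `Status` header row followed by one row per category, with one column per BAM file); `assigned_percent` is a hypothetical helper name.

```shell
# Print the percentage of fragments in the "Assigned" row for each BAM
# column of a featureCounts .summary file (assumed tab-separated).
assigned_percent() {
    awk -F'\t' '
        # Header row: remember the sample (BAM) name of every column.
        NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; ncol = NF; next }
        {
            for (i = 2; i <= NF; i++) {
                total[i] += $i                       # sum over all categories
                if ($1 == "Assigned") assigned[i] = $i
            }
        }
        END {
            for (i = 2; i <= ncol; i++) {
                pct = (total[i] > 0) ? 100 * assigned[i] / total[i] : 0
                printf "%s\t%.1f\n", name[i], pct
            }
        }
    ' "$1"
}

# Usage (output name taken from the featureCounts command in the question):
#   assigned_percent PE_samples_featureCounts_novel_gtf.txt.summary
```

The other rows of the same file (for example `Unassigned_NoFeatures` versus `Unassigned_Ambiguity`) may also hint at why reads are not being assigned.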


Thanks for the suggestion. I tried it, and with that output GTF featureCounts gave me the original assigned percentage. But I read in the manual that the -e option causes read bundles with no reference transcript to be skipped, so I think this leaves out any novel transcripts? The manual description of -e:

Limits the processing of read alignments to only estimate and output the assembled transcripts matching the reference transcripts given with the -G option (requires -G, recommended for -B/-b). With this option, read bundles with no reference transcripts will be entirely skipped, which may provide a considerable speed boost when the given set of reference transcripts is limited to a set of target genes, for example.


Yes, that's weird indeed. I don't use StringTie at all, so this is new to me. I think the two operations need to be combined. If you follow their "Differential expression analysis" workflow, you'll see a merge step that generates a merged GTF file: http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual

This should satisfy both of your requirements: having novel and known transcripts annotated.
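
A rough sketch of that assemble, merge, then count sequence is below. It is an outline under assumptions, not a tested pipeline: `assemble_merge_count`, `run`, `stringtie_merged.gtf` and the featureCounts output name are hypothetical, and by default the commands are only printed (set `DRY_RUN=0` to execute them). One further assumption to verify: the question pairs StringTie `--fr` with featureCounts `-s 2`, whereas `-s 2` (reversely stranded) normally corresponds to an fr-firststrand library, which StringTie selects with `--rf`, so the sketch uses `--rf`.

```shell
REF_GTF=gencode.vM23.primary_assembly.annotation.gtf
MERGED_GTF=stringtie_merged.gtf   # hypothetical output name
STRAND_FLAG="--rf"                # assumption: library matches featureCounts -s 2

# Print each command; execute it only when DRY_RUN is set to something other than 1.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" != 1 ]; then "$@"; fi
}

# Arguments: coordinate-sorted BAM files, one per sample
# (file names are assumed to contain no spaces).
assemble_merge_count() {
    sample_gtfs=""
    # 1. Assemble each sample, guided by the reference annotation (no -e,
    #    so novel transcripts are kept).
    for bam in "$@"; do
        gtf="${bam%.bam}.stringtie.gtf"
        run stringtie "$bam" -G "$REF_GTF" "$STRAND_FLAG" -p 8 -o "$gtf"
        sample_gtfs="$sample_gtfs $gtf"
    done
    # 2. Merge the per-sample assemblies with the reference into one GTF.
    #    $sample_gtfs is left unquoted on purpose so it splits into file names.
    run stringtie --merge -G "$REF_GTF" -p 8 -o "$MERGED_GTF" $sample_gtfs
    # 3. Count fragments against the merged annotation.
    run featureCounts -T 4 -p -s 2 -g gene_id -a "$MERGED_GTF" \
        -o PE_samples_featureCounts_merged_gtf.txt "$@"
}

# Usage:
#   assemble_merge_count BA*_sortedByCoord.out.bam
```

Running StringTie per sample before merging follows the workflow linked above rather than the single combined BAM used in the question.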
