I read the GTF parsing code in Tophat before. It reads exon , cds and transcript info. they add a lot of statements to maintain strong compatibility to different file format and deal with special cases, which made it not a enjoyable experience to read the source code. :P
I've moved this to an answer since it's much more informative than mine, given that you've actually slogged through the source code! Thanks doing so and mentioning what's actually in there!
Ah, the ones from UCSC and Ensembl that I have locally don't have those, though one can add pretty much anything in that field. No, GFF and GTF aren't the same (they're similar). GFF, in fact, has multiple variants itself. How tophat parses the annotation file isn't documented anywhere to my knowledge, so unless you want to bother going through the source code, just run the gtf_to_fasta command (that's how it creates the transcriptome reference) and have a look at it. At least in the case of the annotation files I've used, it looks at exon features.
i'll have a look. I can try removing the lines with certain features(transcript/exon) and see if it still builds it. But I too guess that it parses the exons. And since it worked for your GTF with no feature called transcript, it surely picks up exons only. Would you convert your comment to an answer so that I can accept it?
Sure, done. BTW, please post back if you do play around with things and let us know what happens with the transcript features. It's always nice when one can at least piece together some documentation with google!
I removed the lines corresponding to exon and transcript and compared the fasta generated from original GTF with the reduced GTF. The files are same; It still seems to work!! Can't understand how!! Perhaps it is parsing the CDS and UTR features as an alternative. Would need to look at source code (haven't worked with c++ in a long time :P)
If I have finished the Tophat step (get the accpeted_hits.bam), and I have looked at the corresponding area by the command samtools view input.bam "Chr1:10000-20000" and I found there are some reads at this area. Then if I want to look at the expression some part of genome (we called it pseudo gene, however, this pseudo gene could be real gene, but which is not annotated by Reference genome e.g. Ensembl), how could I do it?
Could I directly add this pseudo gene to the GTF/GFF file and then execute the Cuffdiff/edgeR? or I need to re-do the mapping step based on new GTF/GFF file?
Which feature I need pay attention? I add 'exon', 'CDS' and 'transcript', is that all?
Thank you!
ADD REPLY
• link
updated 5.1 years ago by
Ram
44k
•
written 9.1 years ago by
super
▴
60
0
Entering edit mode
It depends on whether your region of interest is similar to an annotated gene. If it is, then tophat2 would show fewer reads aligning than reality. If your region has low similarity then just rerun cufflinks/cuffdiff and/or featureCounts/edgeR.
If you're using a GTF file then only 'exon' matters. For a GFF you'll need 'gene' and 'transcript' too.
But why I get noting mapped when I ran htseq-count? aligned.bam is the BAM file which represents the sample mapped to sheep genome by Tophat.
Thank you!
ADD REPLY
• link
updated 5.1 years ago by
Ram
44k
•
written 8.9 years ago by
super
▴
60
0
Entering edit mode
One file has chromosomes named "1", the other "chr1". featureCounts will typically deal with that, but htseq-count won't.
Aside from that, whether the first read is included will depend on the MAPQ filtering settings. Further, have a look at the strand settings and compare that to the strand of the feature.
GTF files don't normally have a transcript feature, just
exon
,CDS
,start_codon
, andstop_codon
. Are you sure you're not talking about a GFF file?Atleast GENCODE GTFs have it. The one that I am using right now is from ENSEMBL. It also has the following features:
CDS, start_codon, transcript, gene, exon, Selenocysteine, UTR, stop_codon
BTW I thought GFF and GTF are synonymous because it is always written
GFF/GTF
.