Hi all. I am using StringTie to assemble a transcriptome from my RNA-Seq data. My question: if I use RefSeq as the reference annotation, downloaded from the UCSC Genome Browser website, will the CDS, start_codon and stop_codon records in that GTF file affect transcriptome assembly? For example, would StringTie treat a CDS/start_codon/stop_codon record as a new exon, when these features are actually just parts of exons?
The gtf file looks like this:
chr1 hg19_refGene start_codon 67000042 67000044 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene CDS 67000042 67000051 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene exon 66999639 67000051 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene CDS 67091530 67091593 0.000000 + 2 gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene exon 67091530 67091593 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
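One way to check the concern for yourself is to confirm that every CDS and start_codon record lies inside an exon of the same transcript. The sketch below does this for the five records quoted above. The function names (`parse_gtf_line`, `containment_report`, `exon_only`) are made up for illustration and are not part of StringTie. As far as I know, StringTie builds reference transcript structures from the exon records, but treat that as an assumption and check the manual; if in doubt, `exon_only` shows how to hand it a GTF containing exon records only.

```python
import re

# The five records quoted in the question. Real GTF is tab-delimited; the
# copy pasted here uses spaces, so the parser splits on any whitespace.
SAMPLE_GTF = '''\
chr1 hg19_refGene start_codon 67000042 67000044 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene CDS 67000042 67000051 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene exon 66999639 67000051 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene CDS 67091530 67091593 0.000000 + 2 gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene exon 67091530 67091593 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
'''


def parse_gtf_line(line):
    """Split one GTF record into its columns and parse the attribute column."""
    cols = line.rstrip("\n").split(None, 8)
    attrs = dict(re.findall(r'(\S+) "([^"]*)";', cols[8]))
    return {
        "feature": cols[2],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "transcript_id": attrs["transcript_id"],
    }


def containment_report(records):
    """For each non-exon record, report whether it lies inside an exon
    of the same transcript."""
    exons = {}
    for rec in records:
        if rec["feature"] == "exon":
            exons.setdefault(rec["transcript_id"], []).append(
                (rec["start"], rec["end"])
            )
    report = []
    for rec in records:
        if rec["feature"] == "exon":
            continue
        inside = any(
            start <= rec["start"] and rec["end"] <= end
            for start, end in exons.get(rec["transcript_id"], [])
        )
        report.append((rec["feature"], rec["start"], rec["end"], inside))
    return report


def exon_only(lines):
    """Keep only exon records (a plain filter, not a StringTie option)."""
    return [line for line in lines if line.split(None, 8)[2] == "exon"]


records = [parse_gtf_line(line) for line in SAMPLE_GTF.splitlines()]
for feature, start, end, inside in containment_report(records):
    status = "inside an exon" if inside else "NOT inside an exon"
    print(feature, start, end, status)
```

For the records above, all three CDS/start_codon features fall inside an exon of NM_032291, so they add no new exon boundaries of their own.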
Thanks for the comment. This is very important to me. Could you tell me where I can get the RefSeq and UCSC GTF files in this format? I need all of RefSeq, UCSC and Gencode.
Is this a problem? Could it mean that every transcript ends up assigned to a different gene?
As far as I understand, a GTF in which the gene_id is just a copy of the transcript_id (as in your example) is a problem unless every gene has only one transcript. That is clearly not the case in humans and other higher vertebrates.
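A quick way to see whether a GTF has this symptom is to group transcript IDs by gene ID and count how many "genes" are named after their only transcript. This is a minimal sketch; `summarize_ids` is a made-up helper name, and the second input in the usage example uses invented IDs (GENE_A, TX_A1, TX_A2) purely to show the contrast.

```python
import re
from collections import defaultdict


def summarize_ids(gtf_lines):
    """Group transcript_id values by gene_id and count the genes whose
    only transcript carries the same ID as the gene itself."""
    transcripts_by_gene = defaultdict(set)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split(None, 8)
        if len(cols) < 9:
            continue  # not a complete GTF record
        attrs = dict(re.findall(r'(\S+) "([^"]*)";', cols[8]))
        gene_id = attrs.get("gene_id")
        transcript_id = attrs.get("transcript_id")
        if gene_id is None or transcript_id is None:
            continue
        transcripts_by_gene[gene_id].add(transcript_id)
    same_id = sum(
        1 for gene, transcripts in transcripts_by_gene.items()
        if transcripts == {gene}
    )
    return {
        "genes": len(transcripts_by_gene),
        "genes_named_after_their_only_transcript": same_id,
    }


# Record from the question: gene_id and transcript_id are identical.
ucsc_style = [
    'chr1 hg19_refGene exon 66999639 67000051 0.000000 + . '
    'gene_id "NM_032291"; transcript_id "NM_032291";',
]
# Invented records showing one gene with two distinct transcripts.
grouped_style = [
    'chr1 example exon 100 200 . + . gene_id "GENE_A"; transcript_id "TX_A1";',
    'chr1 example exon 100 250 . + . gene_id "GENE_A"; transcript_id "TX_A2";',
]

print(summarize_ids(ucsc_style))
print(summarize_ids(grouped_style))
```

If the second count equals the first on a whole-genome annotation, the file almost certainly has no real gene-level grouping, and gene-level summaries built from it will not be meaningful.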
As for the GTF, if your concern is covering all known and predicted genes, consider the Gencode Comprehensive set, which has better coverage than RefSeq alone (here is a related article) and would be as good as combining RefSeq and UCSC.
In fact, if you visit the UCSC Genome Browser, Gencode is now the default gene track.
Thank you! I see the problem now: if I do differential expression analysis, I would get multiple expression values (e.g. FPKM values) for a single gene, which would make the DE result very misleading. As for the GTF files, your suggestion is that the Gencode Comprehensive set already covers all known/predicted genes, so I don't need to consider the RefSeq and UCSC annotations anymore. Is that right?
Yes, that's right. I too have wasted time using a GTF from UCSC and then facing the same problem: the transcript isoform-level results from Cufflinks didn't make sense.
Anyway, use the Gencode Comprehensive set and you will get better coverage.