Question

Do CDS, start_codon and stop_codon features in a GTF affect transcriptome assembly by StringTie?

8.6 years ago

syrttgump ▴ 50

Hi all. I am using StringTie to assemble a transcriptome from my RNA-Seq data. If I use RefSeq as the reference annotation, downloaded from the UCSC Genome Browser website, will the CDS, start_codon and stop_codon records in that GTF file affect the assembly? For example, would StringTie treat a CDS/start_codon/stop_codon record as a new exon, even though these features are just parts of exons?

The gtf file looks like this:

chr1 hg19_refGene start_codon 67000042 67000044 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";

chr1 hg19_refGene CDS 67000042 67000051 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";

chr1 hg19_refGene exon 66999639 67000051 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";

chr1 hg19_refGene CDS 67091530 67091593 0.000000 + 2 gene_id "NM_032291"; transcript_id "NM_032291";

chr1 hg19_refGene exon 67091530 67091593 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
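The premise that these features are just parts of exons can be checked on the five lines above. A minimal sketch in plain Python, with the coordinates copied by hand from the sample (the variable and function names are only for illustration); it says nothing about what StringTie does internally, it only confirms that each CDS/start_codon interval falls inside an exon interval of the same transcript:

```python
# (feature, start, end) for the five sample records of NM_032291.
records = [
    ("start_codon", 67000042, 67000044),
    ("CDS", 67000042, 67000051),
    ("exon", 66999639, 67000051),
    ("CDS", 67091530, 67091593),
    ("exon", 67091530, 67091593),
]

# Exon intervals of the transcript.
exons = [(start, end) for feature, start, end in records if feature == "exon"]


def inside_an_exon(start, end):
    """True if [start, end] lies completely within one exon interval."""
    return any(ex_start <= start and end <= ex_end for ex_start, ex_end in exons)


# Containment result for every non-exon record.
contained = {
    (feature, start, end): inside_an_exon(start, end)
    for feature, start, end in records
    if feature != "exon"
}

for (feature, start, end), ok in contained.items():
    print(feature, start, end, "inside an exon:", ok)
```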

RNA-Seq StringTie annotation gtf Assembly • 3.7k views

ADD COMMENT • link updated 8.6 years ago by Amitm ★ 2.3k • written 8.6 years ago by syrttgump ▴ 50

score 4 · Accepted Answer · 2016-05-26

8.6 years ago

Amitm ★ 2.3k

Hi, please do not use this GTF file from UCSC. A GTF file carries not only the individual exons of each transcript isoform but also which transcripts originate from which gene. Notice that gene_id and transcript_id are the same in the GTF file above, so any transcript assembler you use (like StringTie) would not be able to infer the transcript <-> gene relationship.

Compare this GTF structure from Ensembl:

1   protein_coding  exon    874655  874840  .   +   .   gene_id "ENSG00000187634"; transcript_id "ENST00000455979"; exon_number "1"; gene_name "SAMD11"; gene_biotype "protein_coding"; transcript_name "SAMD11-004"; exon_id "ENSE00002715021";
1   protein_coding  CDS 874655  874840  .   +   2   gene_id "ENSG00000187634"; transcript_id "ENST00000455979"; exon_number "1"; gene_name "SAMD11"; gene_biotype "protein_coding"; transcript_name "SAMD11-004"; protein_id "ENSP00000412228";
1   protein_coding  exon    876524  876686  .   +   .   gene_id "ENSG00000187634"; transcript_id "ENST00000455979"; exon_number "2"; gene_name "SAMD11"; gene_biotype "protein_coding"; transcript_name "SAMD11-004"; exon_id "ENSE00003477353";
1   protein_coding  CDS 876524  876686  .   +   2   gene_id "ENSG00000187634"; transcript_id "ENST00000455979"; exon_number "2"; gene_name "SAMD11"; gene_biotype "protein_coding"; transcript_name "SAMD11-004"; protein_id "ENSP00000412228";

Hope this is clear. Please use a GTF from Ensembl or Gencode.
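A quick way to see the difference on your own file is to compare the two attributes line by line. This is a minimal sketch, assuming tab-separated GTF lines with attributes written as `key "value";`. The helper names are made up for illustration, and the Ensembl attribute column is shortened here:

```python
import re

# Matches attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(attr_field):
    """Turn the 9th GTF column into a dict of attribute name -> value."""
    return dict(ATTR_RE.findall(attr_field))


def ids_are_identical(gtf_line):
    """True if gene_id and transcript_id carry the same value on this line."""
    fields = gtf_line.rstrip("\n").split("\t")
    attrs = parse_attributes(fields[8])
    return attrs["gene_id"] == attrs["transcript_id"]


# One exon line from the UCSC sample in the question.
ucsc_line = "\t".join([
    "chr1", "hg19_refGene", "exon", "66999639", "67000051", "0.000000", "+", ".",
    'gene_id "NM_032291"; transcript_id "NM_032291";',
])

# One exon line from the Ensembl sample (attributes truncated for brevity).
ensembl_line = "\t".join([
    "1", "protein_coding", "exon", "874655", "874840", ".", "+", ".",
    'gene_id "ENSG00000187634"; transcript_id "ENST00000455979"; exon_number "1";',
])

print(ids_are_identical(ucsc_line))     # True
print(ids_are_identical(ensembl_line))  # False
```

If every line of a file gives True, the file carries no gene-level grouping of isoforms.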

ADD COMMENT • link 8.6 years ago by Amitm ★ 2.3k

Thanks for the comment. This is very important to me. Could you tell me where I can get the RefSeq and UCSC GTF files in this format? I need all of RefSeq, UCSC and Gencode.

ADD REPLY • link 8.6 years ago by syrttgump ▴ 50

Is this a problem? Is it possible that different transcripts belong to different genes?

ADD REPLY • link 8.6 years ago by syrttgump ▴ 50

As far as I understand, not having gene IDs that are distinct from the transcript IDs in a GTF is a problem unless there is only one transcript per gene. This is clearly not the case in humans and other higher vertebrates.
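To illustrate the point, here is a small sketch that groups transcript IDs by gene ID. The IDs are hypothetical placeholders, not real accessions, and the function name is only for illustration. With gene_id equal to transcript_id, two isoforms of one gene look like two separate single-transcript genes:

```python
from collections import defaultdict


def transcripts_per_gene(pairs):
    """Map each gene_id to the set of transcript_ids seen with it."""
    genes = defaultdict(set)
    for gene_id, transcript_id in pairs:
        genes[gene_id].add(transcript_id)
    return dict(genes)


# Hypothetical (gene_id, transcript_id) pairs for two isoforms of one gene.
ucsc_style = [("TX_A", "TX_A"), ("TX_B", "TX_B")]         # gene_id == transcript_id
ensembl_style = [("GENE_1", "TX_A"), ("GENE_1", "TX_B")]  # shared gene_id

print(transcripts_per_gene(ucsc_style))
print(transcripts_per_gene(ensembl_style))
```

The first grouping has two "genes" with one transcript each; the second has one gene with both isoforms.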

As for the GTF, if what you want is all known/predicted gene information, consider the Gencode Comprehensive set, which has better coverage than RefSeq alone (here is a related article) and would be as good as combining RefSeq and UCSC.

In fact if you visit UCSC Genome Browser, Gencode is now the default gene track.

ADD REPLY • link 8.6 years ago by Amitm ★ 2.3k

Thank you! I see the problem now: if I do differential expression analysis, I would get multiple expression values (e.g. FPKM) for what is really a single gene, which would make the DE result very misleading. As for the GTF files, your suggestion is that the Gencode Comprehensive set already covers all known/predicted genes, so I don't need to consider the RefSeq and UCSC annotations anymore. Is that right?

ADD REPLY • link 8.6 years ago by syrttgump ▴ 50

Yes, that's right. I too have wasted time using a GTF from UCSC and then faced the same problem: the transcript isoform level results from Cufflinks didn't make sense.

Anyway, use the Gencode Comprehensive set and you will have better coverage.

ADD REPLY • link 8.6 years ago by Amitm ★ 2.3k