Question

How to extract CDS and protein sequences from a Cufflinks transcripts.gtf file?

6

11.0 years ago

Rahul Sharma ▴ 660

Hi,

I am using TopHat2 and Cufflinks for gene/transcript identification. I mapped the RNA-Seq reads to a reference genome and then used Cufflinks to generate the transcripts.gtf file. I generated the transcript sequences using the following command:

gffread -w transcripts.fa -g Masked_for_Tophat.fa transcripts.gtf

Since the Cufflinks transcripts.gtf file contains no CDS information, it is not possible to extract CDS sequences from it directly. I found one tool, TransDecoder, which can predict CDS from input transcripts. Does anyone know how to generate CDS/protein sequences from a Cufflinks transcripts.gtf file?

In another analysis I want to train Augustus using this mapping information. For training Augustus, I need CDS/protein sequences. So far I have only used Augustus for gene prediction with intron/exon hints, as mentioned here. I would appreciate your suggestions on this.

Best

cufflinks rna-seq cds • 14k views

updated 9.6 years ago by wanziyi89 ▴ 60 • written 11.0 years ago by Rahul Sharma ▴ 660

1

Hi @R@hul, on the TransDecoder page, there is a separate section that deals with your exact situation, i.e. converting a cufflinks.gtf file into GFF3, extracting the transcripts, finding the longest ORFs (reported both as CDS and PEP sequences) and then generating a new GFF3 which reports these coding regions in the context of the genome.

Here is a link to the relevant section: Starting from a genome-based transcript structure GTF file
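To give a feel for the "longest ORF" step, here is a minimal Python sketch. It is not TransDecoder's method, which does considerably more (for example, scoring coding potential). It only scans the three forward frames of a transcript for the longest ATG-to-stop stretch and translates it with the standard genetic code. The names `longest_orf` and `translate` are made up for illustration, and the reverse strand and partial ORFs are ignored for brevity.

```python
from itertools import product

# Standard genetic code, built in TCAG order (assumption: nuclear code applies).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return the longest complete ORF (ATG ... stop) on the forward strand."""
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOPS:
                orf = seq[start:i + 3]  # include the stop codon
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


def translate(cds):
    """Translate a CDS codon by codon; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X")
                   for i in range(0, len(cds) - 2, 3))


if __name__ == "__main__":
    transcript = "CCATGAAATTTGGGTAACC"  # toy sequence, not real data
    cds = longest_orf(transcript)
    print(cds, translate(cds))
```

For real data, the TransDecoder workflow linked above is the better route, since it also maps the predicted coding regions back onto the genome.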

10.9 years ago by Vivek Krishnakumar ▴ 400

0

Hi, I would like to know whether you have figured out how to annotate a transcripts.gtf file generated by Cufflinks.

9.7 years ago by GouthamAtla 12k

score 2 · Answer 1 · 2015-05-05

9.6 years ago

wrf ▴ 70

I'm not sure there is a one-step solution to that. The PASA pipeline includes a script called "cufflinks_gtf_genome_to_cdna_fasta.pl" that extracts transcripts from a Cufflinks GTF file:

http://pasapipeline.github.io

CDS/peptides can be generated from the transcripts as suggested above with TransDecoder.

0

Thanks, this answer helped me a lot even though my problem was slightly different. Just as a note: the FASTA header in this script's output includes both the transcript_id (TCONS) and the gene_id (XLOC) from the Cufflinks .gtf file.
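If you need a transcript-to-gene table from those headers, something like the following Python sketch may help. I am assuming the IDs look like `TCONS_<digits>` and `XLOC_<digits>`; the exact header layout (separator, order) is not something I have verified, so the code searches for the two patterns anywhere in the header line. The function names and the example header are hypothetical.

```python
import re

# Assumed ID patterns; adjust if your IDs look different.
TRANSCRIPT_RE = re.compile(r"TCONS_\d+")
GENE_RE = re.compile(r"XLOC_\d+")


def parse_ids(header):
    """Return (transcript_id, gene_id) from a FASTA header; None if an ID is missing."""
    t = TRANSCRIPT_RE.search(header)
    g = GENE_RE.search(header)
    return (t.group(0) if t else None, g.group(0) if g else None)


def transcript_to_gene(fasta_lines):
    """Build a {transcript_id: gene_id} dict from the header lines of a FASTA file."""
    mapping = {}
    for line in fasta_lines:
        if line.startswith(">"):
            tid, gid = parse_ids(line)
            if tid is not None:
                mapping[tid] = gid
    return mapping


if __name__ == "__main__":
    # Hypothetical two-record FASTA, only to show the call.
    example = [">TCONS_00000001 XLOC_000001", "ATGAAATTTGGGTAA",
               ">TCONS_00000002 XLOC_000001", "ATGCCCTAG"]
    print(transcript_to_gene(example))
```

With a real file you would pass the open file handle instead of the example list, e.g. `transcript_to_gene(open("transcripts.fa"))`.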

7.9 years ago by Dan Powell • 0

score 0 · Answer 2 · 2015-05-06

9.6 years ago

wanziyi89 ▴ 60

Hi,

Can TransDecoder annotate the 5' UTR and 3' UTR as well?

regards,
