Hi,
I am using TopHat2 and Cufflinks for gene/transcript identification. I mapped the RNA-Seq reads to a reference genome and then used Cufflinks to generate the transcripts.gtf file. I generated the transcript sequences using the following command:
gffread -w transcripts.fa -g Masked_for_Tophat.fa transcripts.gtf
The Cufflinks transcripts.gtf file contains no CDS information, so it is not possible to extract CDS sequences from it directly. I found one tool, TransDecoder, which can predict CDS from input transcripts. Does anyone know how to generate CDS/protein sequences from a Cufflinks transcripts.gtf file?
In another analysis, I want to train Augustus using this mapping information. For training Augustus, I need CDS/protein sequences. So far I have only used Augustus for gene prediction with intron/exon hints, as mentioned here. I would appreciate your suggestions on this.
Best
Hi @R@hul, on the TransDecoder page there is a separate section that deals with your exact situation, i.e. converting a cufflinks.gtf file into GFF3, extracting the transcripts, finding the longest ORFs (reported both as CDS and PEP sequences), and then generating a new GFF3 which reports these coding regions in the context of the genome. Here is a link to the relevant section: Starting from a genome-based transcript structure GTF file
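To make the "finding the longest ORFs" step a bit more concrete, here is a minimal Python sketch of the idea. It is only an illustration and not how TransDecoder actually works (TransDecoder also scores candidates for coding potential, among other things). The function names longest_orf and translate are made up for this example, and the sketch assumes the standard genetic code, scans only the forward strand of each transcript, and requires an ATG start and an in-frame stop codon.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def longest_orf(seq):
    """Return the longest ATG..stop ORF on the forward strand, or '' if none.

    Hypothetical helper for illustration: the returned sequence includes
    the stop codon, and ORFs without an in-frame stop are ignored.
    """
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None  # position of the first ATG since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and CODON_TABLE.get(codon) == "*":
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


def translate(cds):
    """Translate a CDS into a protein string, dropping the trailing stop.

    Codons with ambiguous bases (e.g. N) are translated as 'X'.
    """
    protein = "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3)
    )
    return protein.rstrip("*")


if __name__ == "__main__":
    transcript = "CCATGGCTTAAGG"  # toy sequence, not real data
    cds = longest_orf(transcript)
    print(cds, translate(cds))
```

For real data you would loop over the records in transcripts.fa and write the CDS and protein sequences out as FASTA, but for Augustus training the TransDecoder route above is the safer choice since it also gives you genome coordinates.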
Hi, I would like to know whether you have figured out how to annotate a transcripts.gtf file generated by Cufflinks.