I have discovered new transcripts via TUXEDO RNAseq pipeline. After running cuffmerge I have got a GTF file with genomic coordinates (Ensembl build 37) of individual exons of discovered isoforms. I want to verify their existence by PCR approach. Designing isoform - specific primer sets means I need their sequences. Can anybody advise on how to obtain full transcript sequences? Note that I am not worried about getting these sequences directly from RNAseq reads (not sure if that's possible or not; if possible an explanation of how will be appreciated though) but rather want to reconstruct them using GTF annotation and reference genome sequence in high throughput manner. The high throughput aspect is important.
For making a reference sequence from your dataset, search the forum for "BAM consensus sequence". Common tools for this are bcftools consensus and FastaAlternateReferenceMaker from GATK. However, I'll warn you that going this route is likely to be a total nightmare for you, unless you stick only to SNPs (i.e., just forget about including indels or any sort of complex variant.
For making the actual fasta files, tophat actually comes with a convenient program call gtf_to_fasta. It accepts a GTF file and a reference genome and will make a fasta file containing transcript sequences from it. This is typically used by tophat internally when people give it a reference annotation, but there's no reason you can't use it too. I can't say I've ever looked at the output file, so I can't say how conveniently it names the various transcripts.
If you prefer R, you can use GenomicFeatures with the extractTranscriptSeq() command.
Thanks Devon... that's useful... For R I'm a newbie so I'll try the gtf_to_fasta feature of TopHat instead. Fingers crossed, it's going to work. Thanks!
Thanks Devon... that's useful... For R I'm a newbie so I'll try the gtf_to_fasta feature of TopHat instead. Fingers crossed, it's going to work. Thanks!