I work on mosquitoes, which have notoriously bad UTR annotations. I have a bunch of RNA-seq data that I want to use to better define the UTRs of the genes expressed in my tissues. This is important because I want to study the putative promoter regions. As it stands, MANY regions 5' of the annotated transcription start are actually UTR or intron, because the gene model is missing the first exon (an all-UTR exon).
My plan is to use a few de novo transcript assembly programs to come up with putative cDNA transcripts, then map these to the genomes to 'annotate' the gene structures, and finally merge those models into the official GTF annotations. For my specific use, the most important things are the border regions (the start/stop of transcription), but I plan to publish the amended annotations in an effort to make the work more reproducible. And if they are going to be published, I would like the "guts" to be as accurate with regard to splicing as is reasonably possible.
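To make the assembly step concrete, here is a rough sketch of how I picture one run, using Trinity as an example assembler (this is only a sketch under my assumptions: the read file names are placeholders, and the memory flag has changed between Trinity releases):

    # Minimal sketch of one de novo assembly run (file names are placeholders)
    import subprocess

    subprocess.run([
        "Trinity",
        "--seqType", "fq",
        "--left", "tissue_reads_1.fastq",    # assumed paired-end libraries
        "--right", "tissue_reads_2.fastq",
        "--CPU", "8",
        "--max_memory", "20G",               # older Trinity releases use --JM instead
        "--output", "trinity_out",
    ], check=True)
    # assembled transcripts end up in trinity_out/Trinity.fasta

I would repeat this with a couple of other assemblers and settings and pool the resulting transcripts.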
I was planning to use exonerate, but in the last few weeks, I have read about other options such as using GMAP or simply using BLAT.
Is there a consensus on which mapper/method produces the best results at this time?
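For reference, here is roughly how I would run GMAP on the assembled transcripts to get spliced gene models as GFF3 (again just a sketch; the genome and database names are placeholders, and the exonerate/BLAT lines in the comments are the alternatives I mentioned above):

    # Minimal sketch of mapping assembled transcripts back to the genome
    import subprocess

    # one-time genome index build
    subprocess.run(["gmap_build", "-D", "gmap_db", "-d", "mosquito_genome",
                    "genome.fasta"], check=True)

    # align transcripts and write spliced gene models as GFF3
    with open("assembled_transcripts.gff3", "w") as out:
        subprocess.run(["gmap", "-D", "gmap_db", "-d", "mosquito_genome",
                        "-f", "gff3_gene", "-t", "8",
                        "trinity_out/Trinity.fasta"],
                       stdout=out, check=True)

    # the exonerate equivalent would be roughly:
    #   exonerate --model est2genome --showtargetgff yes trinity_out/Trinity.fasta genome.fasta
    # and BLAT would be simply:
    #   blat genome.fasta trinity_out/Trinity.fasta transcripts.psl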
Thanks.
I am not sure if I understood correctly, but isn't it the other way around? That is, you can map the RNA-seq reads to the genome (e.g. using TopHat) and then assemble the transcripts (e.g. using Cufflinks).
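Something like this, roughly (just a sketch; the index and file names are placeholders):

    # Map reads with TopHat, then assemble transcripts with Cufflinks
    import subprocess

    # spliced alignment of the reads, guided by the current annotation
    subprocess.run(["tophat", "-p", "8", "-G", "official.gtf",
                    "-o", "tophat_out",
                    "genome_bowtie_index",
                    "reads_1.fastq", "reads_2.fastq"], check=True)

    # reference-guided assembly that still allows novel transcripts/UTRs (-g)
    subprocess.run(["cufflinks", "-p", "8", "-g", "official.gtf",
                    "-o", "cufflinks_out",
                    "tophat_out/accepted_hits.bam"], check=True)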
You can do that. BUT: Cufflinks has no knowledge of biological information (for example, that splice sites have a probabilistic structure), so it frequently gets things wrong. Also, its rules for linking nearby transcript islands to known genes cause, in my experience, a ton of bizarre results.
I want to assemble the transcripts de novo so that I can try many settings and assemblers to maximize the number of real(ish) transcripts. Then I will use aligners/mappers on the full-length transcripts (which are more reliable than short reads, for obvious reasons), preferably ones with a biological understanding of splice sites, to get the most accurate amendments to the official annotation possible. After that, I will merge the transcript alignment coordinates with the official annotation GTF.
THEN, this merged GTF will serve as the 'reference' GTF when I run tophat/cufflinks/cuffmerge/cuffdiff. Is that clearer?
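One way the merge and the downstream run could look, concretely (a sketch under my assumptions: all file names are placeholders, I use cuffmerge to do the merge itself, and I convert the GMAP GFF3 to GTF with gffread first so cuffmerge will accept it):

    # Minimal sketch of merging the transcript-based models into the official
    # annotation and then using the merged GTF as the reference downstream
    import subprocess

    # convert the transcript alignments to GTF
    subprocess.run(["gffread", "assembled_transcripts.gff3", "-T",
                    "-o", "assembled_transcripts.gtf"], check=True)

    # merge the transcript-based models with the official annotation
    with open("assembly_list.txt", "w") as f:
        f.write("assembled_transcripts.gtf\n")
    subprocess.run(["cuffmerge", "-g", "official.gtf", "-s", "genome.fasta",
                    "-o", "merge_out", "assembly_list.txt"], check=True)

    # merge_out/merged.gtf then serves as the 'reference' GTF for cuffdiff
    subprocess.run(["cuffdiff", "-o", "diff_out", "-L", "tissueA,tissueB",
                    "merge_out/merged.gtf",
                    "tissueA_hits.bam", "tissueB_hits.bam"], check=True)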
I think that in both cases, whether you assemble transcripts with Cufflinks or de novo, you will get false positives. Also, I do not understand how de novo assembly can maximize the number of real(ish) transcripts. If you are interested in promoter regions, you could try to gather CAGE data, and if you are also interested in 3' ends, you could try to gather SAGE data as well. I was reading this paper some time ago in which they use both CAGE and SAGE data for better annotation.