Question

De Novo Transcriptome And Comparative Genomics

13.0 years ago

Am • 0

Hello all, first time posting here. I am looking at doing some comparative genomics with a mammalian transcriptome (no genome available). We have an Illumina GAII 76 bp paired-end library. I have assembled the transcriptome with Trinity, BLASTed it against RefSeq, and annotated it using Blast2GO. I am looking for suggestions (or the best tool) to take me through the next step. I would like to align this transcriptome (using MUSCLE or PRANK) to multiple species. I am wondering whether I need to make a non-redundant consensus transcriptome that can be used for downstream analyses, or whether there is another way. Any advice would be very much appreciated. Thank you in advance.

rna-seq comparative genomics • 6.8k views

updated 13.0 years ago by Vitis ★ 2.6k • written 13.0 years ago by Am • 0

What do you mean by aligning to multiple species? Are these completely assembled transcriptomes? How many transcripts and reads do you have? That will influence which tools can be used.

Reply • 13.0 years ago by Joseph Hughes ★ 3.0k

Thanks for your response. I have ~16 million paired-end reads that have been assembled into ~82,000 contigs (N50 = 870). As for completeness of the transcriptome, this was a de novo assembly, and I believe I have decent coverage. My end goal is to run PAML using the coding sequences of my newly assembled transcriptome, along with those of about 5 other species. To run PAML, the sequences for each species need to be aligned with a program such as MUSCLE or PRANK. My problem is that I may have multiple contigs for one coding sequence in my de novo assembled transcriptome, so alignment to other species' coding sequences may get messy. I was thinking one way around this would be to assemble a consensus transcriptome, with 1 contig per "gene" or coding sequence. But of course, if there is another route, I am open to suggestions.

Reply • 13.0 years ago by Am • 0

score 4 · Answer 1 · 2012-04-20

We have addressed exactly that question in a study that will be published soon. You can find a preprint of the paper on the home page of our program, PAGAN, at http://code.google.com/p/pagan-msa/wiki/PAGAN?tm=6.

Our approach was to use existing reference alignments and trees (e.g. Ensembl GeneTrees), infer the sequence history for the reference alignments and then "insert" new sequences/fragments into the reference alignments by aligning them against the most similar target sequences. Importantly, the target sequences can be either extant sequences or ancestral sequences, the latter being inferred using a phylogeny-aware algorithm similar to that of PRANK.

A big advantage of using reference alignments (and not single reference sequences) is the additional phylogenetic information coming from multiple sequences; this is especially helpful if the new sequences come from a species that has no close relatives with genome sequences available. An additional advantage of using alignments of gene families (such as Ensembl GeneTrees) is that one can separate fragments coming from close paralogues: in addition to aligning the fragments to the reference alignment, PAGAN can connect fragments placed on the same paralogue into longer contigs.

In fact, when starting the project we were thinking of the use of RNA-seq data for comparative evolutionary analyses. As a result, the first version of PAGAN assumed that one always knows which species the data come from and that the phylogenetic positions for the fragments are constrained. Often that is not the case, and we later implemented the necessary functions to search for the optimal placement. This seems to work fine, and PAGAN can also be used for metagenomic studies of fairly large datasets.

In our paper we focus on sequence placement (or alignment extension) and show that PAGAN handles fragments of very different lengths and evolutionary divergence well. We tested PAGAN with DNA and protein data. To some extent it also supports translated alignment. Please contact me if you are interested in knowing more about that.

score 1 · Answer 2 · 2012-04-20

I think you probably want to do a 'mini' assembly: align your de novo contigs to the proteins of a closely related taxon that is well annotated, then put your contigs together around each protein. This would help in two ways: it assembles the contigs that come from the same gene based on the protein 'reference', i.e. it deals with the fragmentation of the de novo assembly, and it collapses the redundant contigs for the same gene, which is exactly what you want. There might be existing pipelines to do this, but you can always glue tools like blastx, cap3, etc. together with Perl and build your own workflow.
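
As a rough illustration of the grouping step described above, here is a minimal Python sketch (Python rather than Perl, purely for illustration). It assumes you have already run blastx of your contigs against the reference proteins and saved the standard 12-column tabular output (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function names and the example hits are hypothetical, and picking the top-scoring contig per protein is only one simple way to collapse redundancy; it does not merge fragments, so each group could instead be handed to an assembler such as cap3.

```python
import csv
import io
from collections import defaultdict


def best_hit_per_contig(blast_tsv):
    """Map each contig to its highest-bitscore protein hit.

    blast_tsv is any iterable of lines in BLAST tabular (outfmt 6) format.
    Returns {contig: (protein, bitscore, alignment_length)}.
    """
    best = {}
    for row in csv.reader(blast_tsv, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        contig, protein = row[0], row[1]
        aln_len, bitscore = int(row[3]), float(row[11])
        if contig not in best or bitscore > best[contig][1]:
            best[contig] = (protein, bitscore, aln_len)
    return best


def group_contigs_by_protein(best):
    """Group contigs that share the same best-hit reference protein."""
    groups = defaultdict(list)
    for contig, (protein, bitscore, aln_len) in best.items():
        groups[protein].append((contig, bitscore, aln_len))
    for hits in groups.values():
        # Highest bitscore first; contig name breaks ties deterministically.
        hits.sort(key=lambda h: (-h[1], h[0]))
    return dict(groups)


def pick_representatives(groups):
    """Choose one contig per protein: the one with the highest bitscore."""
    return {protein: hits[0][0] for protein, hits in groups.items()}


# Hypothetical blastx hits, made up for this example.
example = io.StringIO(
    "contig1\tprotA\t98.0\t200\t4\t0\t1\t600\t1\t200\t1e-100\t400\n"
    "contig1\tprotB\t60.0\t90\t36\t0\t1\t270\t5\t94\t1e-20\t150\n"
    "contig2\tprotA\t95.0\t120\t6\t0\t1\t360\t150\t269\t1e-60\t250\n"
    "contig3\tprotB\t97.0\t150\t5\t0\t1\t450\t1\t150\t1e-80\t300\n"
)

best = best_hit_per_contig(example)
groups = group_contigs_by_protein(best)
representatives = pick_representatives(groups)
print(representatives)
```

In this made-up example, contig1 and contig2 both end up under protA, so they are candidates either for merging (if they cover different parts of the protein) or for collapsing to a single representative (if they are redundant). The subject coordinates (sstart, send) in the same table can help tell those two cases apart.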