merged transcripts from RNA de novo assembly to create a reference transcriptome
10.1 years ago
teixe005 ▴ 30

Hello,

I have created de novo assemblies from RNA-seq reads using Velvet/Oases for different subjects at several time points. For every subject I have a merged file of ~100,000 de novo transcripts, produced by merging assemblies built with different k-mer sizes. My ultimate goal is to perform differential expression analysis on this data set. The next step is to create a reference transcriptome that contains all the transcripts from all subjects and time points, with no ambiguity, so I can map the de novo transcripts to that reference and quantify expression.

I am looking for a program that will merge all the transcripts from all subjects and time points into a transcriptome that contains just one copy of each transcript and does not miss any of the de novo transcripts that were found. Any suggestions? Is cd-hit a good option?

Thank you in advance for your help. I really appreciate it.
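
Whatever clustering tool ends up being used, the per-subject files first have to be pooled into one FASTA file with unique sequence names. A minimal Python sketch of that step, assuming plain FASTA input; the sample labels and the `label|name` naming scheme are hypothetical, and only exact duplicates (on either strand) are collapsed, which is much weaker than real clustering:

```python
from typing import Dict, Iterable, Iterator, List, Tuple

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def pool_assemblies(assemblies: Dict[str, Iterable[str]]) -> List[Tuple[str, str]]:
    """Pool records from several assemblies, keeping one copy of each exact sequence.

    `assemblies` maps a sample label (hypothetical, e.g. "subjA_t1") to FASTA lines.
    Names are prefixed with the label so they stay unique after pooling.
    """
    seen = set()
    pooled = []
    for label, lines in assemblies.items():
        for header, seq in parse_fasta(lines):
            seq = seq.upper()
            # Strand-independent key: a sequence and its reverse complement match.
            key = min(seq, reverse_complement(seq))
            if key in seen:
                continue
            seen.add(key)
            name = header.split()[0]
            pooled.append((f"{label}|{name}", seq))
    return pooled
```

In practice each value in `assemblies` could simply be an open file handle. Near-identical transcripts, isoforms and fragments still need a clustering step afterwards.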

Assembly RNA-Seq • 8.0k views

Are you working with a species that does not have a reference genome sequence?


I'm working with the equine genome. There is a reference genome but we know this reference has problems with assembly and annotation. This is the reason why we performed a de novo assembly of the RNA reads (using velvet/Oases), in addition to the reference based one (using Bowtie/TopHat for mapping followed by Cufflinks).

10.1 years ago

Corset [software | paper] is much better at merging transcriptome assemblies than cd-hit-est. Strictly speaking, it is a tool for clustering contigs in a transcriptome assembly, but that makes it useful for merging, as demonstrated in the paper.

10.1 years ago
Ram 44k

When I worked on my de novo transcriptome, we used cd-hit-est to cluster the merged assemblies from Velvet/Oases. It is one way to go (the only one I know, in fact), but I was never completely comfortable with it. The technique is self-referential, so validation feels a bit quirky.
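
For reference, a cd-hit-est run of this kind could be set up roughly as below. This is only a sketch: the file names and thread count are placeholders, not the settings we actually used. The `-n` word size has to be compatible with the `-c` identity threshold, and the ranges in `word_size_for` follow the CD-HIT user guide as I understand it, so check them against your installed version.

```python
from typing import List


def word_size_for(identity: float) -> int:
    """Pick a cd-hit-est word size (-n) compatible with the identity threshold (-c)."""
    if not 0.75 <= identity <= 1.0:
        raise ValueError("identity threshold must be between 0.75 and 1.0")
    if identity >= 0.95:
        return 10
    if identity >= 0.90:
        return 8
    if identity >= 0.88:
        return 7
    if identity >= 0.85:
        return 6
    if identity >= 0.80:
        return 5
    return 4


def cd_hit_est_command(fasta_in: str, fasta_out: str,
                       identity: float, threads: int = 4) -> List[str]:
    """Build the argument list for one cd-hit-est run (nothing is executed here)."""
    return [
        "cd-hit-est",
        "-i", fasta_in,              # pooled input transcripts
        "-o", fasta_out,             # representative sequences after clustering
        "-c", f"{identity:.2f}",     # sequence identity threshold
        "-n", str(word_size_for(identity)),
        "-T", str(threads),
    ]


# Placeholder file names; pass the list to subprocess.run(cmd, check=True) to execute.
cmd = cd_hit_est_command("pooled.fa", "clustered_c95.fa", 0.95)
print(" ".join(cmd))
```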


Thank you very much for your comment. I'll merge them using cd-hit-est. We'll see how it goes.


As RamRS said, it is a little tricky, especially choosing the similarity cutoff for merging the shorter transcripts. Lowering the cutoff merges isoforms and paralogs, while raising it retains spurious contigs. So we ran cd-hit-est at multiple cutoffs and conservatively chose a cutoff at which the number of merged contigs had not yet dropped drastically. Still, this is not "the way" to do it.
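
The cutoff scan can be summarised with a few lines of Python once the contig count at each cutoff is known. This is a sketch of the heuristic only, not a validated method: the 10% `max_drop` threshold is an arbitrary assumption, and the counts in the example are made up for illustration.

```python
from typing import Dict


def choose_cutoff(counts: Dict[float, int], max_drop: float = 0.10) -> float:
    """Return the lowest identity cutoff reached before a drastic drop in contigs.

    counts:   identity cutoff -> number of contigs left after clustering at it.
    max_drop: hypothetical threshold; a step that loses more than this fraction
              of contigs relative to the next stricter cutoff counts as drastic.
    """
    if not counts:
        raise ValueError("no cutoffs given")
    cutoffs = sorted(counts, reverse=True)  # strictest cutoff first
    chosen = cutoffs[0]
    for stricter, looser in zip(cutoffs, cutoffs[1:]):
        if counts[stricter] == 0:
            break
        lost = (counts[stricter] - counts[looser]) / counts[stricter]
        if lost > max_drop:
            break  # the drastic drop starts here, keep the previous cutoff
        chosen = looser
    return chosen


# Made-up contig counts, for illustration only.
example_counts = {0.99: 100000, 0.95: 96000, 0.90: 93000, 0.85: 70000, 0.80: 50000}
print(choose_cutoff(example_counts))
```

The counts themselves can be obtained by counting `>` header lines in each cd-hit-est output file.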


I agree. A similarity cut-off of ~90% is quite stringent, and at 80% the number of contigs fell dramatically, at least in my case. Without prior knowledge of the number of genes in the organism, gauging accuracy can be difficult.

10.0 years ago
h.mon 35k

This paper points to the EvidentialGene pipeline as providing a high quality merged transcriptome. I've used Corset and the results were a bit puzzling and did not follow the manual description, but I did not follow through.
