Question

Optimizing de novo transcriptome assemblies for non-model organisms

2

9.7 years ago

giorgiocasaburi ▴ 90

Hi all,

I have to analyze 24 transcriptomes (TRM) to compare gene expression across different conditions in an animal whose genome has not been annotated. I am considering multiple assemblies followed by a co-assembly to build the "main" TRM. After quality filtering, I was thinking of doing the following:

1. Assemble the 24 libraries (they come from different treatments) using X different assemblers or settings (e.g. Trinity, Velvet, multiple k-mer sizes, etc.). This will give me X × 24 assemblies.
2. Merge the X × 24 assemblies with a co-assembly tool (e.g. CD-HIT-EST, Corset or CAP3). I will therefore end up with one main transcriptome (Main-TRM) representing the animal under study.
3. Perform functional annotation of the Main-TRM against Swiss-Prot, KEGG, GO, etc. using blastx and Blast2GO.
4. Take the unassembled, quality-filtered reads from the 24 libraries (from before step 1, in order to retain the condition variable) and BLAST them individually against the annotated Main-TRM, thereby obtaining the expression information.
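As a rough illustration of why the merging step matters, here is a minimal Python sketch that pools contigs from several assemblies and drops exact duplicates, treating a sequence and its reverse complement as the same contig. This is a naive stand-in and not what CD-HIT-EST, Corset or CAP3 actually do (those cluster or merge by similarity or shared reads); the function names and the input layout are assumptions made for the example, and collapsing reverse complements would not be appropriate for strand-specific libraries.

```python
# Hypothetical helper names; a naive exact-duplicate filter, not a clustering tool.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(seq):
    """Pick one orientation so a contig and its reverse complement compare equal."""
    seq = seq.upper()
    return min(seq, reverse_complement(seq))


def merge_exact_duplicates(assemblies):
    """Pool contigs from several assemblies, keeping the first copy of each sequence.

    `assemblies` maps an assembly name to a list of (contig_id, sequence) pairs.
    Returns a list of (prefixed_id, sequence) pairs.
    """
    seen = set()
    merged = []
    for name in sorted(assemblies):
        for contig_id, seq in assemblies[name]:
            key = canonical(seq)
            if key not in seen:
                seen.add(key)
                merged.append((f"{name}|{contig_id}", seq))
    return merged
```

Real assemblies from different tools or k-mer sizes mostly differ by near-duplicates and partial overlaps rather than exact copies, which is why a dedicated clustering tool is still needed for step 2.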

What do you guys think about this approach? Is it theoretically sound? If not, what should I change?

Thanks a lot in advance,

~Giorgio

Assembly rna-seq • 3.9k views

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by giorgiocasaburi ▴ 90

0

Thanks Reema for your suggestions. Yes, I was planning to diversify the Trinity output by using different parameters.

ADD REPLY • link 9.7 years ago by giorgiocasaburi ▴ 90

0

Also, this should have been a comment, not an "answer". I don't think I can move it for you, just saying.

ADD REPLY • link 9.7 years ago by Madelaine Gogol 5.3k

0

Hi Madelaine,

Thanks for your input. Yes, for step 4 I was thinking of using Bowtie and then using the output (.bam) file to estimate transcript-level abundance for each library with RSEM (RSEM: accurate transcript quantification from RNA-Seq data). I just have to figure out the best way to combine the 24 results for statistical purposes.
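One possible way to combine the per-library results is to collect the RSEM `expected_count` column from each library into a single transcript-by-library matrix, which can then be handed to a differential expression package. The sketch below assumes each library has an RSEM `.isoforms.results` file (tab-separated, with `transcript_id` and `expected_count` columns); the function names and the dictionary layout are hypothetical, chosen only for this example.

```python
import csv


def read_expected_counts(path):
    """Map transcript_id -> expected_count for one RSEM isoforms results file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["transcript_id"]: float(row["expected_count"]) for row in reader}


def build_count_matrix(results_by_library):
    """Combine per-library counts into one matrix.

    `results_by_library` maps a library name to the path of its results file.
    Returns (library_names, {transcript_id: [count per library]}), with the
    counts in the same order as library_names. Transcripts missing from a
    library are given a count of 0.0.
    """
    libraries = sorted(results_by_library)
    per_library = {name: read_expected_counts(results_by_library[name]) for name in libraries}
    transcripts = sorted(set().union(*per_library.values()))
    matrix = {
        transcript: [per_library[name].get(transcript, 0.0) for name in libraries]
        for transcript in transcripts
    }
    return libraries, matrix
```

Since all 24 libraries would be quantified against the same Main-TRM, every file should list the same transcripts; the 0.0 fallback is only a safeguard.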

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by giorgiocasaburi ▴ 90

0

Good point Pyperl, thank you! So you are suggesting that I just merge all the libraries into one file, run the assembly with different assemblers, and then merge those assemblies together. Is that right? I would still prefer to do the co-assembly, both to make use of the multiple assemblers and to reduce redundancy.

ADD REPLY • link 9.7 years ago by giorgiocasaburi ▴ 90

0

Here, I would prefer Trinity for the assembly, as it is a de novo assembler. There is no need to use different assemblers, as that will cost you time and effort. But if you are still curious to compare the output of different assemblers, you can go ahead and compare the diagnostics among the different assemblies.

ADD REPLY • link 9.7 years ago by Renesh ★ 2.2k

0

It's not really about comparison; it's more about having a co-assembly derived from multiple assemblies (e.g. using Trinity but with different k-mer sizes). Several papers suggest a co-assembly step after generating different assemblies, either with several tools or with the same tool run under different parameters (e.g. k-mer size). Hope that makes sense.

ADD REPLY • link 9.7 years ago by giorgiocasaburi ▴ 90

0

I think Trinity has a fixed k-mer size (25), and according to the Trinity developers this is optimal across different transcriptomes.

ADD REPLY • link 9.7 years ago by Renesh ★ 2.2k

0

Yes, I will obviously be using an HPC.

ADD REPLY • link 9.7 years ago by giorgiocasaburi ▴ 90

0

Hi, is there any update on this? Have you reached the annotation part? I have a similar kind of data and would like to know if you have a summary of how it went.

ADD REPLY • link 9.7 years ago by GouthamAtla 12k

0

Hi, not yet. I'm waiting for other data and will update when I finish some of the initial steps.

ADD REPLY • link 9.7 years ago by giorgiocasaburi ▴ 90

0

Hi. I need to do something similar. Let me know how your analysis goes. In my case I have a draft genome and very few annotations.

ADD REPLY • link 9.7 years ago by GouthamAtla 12k

Ram · Answer 1 · 2015-03-06

Hello Giorgio,

In my view, you should give this approach a try. You could also start with a simpler approach: first generate multiple assemblies with different parameters, then compare them on the basis of quality, completeness (by comparing them with an existing or similar genome) and CEGMA score. Just a suggestion from my own experience: if you use Trinity, try different k-mer and coverage parameters.
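One simple quality metric that is often reported when comparing assemblies is the contig N50. The Python sketch below computes it from a FASTA file; it is only an illustration, the function names are made up for this example, and N50 alone says little about transcriptome completeness, so it would complement rather than replace a CEGMA-style check.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # no contigs


def read_fasta_lengths(path):
    """Yield the sequence length of each record in a FASTA file."""
    length = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if length is not None:
                    yield length
                length = 0
            elif length is not None:
                length += len(line)
    if length is not None:
        yield length
```

For example, `n50(read_fasta_lengths("assembly.fasta"))` would give the N50 of one assembly, where `assembly.fasta` is a placeholder file name.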

Best,
Reema

score 0 · Answer 2 · 2015-03-06

I don't know that much about assembly, but your first few steps sound reasonable. On the last step, though:

"4. Taking the non-assembled quality filtered reads from the 24 libraries (before step 1, in order to retain the condition variable) and blast them individually against the annotated Main-TRM, having this way the expression information."

I would just use all the reads for each transcriptome and align them to your main assembly using TopHat or something similar.

Also, I would probably just stick with Trinity. Combining the results of multiple assemblers seems like it could be confusing and introduce more errors.

score 0 · Answer 3 · 2015-03-06

If you want to build one combined transcriptome from the different libraries, you should not assemble each library separately, because that will create a lot of duplicate transcripts, as all of the libraries are from the same organism. You should combine all libraries into one file and perform a single assembly, which saves you the complicated downstream merging. Obviously, this will be computationally expensive, and you will need an HPC to do it.
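Combining the libraries into one file can be as simple as concatenating the read files. A minimal Python sketch is shown below; the function name and the file names in the usage note are placeholders, and for paired-end data it assumes the left and right files are concatenated separately with the libraries in the same order so that mates stay in sync.

```python
import shutil


def concatenate_files(paths, output_path):
    """Append the raw bytes of each input file, in order, to output_path."""
    with open(output_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```

For paired-end libraries this would be called twice, for instance `concatenate_files(left_files, "all_left.fq")` and `concatenate_files(right_files, "all_right.fq")`, with `left_files` and `right_files` listing the 24 libraries in the same order. Each input file should end with a newline, otherwise the last record of one library runs into the first record of the next.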