Question

Best method to assemble low-abundance transcripts?

7.7 years ago

molly77 ▴ 10

Hi,

There are a few parts to this post. I am trying to determine the maximum, or optimal, number of cDNA libraries to load on an Illumina NextSeq flowcell (400M reads). I currently have 12 libraries: two treatments for each of 3 species, and each sample has a biological replicate. This leaves me at 33.33M reads per library. However, I am interested in adding a third biological replicate for one of my species, bringing my library count to 14 and my depth to 28.57M reads/library. I will be generating 6 unique assemblies (3 species, each with two treatments). My questions are:

1. Am I running the risk of missing some low-abundance transcripts by reducing my sequencing depth from 33M to about 28.6M reads per library? Or is that too small a difference to worry about? I will be doing expression studies, so I'd rather prioritize the number of replicates over sequencing depth.
2. Is there a way to optimize the assembly of these potential low-abundance transcripts that may be of interest to me? I am aware of the various transcriptome assemblers. Would it help to run several assemblers at several k-mer lengths and merge the assemblies?
3. I was planning on combining the reads from each replicate to generate the assembly de novo. Are there any consequences or tradeoffs to doing this? My samples are from outbred non-model species. I've reduced as many variables as I can to hopefully decrease potential sequence polymorphism. Is it better to first generate the assembly from one replicate, map the reads from the other replicate to it, and subsequently merge the two assemblies? Or to generate both assemblies de novo and merge those?
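
For reference, the per-library depths quoted above come from splitting the flowcell evenly. Here is a minimal sketch, assuming perfectly balanced pooling (real runs will deviate from this); `reads_per_library` is just an illustrative helper name, not part of any tool:

```python
def reads_per_library(flowcell_reads: float, n_libraries: int) -> float:
    """Evenly split a flowcell's read budget across pooled libraries.

    Assumes perfectly balanced pooling, which is an idealization.
    """
    if n_libraries <= 0:
        raise ValueError("n_libraries must be positive")
    return flowcell_reads / n_libraries


FLOWCELL_READS = 400e6  # figure quoted in the post

for n in (12, 14):
    per_lib = reads_per_library(FLOWCELL_READS, n)
    print(f"{n} libraries: {per_lib / 1e6:.2f}M reads each")

# Relative loss of depth when going from 12 to 14 libraries
drop = 1 - reads_per_library(FLOWCELL_READS, 14) / reads_per_library(FLOWCELL_READS, 12)
print(f"relative drop in depth: {drop:.1%}")
```

Going from 12 to 14 libraries costs roughly one seventh of the depth per library.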

If you're still reading this, thank you!

assembly rna-seq illumina transcriptomics • 1.9k views

updated 7.7 years ago by Matteo Schiavinato ★ 3.7k • written 7.7 years ago by molly77 ▴ 10

score 1 · Answer 1 · 2018-02-08

Question 1

Yes, you are, because whether low-abundance transcripts show up in your data set at all is strongly correlated with sequencing depth. Since you will be doing expression studies later on, though, I wonder why you are interested in low-coverage transcripts in the first place. RNA-seq is unreliable in that territory, so you would be better off not focusing on them, to avoid false positive results.
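
To put rough numbers on the depth effect, here is a minimal sketch. It assumes reads are sampled independently, so the read count for one transcript is roughly Poisson; it ignores transcript length, library bias, and everything else that matters in practice. The transcript fraction and the read threshold are hypothetical values picked only for illustration:

```python
import math


def expected_reads(depth: float, fraction: float) -> float:
    """Expected read count for a transcript making up `fraction` of the library."""
    return depth * fraction


def prob_at_least(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), via the complement of the CDF."""
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf


FRACTION = 1e-6  # hypothetical: transcript yields about one read per million
MIN_READS = 30   # hypothetical threshold for having "enough" reads

for depth in (33.33e6, 28.57e6):
    lam = expected_reads(depth, FRACTION)
    p = prob_at_least(MIN_READS, lam)
    print(f"depth {depth / 1e6:.2f}M: expect {lam:.1f} reads, "
          f"P(>= {MIN_READS} reads) = {p:.2f}")
```

Under this toy model the probability of clearing a fixed read threshold drops with depth, and the drop is largest for transcripts whose expected count sits near the threshold.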

Questions 2 and 3

I suppose the only "method" is to increase your sequencing depth. Rare transcripts have a low chance of appearing in small samples, but the chance increases the deeper you sequence. As for methods, I don't think that switching methods can dramatically increase your chances of assembling them if the read length and the depth are fixed. What helps assembly in the end is read length, because longer reads let you increase the k-mer size. To this end, maybe give tadpole.sh from the BBMap suite a try (I am sure that Brian Bushnell will comment on this post suggesting his tool to you, haha, this time me first :D). This module has a nice feature that lets you "extend" the reads so that you can use bigger k-mer sizes. Worked for me.
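
The link between read length and usable k-mer size is simple arithmetic: a read of length L contains L - k + 1 k-mers, so large k leaves few k-mers per read unless the reads get longer. A minimal sketch follows; the read length and extension amount are hypothetical, and this does not model what tadpole.sh actually does, only why longer reads leave room for bigger k:

```python
def kmers_per_read(read_length: int, k: int) -> int:
    """Number of k-mers in one read: L - k + 1, or 0 if the read is shorter than k."""
    return max(read_length - k + 1, 0)


READ_LENGTH = 150  # hypothetical original read length (bp)
EXTENSION = 50     # hypothetical extension added to each end of a read (bp)

for k in (31, 93, 127):
    before = kmers_per_read(READ_LENGTH, k)
    after = kmers_per_read(READ_LENGTH + 2 * EXTENSION, k)
    print(f"k={k}: {before} k-mers per read before extension, {after} after")
```

With more k-mers per read at a given k, consecutive reads from a low-coverage transcript are more likely to share a k-mer and be joined.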