Dear community,
I am designing a differential gene expression experiment in a non-model animal without a reference genome and I need your expertise.
My experiment has 2 different conditions in 2 different locations, and I will have 3 replicates per condition. It is also a time-course experiment with 4 different time points. Total number of samples = ((2x3)x2)x4 = 48 samples.
My initial idea was to multiplex 12 samples per Illumina HiSeq lane at 50PE (50 bp paired-end).
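For reference, the sample count and the number of lanes implied by this design can be checked with a short Python sketch. The condition, location, and time-point labels are placeholders I made up for illustration; only the counts (2 conditions, 2 locations, 4 time points, 3 replicates, 12 samples per lane) come from the description above.

```python
from itertools import product
from math import ceil

# Placeholder labels (hypothetical); only the number of levels is from the post.
conditions = ["condition_A", "condition_B"]
locations = ["location_1", "location_2"]
time_points = ["t1", "t2", "t3", "t4"]
replicates = [1, 2, 3]

# One library per combination of condition, location, time point and replicate.
samples = [
    f"{cond}_{loc}_{tp}_rep{rep}"
    for cond, loc, tp, rep in product(conditions, locations, time_points, replicates)
]

SAMPLES_PER_LANE = 12  # multiplexing level proposed in the post
n_lanes = ceil(len(samples) / SAMPLES_PER_LANE)

print(f"samples: {len(samples)}, lanes at {SAMPLES_PER_LANE} per lane: {n_lanes}")
```

At 12 samples per lane, the full design fills 4 lanes, so any extra 100PE lanes for the assembly would come on top of that.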
Since I don't have a reference, I have to generate my own de novo transcriptome, and this is where I need your help. I don't know whether 50PE will be enough to generate a good reference transcriptome; maybe it is better to go for 100PE in at least 2 lanes to get better coverage.
What do you think? Of course, I have budget restrictions, so I cannot sequence all 48 samples at 100PE.
Thanks for your help!
Hi Chris, thanks for your point. A higher mapping-back rate sounds normal; I have seen it in my own work as well, but for me the difference in mapping rate between a broad-based reference assembly and a self-generated assembly is usually 10-15%. What you describe can certainly happen too. Could you tell me how you handle cases where a transcriptome assembly is good in terms of annotation but, as you mentioned, not very informative for gene expression analysis?
It never hurts to run test alignments to the reference assembly to get an idea of how serious a problem it may be, but I always suggest running a de novo assembly + QC/filtering + RSEM as well; you can always map back or cluster to the original transcriptome if needed to get a rough idea, though I would also suggest running Trinotate. With modern versions of assemblers (e.g. Trinity) and digital normalization, a typical de novo trx assembly doesn't take very long anymore; the bottleneck is then access to hardware (which in our case isn't an issue).
It really comes down to how much you trust that reference assembly and how well they compare. I have unfortunately run into too many instances where someone suggests using a reference trx assembly from lab X or pub Y, but when we've delved into how the assembly was made we find it problematic in some way (poorly documented methods, older seq technology, poor quality samples, shorter reads, made from SE data, not strand-specific, annotation is old or generated in a hard-to-determine way, filtered in an obscure way, should have used --jaccard_clip, etc). In one case I requested the reference assembly and got back the unigenes from a Trinity assembly (generated via tgicl); when asked, they mentioned that was all that was provided, so isoform information was pretty much lost.
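One quick sanity check on a transcriptome you have been handed is whether it still carries isoform-level identifiers. The Python sketch below counts sequences per gene from the FASTA headers, assuming the headers follow the Trinity naming convention in which the gene ID ends in `_g<number>` and the isoform is appended as `_i<number>` (e.g. `TRINITY_DN100_c0_g1_i1`). The function name and the example headers are made up for illustration; headers that do not match the pattern are simply skipped, so an empty result means the IDs are not in this form.

```python
import re
from collections import Counter

# Assumed Trinity-style identifier: <gene id ending in _g<number>>_i<isoform number>
TRINITY_ID = re.compile(r"^(?P<gene>\S+_g\d+)_i(?P<isoform>\d+)$")


def isoforms_per_gene(fasta_lines):
    """Count how many sequences (isoforms) each Trinity gene ID has."""
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        if not fields:
            continue  # empty header
        match = TRINITY_ID.match(fields[0])
        if match:
            counts[match.group("gene")] += 1
    return counts


# Made-up headers for illustration; in practice, pass an open FASTA file handle.
example = [
    ">TRINITY_DN100_c0_g1_i1 len=500",
    "ACGT",
    ">TRINITY_DN100_c0_g1_i2 len=450",
    "ACGT",
    ">TRINITY_DN200_c1_g1_i1 len=900",
    "ACGT",
]

counts = isoforms_per_gene(example)
multi_isoform_genes = sum(1 for n in counts.values() if n > 1)
print(f"genes: {len(counts)}, genes with more than one isoform: {multi_isoform_genes}")
```

If no gene in a supposedly Trinity-derived assembly has more than one isoform, or no header matches at all, that is a hint the set was collapsed or renamed somewhere along the way and is worth asking about.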