I've got a difficult de novo RNA-seq dataset. Despite removing contamination from my samples and performing the experiment to the best of my ability, I'm still getting a biological coefficient of variation (BCV) of 0.6. I've been told this result is 'bad' and that I can't publish with such a high BCV. Can someone comment on this? The species has a genome that is roughly 3/4 complete; if I align my reads to that instead, my BCV is 0.2 with no prior filtering.
In the de novo assembly, a fair proportion of my genes show variability across replicates and the samples look quite heterogeneous. We even took physiological measurements before the experiment to ensure all samples were at a suitable level of acclimation, and all other parameters were tightly controlled to keep the experiment stringent and fair.
RNA was extracted using a uniform method, and at the same time, to prevent batch effects. I have applied TMM normalisation in edgeR, as some library sizes were double those of others, and used a cut-off of at least 1 CPM in at least 3 samples for a gene to be taken forward for analysis.
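For reference, this is essentially the workflow I'm running (a minimal sketch; `counts` and `group` are placeholders for my actual count matrix and condition factor):

```r
library(edgeR)

# counts: genes x samples matrix; group: condition factor (placeholders)
y <- DGEList(counts = counts, group = group)

# keep genes with at least 1 CPM in at least 3 samples
keep <- rowSums(cpm(y) >= 1) >= 3
y <- y[keep, , keep.lib.sizes = FALSE]

# TMM normalisation (library sizes differ by up to ~2x)
y <- calcNormFactors(y)

# estimate dispersions; BCV is the square root of the common dispersion
y <- estimateDisp(y)
sqrt(y$common.dispersion)
plotBCV(y)
```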
I tried looking at the variable genes with low prior.df values; however, they seem to be random genes and no obvious pattern emerges.
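This is roughly how I pulled those genes out (a sketch continuing from the `y` above; robust estimation gives a per-gene prior.df, and outlier genes get a low one):

```r
# robust estimation flags outlier genes with a low per-gene prior.df
design <- model.matrix(~ group)
y <- estimateDisp(y, design, robust = TRUE)

# rank genes by tagwise BCV and inspect their prior.df
ord <- order(y$tagwise.dispersion, decreasing = TRUE)
head(data.frame(
  gene     = rownames(y)[ord],
  BCV      = sqrt(y$tagwise.dispersion[ord]),
  prior.df = y$prior.df[ord]
))
```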
Any ideas why the de novo assembly has such a high BCV while the genome-aligned version gives a nicely low value? The de novo assembly was built by merging several assemblies (including the genome's model genes) and clustering them into non-redundant transcripts.
Thanks.
Are you quantifying genes with one method and transcripts with the other? I imagine that quantifying at the transcript level with non-optimal methods will lead to higher BCVs, since reads that map ambiguously among near-identical transcripts can get assigned inconsistently between samples.
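If so, one quick sanity check is to collapse the transcript counts to gene level before running edgeR and see whether the BCV drops. A minimal sketch, assuming a transcript count matrix `tx_counts` and a transcript-to-gene lookup `tx2gene` (both hypothetical names here):

```r
# tx2gene: named character vector mapping transcript IDs to gene IDs
gene_counts <- rowsum(tx_counts, group = tx2gene[rownames(tx_counts)])

# then feed gene_counts into the usual edgeR DGEList workflow
```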
Predicted gene models in the genome-guided version; EvidentialGene transcripts assembled de novo in the second. Any tips?
My suspicion is that this is some quirk of how the alignments and counts are working with your assembly. I wouldn't know what's going funky there, but that's where you should be looking.