Hello,
I need an advice to how reduce the depth of a sequencing, because when I used the workflow and I had a huge bacterial genome of more than 10000 genes for a genome usually of 6000 genes. This organism was sequenced with Illumina MiSeq, the reads were 300 nt (2x300) so R1 was 1,873,799 and R2 1,873,799.I calculated the coverage was approximately 224x.
I need to reduce it to 100x or near of it. I used trimmomatic and spades to work it. I'm open to any software or tool to have it done. Thank you!
Have you tried diginorm? You're not exactly asking about digital normalization, but that will likely help in a similar manner.
First off, what kind of experiment is this? Single-cell, isolate, resequencing a known organism, etc?
224x is not excessive and should assemble fine with Spades. You can reduce the coverage if you want, but the issue is more likely something else, like low-quality data, given 2x300bp reads... or contamination. I suggest you BLAST the assembly against nt and RefSeq bacterial and see what it hits; also, plot a PCA chart of the tetramers in the assembly and see if you see multiple distinct clouds.
You can also plot a kmer-frequency histogram to look for additional unwanted peaks, using the BBMap package, like this:
Plot the khist file on a log-log scale, and/or look at the peaks file.There should be one major peak at around 224x and few small higher-order peaks. An additional peak below ~224x would indicate contamination.
You can also map to the assembly with BBMap to get a gc-vs-coverage plot:
With contamination, you will generally get two populations of contigs, with very different coverage and possibly different GC as well. If the problem is contamination, it should be fairly easy to fix in this case using mapping or depth-binning.
If not, normalization and more robust quality-trimming may help. So, post back with what you discover.
I also wouldn't reccomend just lowering coverage. Quality filtration and some reduction in excessive coverage regions, yes (local assembly maybe?).
But if you still want to do it, the simplest way when you already have aligned BAM files is to use samtools:
(integer value is the seed, after the dot is the fraction of reads to subsample)