How can I reduce the sequencing depth before assembling a genome?
9.1 years ago
ZnaeW • 0

Hello,

I need advice on how to reduce the depth of a sequencing run. When I ran my workflow, I got a huge bacterial assembly with more than 10,000 genes for a genome that usually has about 6,000. The organism was sequenced on an Illumina MiSeq with 2x300 nt reads; R1 and R2 each contain 1,873,799 reads. I calculated the coverage to be approximately 224x.

I need to reduce it to about 100x. I used Trimmomatic for trimming and SPAdes for assembly. I'm open to any software or tool that can do this. Thank you!

sequencing Assembly genome • 2.6k views

Have you tried diginorm? You're not exactly asking about digital normalization, but that will likely help in a similar manner.


First off, what kind of experiment is this? Single-cell, isolate, resequencing a known organism, etc?

224x is not excessive and should assemble fine with SPAdes. You can reduce the coverage if you want, but the issue is more likely something else, such as low-quality data (a real possibility with 2x300 bp reads) or contamination. I suggest you BLAST the assembly against nt and RefSeq bacterial and see what it hits; also, plot a PCA of the tetramer frequencies in the assembly and check whether you see multiple distinct clouds.

You can also plot a kmer-frequency histogram to look for additional unwanted peaks, using the BBMap package, like this:

kmercountexact.sh in1=read1.fq in2=read2.fq khist=khist.txt peaks=peaks.txt

Plot the khist file on a log-log scale, and/or look at the peaks file. There should be one major peak at around 224x and a few small higher-order peaks. An additional peak below ~224x would indicate contamination.
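If you would rather scan the histogram programmatically than eyeball the plot, something like the following could work. This is only a rough sketch, not what BBMap does internally for the peaks file: it assumes khist.txt is tab-separated with a `#`-prefixed header and depth and count in the first two columns, and the thresholds (`min_depth`, `window`, `min_frac`) are made-up defaults you would need to tune.

```python
def read_khist(lines):
    """Parse depth/count pairs, skipping '#' header lines (assumed layout)."""
    hist = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        hist[int(fields[0])] = int(fields[1])
    return hist


def find_peaks(hist, min_depth=5, window=3, min_frac=0.05):
    """Return depths that are strict local maxima of the histogram.

    min_depth skips the low-depth spike caused by sequencing errors,
    window is how many neighbouring depths on each side must be lower,
    min_frac drops peaks smaller than this fraction of the tallest one.
    All three are hypothetical defaults, not BBMap parameters.
    """
    depths = sorted(d for d in hist if d >= min_depth)
    if not depths:
        return []
    top = max(hist[d] for d in depths)
    peaks = []
    for d in depths:
        count = hist[d]
        if count < min_frac * top:
            continue
        neighbours = [hist.get(n, 0)
                      for n in range(d - window, d + window + 1) if n != d]
        if all(count > other for other in neighbours):
            peaks.append(d)
    return peaks
```

Usage would be along the lines of `find_peaks(read_khist(open("khist.txt")))`. A clean isolate should give one dominant depth plus its multiples; a second, unrelated depth is the thing to look for.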

You can also map to the assembly with BBMap to get a gc-vs-coverage plot:

bbmap.sh ref=assembly.fa in1=read1.fq in2=read2.fq covstats=covstats.txt

With contamination, you will generally get two populations of contigs, with very different coverage and possibly different GC as well. If the problem is contamination, it should be fairly easy to fix in this case using mapping or depth-binning.

If not, normalization and more robust quality-trimming may help. So, post back with what you discover.
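To check the covstats output for two coverage populations without plotting, a small script can flag contigs whose depth is far from the bulk of the assembly. This is a sketch under assumptions: it expects a tab-separated file whose header line includes the columns `ID`, `Avg_fold`, `Length` and `Ref_GC` (check your own covstats.txt), and the factor-of-two cutoff is arbitrary.

```python
def parse_covstats(lines):
    """Read contig id, average fold, length and GC from a covstats table.

    Column names are assumed; the first non-empty line is taken as the header.
    """
    header = None
    contigs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if header is None:
            header = [name.lstrip("#") for name in fields]
            continue
        row = dict(zip(header, fields))
        contigs.append({
            "id": row["ID"],
            "fold": float(row["Avg_fold"]),
            "length": int(row["Length"]),
            "gc": float(row["Ref_GC"]),
        })
    return contigs


def weighted_median_fold(contigs):
    """Coverage of the contig containing the midpoint of the assembly length."""
    total = sum(c["length"] for c in contigs)
    running = 0
    for c in sorted(contigs, key=lambda c: c["fold"]):
        running += c["length"]
        if running * 2 >= total:
            return c["fold"]
    raise ValueError("no contigs")


def coverage_outliers(contigs, factor=2.0):
    """IDs of contigs whose coverage differs from the median by more than factor."""
    median = weighted_median_fold(contigs)
    return [c["id"] for c in contigs
            if c["fold"] < median / factor or c["fold"] > median * factor]
```

If the flagged contigs add up to megabases and share a GC range different from the rest, that points to a second organism. Plasmids and repeats can legitimately sit at higher coverage, so a handful of short outliers is not alarming by itself.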


I also wouldn't recommend just lowering coverage. Quality filtering and some reduction of coverage in excessively covered regions, yes (local assembly, maybe?).

But if you still want to do it, the simplest way when you already have aligned BAM files is to use samtools:

samtools view -s 0.1

(the integer part is the seed; the part after the decimal point is the fraction of reads to keep, so 0.1 means seed 0 and 10% of the reads)
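For the numbers in the question, the fraction to keep is simply target depth divided by current depth, so roughly 100/224. A minimal sketch of that arithmetic follows; the 5 Mbp genome size is my assumption (the question does not state it), and the helper names are made up for illustration.

```python
def coverage(n_pairs, read_len, genome_size):
    """Expected depth for paired-end data: two reads per pair."""
    return 2 * n_pairs * read_len / genome_size


def keep_fraction(current_depth, target_depth):
    """Fraction of reads to keep to go from current to target depth."""
    if current_depth <= 0:
        raise ValueError("current depth must be positive")
    return min(1.0, target_depth / current_depth)


def samtools_s_arg(seed, fraction):
    """Build the SEED.FRACTION value for 'samtools view -s'.

    The fraction is truncated to three decimals so it can never round up to 1.
    """
    if not 0.001 <= fraction < 1:
        raise ValueError("fraction must be in [0.001, 1)")
    return f"{seed}.{int(fraction * 1000):03d}"


# 1,873,799 pairs of 300 nt reads on an assumed 5 Mbp genome
depth = coverage(1_873_799, 300, 5_000_000)
fraction = keep_fraction(224, 100)
print(round(depth, 1), round(fraction, 3), samtools_s_arg(42, fraction))
```

The same fraction applies whichever tool does the subsampling; for assembly you would subsample the FASTQ pairs rather than a BAM, keeping R1 and R2 in sync.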

9.1 years ago

If you are getting a lot more genes (how did you determine this?) or a much longer genome than you expect, I doubt it is because of high coverage. It sounds more like a contamination issue. I've seen assemblies of Listeria contaminated with Salmonella, where the assembly typically comes out at around 8 Mbp (3+5). If you plot the GC content of your assembled contigs, contamination should be quite easy to detect (unless you're unlucky enough to have contamination from a species with the same GC content).
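Getting per-contig GC out of the assembly takes only a few lines. The following is an illustrative sketch, assuming a plain FASTA file of contigs; the function names are made up, and Ns are left out of the denominator.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # keep only the first word of the header as the contig name
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_fraction(seq):
    """GC fraction over unambiguous bases (A, C, G, T) only."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def gc_per_contig(lines):
    """Map contig name to (length, GC fraction)."""
    return {name: (len(seq), gc_fraction(seq)) for name, seq in read_fasta(lines)}
```

Plotting a histogram of those GC values, weighted by contig length, would show two humps if two organisms with different GC content are mixed in the assembly.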

9.1 years ago
I don't believe that you will get a better assembly by lowering coverage. For certain genomes, assembly is compromised if only plain short-read sequencing is done. I assume that the quality of your reads is good enough, or at least that you filtered them by quality. A different strategy is needed, such as including mate-pair and/or long reads (454, PacBio or long Illumina sequences), or comparing against a trusted reference genome if one is available. The assembler and its settings are also important. For example, with de Bruijn graph assemblers, the choice of k-mer size is crucial.