Hello,
Could someone clearly explain why we set coverage cutoffs on the k-mer distribution in order to improve assemblies?
And how should these cutoffs be determined?
Best
K-mers may be in low abundance because they occur rarely in the genome and, on top of that, happened not to be sequenced many times. A more likely explanation is that rare k-mers come from sequencing errors. You can probably find a statistical argument for that by Googling, but it should be pretty intuitive that k-mers that occur only once or twice are more likely to come from sequencing errors than to be real.
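To make that intuition concrete, here is a minimal sketch (the read and the error position are made up for illustration) showing that a single substitution error changes every k-mer overlapping that base, so one error can create up to k k-mers that do not exist in the genome. Since the same error is unlikely to recur at the same position in other reads, those k-mers end up with a count of one or two.

```python
def kmers(seq, k):
    """Return all k-mers of seq, in order of their start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def novel_kmers_from_error(read, pos, new_base, k):
    """Count k-mers of the mutated read that are absent from the original read."""
    mutated = read[:pos] + new_base + read[pos + 1:]
    true_kmers = set(kmers(read, k))
    return sum(1 for km in kmers(mutated, k) if km not in true_kmers)


# Hypothetical 30 bp read and a single substitution in its middle.
read = "ACGTTGCAAGGCTTACGATCGGATCCATGA"
k = 5
n_novel = novel_kmers_from_error(read, pos=15, new_base="T", k=k)
print(f"one substitution created {n_novel} new {k}-mers (at most {k})")
```

With realistic k values (for example 21 or 31) a single error pollutes the count table with correspondingly more singleton k-mers, which is why the low end of the distribution is dominated by errors.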
Cutoffs are chosen such that we exclude as many k-mers as possible that result from sequencing errors. At the same time, we don't want to throw away the reads with truly rare k-mers. The exact number is determined from the k-mer distribution and the overall sequencing coverage.
This paper may help:
https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-5272-y
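As a rough illustration of how a low cutoff can be read off the k-mer distribution, the sketch below takes a histogram (multiplicity mapped to the number of distinct k-mers with that multiplicity) and looks for the first valley between the error peak near 1 and the main coverage peak. This is only a simple heuristic under the assumption of a clean, unimodal main peak; it is not the method used by any particular tool or by the paper above, and the function names and histogram values are invented for the example.

```python
def error_cutoff(hist):
    """Return the multiplicity of the first local minimum of the histogram.

    hist maps multiplicity -> number of distinct k-mers with that multiplicity.
    Returns None if the counts never start rising again (no valley found).
    """
    max_mult = max(hist)
    # counts[i] holds the value for multiplicity i + 1; missing entries count as 0.
    counts = [hist.get(m, 0) for m in range(1, max_mult + 1)]
    for i in range(len(counts) - 1):
        if counts[i + 1] > counts[i]:
            return i + 1  # convert the 0-based index back to a multiplicity
    return None


def main_peak(hist, valley):
    """Return the multiplicity with the most k-mers to the right of the valley."""
    return max((m for m in hist if m > valley), key=lambda m: hist[m])


# Made-up histogram: error k-mers pile up at low counts, real ones around 30x.
hist = {1: 1_000_000, 2: 200_000, 3: 50_000, 4: 12_000, 5: 6_000,
        6: 9_000, 7: 20_000, 10: 80_000, 20: 300_000, 30: 500_000,
        40: 250_000, 60: 30_000}

valley = error_cutoff(hist)
peak = main_peak(hist, valley)
print(f"valley at {valley}x, main peak at {peak}x")
```

On real data the histogram is noisier, so smoothing or a model fit is usually safer than trusting the first point where the counts rise. An upper cutoff for repeats is often placed at some multiple of the main peak, but the right multiple depends on the genome and the tool.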
Thanks a lot, that helps! I asked because I used the tool purge_dups to improve my assembly. One step of the purge_dups pipeline is estimating these cutoffs, which are stored in a file like this:
5 13 21 25 42 75
I understand that the first one (5) is there to remove the k-mers associated with sequencing errors, and the last one to remove the k-mers associated with high coverage, i.e. repeats. Do you know why we need the four others? I mean, wouldn't the first and last ones be enough?
I don't know the exact answer to your question because I have never used that tool. My guess is that this is akin to the significance thresholds that are used to reject the null hypothesis. While 0.05 is good enough by most standards, the confidence will be greater if it goes below 0.01. If we apply that logic, the first cutoff at 5 would remove the majority of sequencing errors. If you wanted an assembly that is even more accurate at the expense of being less complete, you'd go for the next higher cutoff.