Question

kmer distribution tiny peak at low coverage

0

6.7 years ago

deepti1rao ▴ 50

I generated a kmer count file using Jellyfish and then a histogram which, when plotted in R, gave the attached graph. I am confused about why there is a small peak at coverage 22. I see a similar tiny peak even for kmer values as high as 115.

How does one interpret this for a genome expected to have 50-60% repeats?
How can I extract reads pertaining to the tiny peaks?
I am suspecting that I can correlate this with higher GC content in some reads, as you can see in the attached file generated by fastqc.
Can I safely interpret this tiny peak as a non-erroneous peak and retain those kmers for assembly?

kmer distribution Assembly • 3.4k views

ADD COMMENT • link updated 6.7 years ago by h.mon 35k • written 6.7 years ago by deepti1rao ▴ 50

0

Please be more careful when adding images: the URL you use must point directly to the image. For example, you used https://ibb.co/gFwTZc where you should have used https://image.ibb.co/kNrHSx/per_sequence_gc_content.png

Right click on the image in the page (https://ibb.co/gFwTZc) and select Copy Image Address to get the actual image URL.

ADD REPLY • link 6.7 years ago by Ram 44k

1

A better option is to click the embed code tab at the bottom of the page, then copy the full image HTML link and paste it into the post (like below).

ADD REPLY • link 6.7 years ago by GenoMax 147k

score 0 · Answer 1 · 2018-03-16

0

6.7 years ago

h.mon 35k

How can I extract reads pertaining to the tiny peaks?

https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbnorm-guide/

bbnorm.sh in=reads.fq outlow=low.fq outmid=mid.fq outhigh=high.fq passes=1 lowbindepth=40 highbindepth=110

I am suspecting that I can correlate this with higher GC content in some reads, as you can see in the attached file generated by fastqc.

If by "correlate this" you mean the different kmer peaks, run FastQC again after splitting the reads into coverage bins. Then you can check each bin's GC content.

ADD COMMENT • link 6.7 years ago by h.mon 35k

0

Thanks for that. I also want to extract the kmer sequences of low coverage, which have given rise to the tiny peak in the kmer distribution (attached). I want to check if they have a higher GC content.
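
One way to check this directly is sketched below in Python. It is only an illustration under assumptions: it expects two-column text lines of the form `KMER COUNT` (the layout I would expect from a column-format Jellyfish dump), and the count window (15 to 30) and the example kmers are made up, not taken from the attached data.

```python
from io import StringIO


def gc_fraction(seq):
    """Fraction of G/C bases in a sequence (case-insensitive)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def mean_gc_in_window(lines, low, high):
    """Mean GC of kmers whose count lies in [low, high].

    `lines` yields 'KMER COUNT' text lines (assumed layout).
    Returns (number_of_kmers, mean_gc); mean_gc is None if nothing matched.
    """
    n = 0
    total = 0.0
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        kmer, count = parts[0], int(parts[1])
        if low <= count <= high:
            n += 1
            total += gc_fraction(kmer)
    return n, (total / n if n else None)


# Tiny made-up input; in practice, pass an open file handle instead.
example = StringIO("GGCC 22\nATAT 60\nGCAT 20\nAAAA 1\n")
n, mean_gc = mean_gc_in_window(example, low=15, high=30)
print(n, mean_gc)
```

Running the same function with a window around the main peak gives a second mean GC value to compare against.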

ADD REPLY • link 6.7 years ago by deepti1rao ▴ 50

0

Would the coverage bins here split reads based on read coverage or on kmer coverage?

ADD REPLY • link 6.7 years ago by deepti1rao ▴ 50

0

Kmer coverage over the length of a read, so it correlates highly with read coverage.
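
To make "kmer coverage over the length of a read" concrete, here is a toy Python sketch. It is not BBNorm's code. It assumes the per-read depth is the median count of the read's kmers, ignores reverse complements, uses k=3 instead of 31 so the example stays readable, and treats the two cutoffs as exclusive bounds, which is my assumption about how the low and high bins are delimited.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping kmers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def read_depth(seq, counts, k):
    """Median count of the read's kmers (0 if the read is shorter than k)."""
    ks = kmers(seq, k)
    return median(counts[km] for km in ks) if ks else 0


def depth_bin(depth, low_cutoff, high_cutoff):
    """Assign a read to a low/mid/high bin (boundary handling is assumed)."""
    if depth < low_cutoff:
        return "low"
    if depth > high_cutoff:
        return "high"
    return "mid"


# Made-up reads: one sequence seen three times, one seen once.
reads = ["ACGTAC"] * 3 + ["TTTTGG"]
k = 3
counts = Counter(km for r in reads for km in kmers(r, k))

for r in sorted(set(reads)):
    d = read_depth(r, counts, k)
    print(r, d, depth_bin(d, low_cutoff=2, high_cutoff=5))
```

A read sampled many times has kmers that are all counted many times, which is why the per-read kmer depth tracks read coverage.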

http://seqanswers.com/forums/showthread.php?t=49763

How does BBNorm work, and why is it better than other tools?

BBNorm counts kmers; by default, 31-mers. It reads the input once to count them. Then it reads the input a second time to process the reads according to their kmer frequencies. For this reason, unlike most other BBTools, BBNorm CANNOT accept piped input. For normalization, it discards reads with probability based on the ratio of the desired coverage to the median of the counts of a read's kmers.
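
The last sentence of that description can be sketched as follows. This is only the quoted rule turned into Python, not BBNorm's actual implementation (which does more than this); the function names, the target depth of 100, and the fixed random seed are all illustrative assumptions.

```python
import random


def keep_probability(read_depth, target_depth):
    """Probability of keeping a read given its median kmer depth.

    Reads at or below the target are always kept; deeper reads are kept
    with probability target / depth.
    """
    if read_depth <= target_depth:
        return 1.0
    return target_depth / read_depth


def normalize(reads_with_depth, target_depth, seed=0):
    """Randomly subsample (read, depth) pairs toward the target depth."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    return [
        read
        for read, depth in reads_with_depth
        if rng.random() < keep_probability(depth, target_depth)
    ]


# Made-up input: 10,000 reads at depth 200 and 1,000 reads at depth 20.
reads = [("deep", 200)] * 10000 + [("shallow", 20)] * 1000
kept = normalize(reads, target_depth=100)
print(kept.count("deep"), kept.count("shallow"))
```

Roughly half of the depth-200 reads survive, while every depth-20 read is kept, so low-coverage regions such as the tiny peak are not thinned by normalization itself.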

ADD REPLY • link 6.7 years ago by h.mon 35k

0

I just need to know for sure that this tiny peak is genuine data and not erroneous, so that I can include those kmers when I set a kmer coverage cutoff for genome assembly.

ADD REPLY • link 6.7 years ago by deepti1rao ▴ 50