kmer distribution tiny peak at low coverage
6.7 years ago
deepti1rao ▴ 50

jellyfish.histo file
fastqc file

I generated a kmer count file using jellyfish and then a histogram from it, which when plotted in R gave the attached graph. I am confused about why there is a small peak at coverage 22. I see a similar tiny peak even with kmer sizes as high as 115.

  1. How does one interpret this for a genome expected to have 50-60% repeats?
  2. How can I extract reads pertaining to the tiny peaks?
  3. I am suspecting that I can correlate this with higher GC content in some reads, as you can see in the attached file generated by fastqc.
  4. Can I safely interpret this tiny peak as a non-erroneous peak and retain those kmers for assembly?
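As a starting point for describing the tiny peak, the histogram can be scanned for local maxima. The following is a minimal sketch, assuming the two-column text layout (count, then number of distinct kmers) that `jellyfish histo` writes; the function names, the window size, and the file name in the usage comment are illustrative choices, not part of jellyfish.

```python
def parse_histo(text):
    """Parse 'count frequency' lines into a sorted list of (count, frequency)."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            rows.append((int(parts[0]), int(parts[1])))
    return sorted(rows)


def local_peaks(rows, window=2, min_count=2):
    """Return rows whose frequency is a strict maximum within +/- window rows.

    min_count skips the lowest counts, which are usually dominated by
    sequencing-error kmers. The last row of a histogram may be an overflow
    bin, so a peak reported there should be checked by eye.
    """
    peaks = []
    for i, (count, freq) in enumerate(rows):
        if count < min_count:
            continue
        lo = max(0, i - window)
        hi = min(len(rows), i + window + 1)
        neighbours = [rows[j][1] for j in range(lo, hi) if j != i]
        if neighbours and all(freq > f for f in neighbours):
            peaks.append((count, freq))
    return peaks


# Usage (the file name is an assumption):
# with open("jellyfish.histo") as fh:
#     print(local_peaks(parse_histo(fh.read()), window=5))
```

A larger `window` suppresses noise-level bumps; whether a small peak reflects real sequence still has to be judged against the main coverage peak.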
kmer distribution Assembly • 3.4k views

Be more careful adding images please: the URL you use must point directly to the image. For example, you used: https://ibb.co/gFwTZc where you should have used: https://image.ibb.co/kNrHSx/per_sequence_gc_content.png

Right click on the image in the page (https://ibb.co/gFwTZc) and select Copy Image Address to get the actual image URL.


A better option is to click on the embed code tab at the bottom of the page, copy the full image HTML link, and paste it into the post (like below).

per sequence gc content

6.7 years ago
h.mon 35k

How can I extract reads pertaining to the tiny peaks?

https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbnorm-guide/

bbnorm.sh in=reads.fq outlow=low.fq outmid=mid.fq outhigh=high.fq passes=1 lowbindepth=40 highbindepth=110

I am suspecting that I can correlate this with higher GC content in some reads, as you can see in the attached file generated by fastqc.

If by "correlate this" you mean the different kmer peaks, run FastQC again on each file after splitting the reads into coverage bins. Then you can check the GC content of each bin.
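For a quick numeric comparison alongside FastQC, the mean GC fraction of each bin can be computed directly. This is a rough sketch, assuming uncompressed FASTQ with exactly four lines per record; the helper names are made up for illustration, and the file names in the usage comment are the ones from the bbnorm.sh command above.

```python
def fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def gc_fraction(seq):
    """Fraction of G and C among the A/C/G/T bases (N and others ignored)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def mean_gc(lines):
    """Mean per-read GC fraction over all reads in a FASTQ line iterator."""
    values = [gc_fraction(seq) for seq in fastq_sequences(lines)]
    return sum(values) / len(values) if values else 0.0


# Usage:
# for name in ("low.fq", "mid.fq", "high.fq"):
#     with open(name) as fh:
#         print(name, round(mean_gc(fh), 3))
```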


Thanks for that. I also want to extract the low-coverage kmer sequences that give rise to the tiny peak in the kmer distribution (attached). I want to check whether they have a higher GC content.
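One possible way to do this is to dump the kmers as text and filter them by count. The sketch below assumes two-column "kmer count" lines, such as those produced by `jellyfish dump -c`; the function names and the count bounds in the usage comment (15 to 30, bracketing the peak at 22) are placeholders to adjust after looking at the histogram.

```python
def kmers_in_range(lines, low, high):
    """Yield (kmer, count) for lines whose count lies in [low, high]."""
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        kmer, count = parts[0], int(parts[1])
        if low <= count <= high:
            yield kmer, count


def gc_fraction(kmer):
    """Fraction of G and C bases in a kmer."""
    kmer = kmer.upper()
    return (kmer.count("G") + kmer.count("C")) / len(kmer) if kmer else 0.0


def mean_kmer_gc(lines, low, high):
    """Mean GC fraction of the distinct kmers with counts in [low, high]."""
    values = [gc_fraction(kmer) for kmer, _ in kmers_in_range(lines, low, high)]
    return sum(values) / len(values) if values else 0.0


# Usage (file name and bounds are assumptions):
# with open("kmer_counts.txt") as fh:
#     print(mean_kmer_gc(fh, 15, 30))
```

Running the same function over the count range of the main peak gives a baseline to compare against.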


Would the coverage bins here split reads based on read coverage or on kmer coverage?


kmer coverage over the length of a read - so it correlates highly with read coverage.

http://seqanswers.com/forums/showthread.php?t=49763

How does BBNorm work, and why is it better than other tools?

BBNorm counts kmers; by default, 31-mers. It reads the input once to count them. Then it reads the input a second time to process the reads according to their kmer frequencies. For this reason, unlike most other BBTools, BBNorm CANNOT accept piped input. For normalization, it discards reads with probability based on the ratio of the desired coverage to the median of the counts of a read's kmers.
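The two-pass idea in the quoted description can be illustrated with a toy version. This is only a sketch of the principle and not BBNorm's implementation: it uses an exact in-memory counter, ignores reverse complements and read pairs, and the `target` parameter and function names are hypothetical.

```python
import random
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping kmers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    """Pass 1: count every kmer across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def read_depth(read, counts, k):
    """Median count of a read's kmers, used as its depth estimate."""
    ks = kmers(read, k)
    return median(counts[x] for x in ks) if ks else 0


def normalize(reads, k=31, target=100, rng=None):
    """Pass 2: keep each read with probability min(1, target / depth)."""
    rng = rng or random.Random(0)  # fixed seed so the toy run is repeatable
    counts = count_kmers(reads, k)
    kept = []
    for read in reads:
        depth = read_depth(read, counts, k)
        keep_prob = 1.0 if depth <= target else target / depth
        if rng.random() < keep_prob:
            kept.append(read)
    return kept
```

Because the depth of a read is the median over its kmers, a few erroneous or repeated kmers within a read do not change its bin much, which is why kmer coverage tracks read coverage closely.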


I just need to know for sure that this tiny peak is genuine data and not erroneous, so that I can include those kmers when I set a kmer coverage cutoff for genome assembly.
