Question

Illumina ChIP-seq run failing Kmer Content module in FastQC

0

10.1 years ago

James Ashmore ★ 3.5k

I have some Illumina data from a transcription factor ChIP-seq study. Both before and after quality trimming, the sample fails the Kmer Content module. Here is a graph of the distribution over read length: https://www.dropbox.com/s/vj7lqzb3ou6ycjn/Screenshot%20from%202014-10-17%2011%3A03%3A06.png?dl=0

note: I tried embedding the image but it won't show up in the preview

At first I thought it might be the binding motif for the transcription factor. However, comparing the canonical binding motif to the consensus sequence assembled from the kmers shows that it is not. Any suggestions as to what these spikes are?

ChIP-Seq FastQC • 4.4k views

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by James Ashmore ★ 3.5k

0

We do not have permission to open your URL.

ADD REPLY • link 10.1 years ago by Coryza ▴ 430

0

Updated with new link, should be visible now.

ADD REPLY • link 10.1 years ago by James Ashmore ★ 3.5k

0

I have this problem with one of my input samples: the IP samples do not yield any peaks when I use this input, but they do when I use a different input. I would appreciate your feedback on how to filter these kmer sequences.

ADD REPLY • link 8.4 years ago by ATCG ▴ 400

Ram · Accepted Answer · 2014-10-17

4

10.1 years ago

Istvan Albert 102k

With this plot, what you need to evaluate is the table under the plot. It lists the actual counts of the observed enrichments. Often these counts are very low yet still come up as significant.

On the plot itself, each kmer enrichment is scaled to 100 independently, so even small deviations can look more concerning than they actually are.
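To illustrate the scaling point, here is a minimal sketch. It assumes that "scaled to 100 independently" means each kmer's positional profile is divided by its own maximum; this is not FastQC's actual code, and the function name and the two profiles are made up for illustration.

```python
def scale_to_100(profile):
    """Rescale one kmer's per-position values so its own peak equals 100."""
    peak = max(profile)
    return [100 * value / peak for value in profile]

# Hypothetical per-position counts for two kmers (invented numbers).
abundant_kmer = [10, 12, 5000, 11]  # a large spike in absolute terms
rare_kmer = [0, 1, 8, 1]            # a tiny spike in absolute terms

scaled_abundant = scale_to_100(abundant_kmer)
scaled_rare = scale_to_100(rare_kmer)

# Both curves now peak at 100, although one spike is 5000 and the other is 8.
print(max(scaled_abundant), max(scaled_rare))
```

After this kind of rescaling the two spikes are equally tall on the plot, which is why the counts in the table are the thing to check.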

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by Istvan Albert 102k

0

Having never done this before, could you quantify low counts? Attached is the table of counts, and they all appear to be in the hundreds to low thousands (https://www.dropbox.com/s/yg3ouyuexnqag6q/Screenshot%20from%202014-10-17%2016%3A02%3A15.png?dl=0). I guess that, given the total number of reads is in the millions, these are low counts?

ADD REPLY • link 10.1 years ago by James Ashmore ★ 3.5k

0

What you need to learn about bioinformatics is that YOU have to make that determination, not someone else.

Is 1000 low if you have 1 million reads? What do you think? In fact this might be one of the easiest of the subjective questions that you will encounter so best get your practice in.

It is also important to note that the kmers can overlap and thus be reported multiple times. An 11 bp sequence that is present 1000 times will, when reported as 6-mers, show up as six different kmers, each shifted by one base (11 - 6 + 1 = 6 windows), and will look like 6000 instances when in fact there were only 1000.
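The overlap effect can be checked with a short sketch. A sequence of length L contains L - k + 1 windows of length k, so an 11-base sequence contains 11 - 6 + 1 = 6 overlapping 6-mers. The sequence below is hypothetical and the counting function is a plain sliding window, not FastQC's implementation.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every overlapping k-mer (sliding window, step 1) across reads."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts

# Hypothetical 11-base sequence occurring 1000 times in the data.
sequence = "ACGTTGCAAGC"
reads = [sequence] * 1000

counts = count_kmers(reads, k=6)
distinct_kmers = len(counts)
total_instances = sum(counts.values())

# One underlying sequence shows up as several table rows, each with count 1000.
print(distinct_kmers, total_instances)
```

Summing the rows of the kmer table therefore overstates how often the underlying sequence really occurs.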

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by Istvan Albert 102k

0

Interesting. In that case, this single kmer is in fact being reported multiple times as different shorter kmers, and so isn't really a concern. Thank you, Istvan.

ADD REPLY • link 10.1 years ago by James Ashmore ★ 3.5k