Hi everyone!
I am working with NGS data, and I have reads ranging from 30 to 250 bp. If I run fastqc on my fastq file without clipping, my over-represented sequences look fine, in that the most over-represented sequence is at 16% with no source hit. But when I clip the sequences by removing the first 30 nucleotides, the highest percentage of over-represented sequences, 42%, is sourced as an ABI Solid Adapter contaminant.
In the fastqc documentation at http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/9%20Overrepresented%20Sequences.html, I have read the following:
"Because the duplication detection requires an exact sequence match over the whole length of the sequence any reads over 75bp in length are truncated to 50bp for the purposes of this analysis."
Does this mean that fastqc is not looking at the part of my reads ranging from 75 to 250bp? If that is not the case, what else could be causing the abundance of over-represented reads to be attributed to the ABI Solid Adapter in the clipped sequences?
Thanks for reading,
How was the data generated?
The data comes from 4C-seq. Basically, for a known chromosomal position, we crosslink the regions that are interacting with this known region. Then this is digested with two restriction enzymes and sequenced. In this case, it was sequenced on an Ion Torrent machine.
And you find ABI Solid Adapter in there? Isn't that rather remarkable?
Yeah, my question is why fastqc misses that sequence when I run it on my fastq file without removing the first 30 nucleotides. My guess is that it should already appear as an over-represented sequence before the first 30 nucleotides are removed.
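To make the truncation rule quoted from the documentation concrete, here is a minimal Python sketch of exact-match counting with that rule applied. It is not FastQC's actual implementation; the function names (`analysis_key`, `overrepresented`), the `SHARED` sequence and the toy reads are all made up for illustration. It only shows one way clipping could change the result: if a shared sequence sits behind a variable 30-nt start, the compared 50 bp window differs between reads before clipping and is identical after clipping.

```python
from collections import Counter

def analysis_key(seq, clip=0):
    """String compared for exact matches (sketch of the documented rule)."""
    seq = seq[clip:]        # optional 5' clipping done before the analysis
    if len(seq) > 75:       # "any reads over 75bp ... are truncated to 50bp"
        seq = seq[:50]
    return seq

def overrepresented(reads, clip=0):
    """Return (sequence, fraction of reads) pairs, most frequent first."""
    keys = [analysis_key(r, clip) for r in reads]
    keys = [k for k in keys if k]   # drop reads emptied by clipping
    counts = Counter(keys)
    return [(k, n / len(keys)) for k, n in counts.most_common()]

# Hypothetical 50-nt shared sequence (a placeholder, not a real adapter).
SHARED = "ACGT" * 12 + "AC"

# Toy reads: a different 30-nt start, the shared sequence, a different tail.
reads = ["ACGT"[i] * 30 + SHARED + "ACGT"[i] * (30 + i) for i in range(4)]

print(overrepresented(reads)[0][1])          # top fraction without clipping
print(overrepresented(reads, clip=30)[0][1]) # top fraction after clipping 30 nt
```

In this toy set, every unclipped read yields a different 50 bp window, so nothing stands out, while after clipping all four windows equal `SHARED`. Note also that, under the quoted rule, reads of 75 bp or less are compared over their whole length, so short reads of different lengths that carry the same adapter would still count as different sequences.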
I'd say that 16% is quite a lot, but perhaps that's typical for 4C-seq (which I'm unfamiliar with).