Question

fastqc overrepresented sequences

8.8 years ago

IP ▴ 780

Hi everyone!

I am working with NGS data, with reads ranging from 30 to 250 bp. If I run fastqc on my fastq file without clipping, the over-represented sequences look fine: the most over-represented sequence accounts for 16% of reads and has no source hit. But when I clip the reads by removing the first 30 nucleotides, the most over-represented sequence accounts for 42% of reads and is reported as an ABI Solid Adapter contaminant.

In the fastqc documentation at http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/9%20Overrepresented%20Sequences.html, I have read the following:

"Because the duplication detection requires an exact sequence match over the whole length of the sequence any reads over 75bp in length are truncated to 50bp for the purposes of this analysis."

Does this mean that fastqc is not looking at the part of my reads from 75 to 250 bp? If that is not the case, what else could cause so many of the over-represented reads in the clipped data to be attributed to the ABI Solid Adapter?
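To make the quoted rule concrete, here is a minimal Python sketch of how I read it. This is not FastQC's actual code: the function names are hypothetical, the 75/50 thresholds come from the quoted sentence, and the 0.1% reporting cutoff is an assumption based on the same help page.

```python
from collections import Counter

def dedup_key(read, long_read_threshold=75, truncated_length=50):
    """Return the part of a read used for exact matching under the quoted rule.

    Reads longer than 75 bp are represented by their first 50 bp only;
    shorter reads are compared over their whole length.
    """
    if len(read) > long_read_threshold:
        return read[:truncated_length]
    return read

def overrepresented(reads, min_fraction=0.001):
    """Fraction of reads per matching key, keeping keys above min_fraction.

    min_fraction=0.001 (0.1%) is an assumed cutoff, not taken from FastQC's code.
    """
    counts = Counter(dedup_key(r) for r in reads)
    total = len(reads)
    return {seq: n / total for seq, n in counts.items() if n / total > min_fraction}

# Toy reads: two 100-bp reads that differ only after position 50, plus one short read.
toy_reads = ["A" * 50 + "C" * 50, "A" * 50 + "G" * 50, "T" * 40]
for seq, fraction in overrepresented(toy_reads).items():
    print(len(seq), round(fraction, 3))
```

Under this reading, the two long toy reads collapse to the same 50-bp key, so anything past position 50 of a long read would not influence this particular module.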

thanks for reading,

fastqc RNA-Seq • 4.6k views

How was the data generated?

ADD REPLY • link 8.8 years ago by WouterDeCoster 48k

The data comes from 4C-seq. Basically, for a known chromosomal position, we crosslink the regions that interact with it. The crosslinked DNA is then digested with two restriction enzymes and sequenced. In this case, it was sequenced on an Ion Torrent machine.

ADD REPLY • link 8.8 years ago by IP ▴ 780

And you find ABI Solid Adapter in there? Isn't that rather remarkable?

ADD REPLY • link 8.8 years ago by WouterDeCoster 48k

Yeah, my question is why fastqc misses that sequence when I run it on my fastq file without removing the first 30 nucleotides. I would expect it to show up as an overrepresented sequence before clipping as well.
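One possible explanation, which I have not confirmed on the real data: if the comparison is an exact match over the compared part of each read, then reads that differ within their first 30 nucleotides count as different sequences even when everything after that is identical. The toy Python sketch below illustrates only that effect. The reads, the `SHARED` sequence, and the function names are all made up for illustration, and nothing here calls fastqc itself.

```python
from collections import Counter
from itertools import islice, product

# Hypothetical 40-nt sequence shared by every toy read (not a real adapter).
SHARED = "ACGT" * 10

def make_reads(n):
    """Build n toy reads: a distinct 30-nt prefix followed by SHARED."""
    prefixes = ("".join(p) for p in islice(product("ACGT", repeat=30), n))
    return [prefix + SHARED for prefix in prefixes]

def top_fraction(reads):
    """Fraction of reads taken up by the single most common exact sequence."""
    counts = Counter(reads)
    return counts.most_common(1)[0][1] / len(reads)

reads = make_reads(1000)
clipped = [r[30:] for r in reads]  # remove the first 30 nucleotides

print("unclipped:", top_fraction(reads))   # every read is unique here
print("clipped:  ", top_fraction(clipped)) # every read is identical here
```

In this toy case the shared sequence is invisible to exact matching until the variable prefix is clipped off. Whether something similar is happening in my data depends on what the first 30 nucleotides actually contain.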

ADD REPLY • link 8.8 years ago by IP ▴ 780

I'd say that 16% is quite a lot, but perhaps that's typical for 4C-seq (which I'm unfamiliar with).

ADD REPLY • link 8.8 years ago by WouterDeCoster 48k