Question

remove over-represented sequences

0

9.7 years ago

wuzongze001 ▴ 10

I downloaded SRR1029258.sra (GSM1263454) from GEO and ran fastq-dump to convert it to FASTQ. Then I used FastQC for quality control. The sequence quality was not high enough, and the highest over-represented sequence accounted for about 29.9% of reads. So I ran:

fastq_quality_trimmer -t 10 -l 40 -i SRR1029258.fastq -o trim_SRR1029258.fastq

The per-base sequence quality is better, but the highest over-represented sequence percentage rose to 33.1%.

According to Google, over-represented sequences may be due to contamination and may lead to wrong conclusions. Is there any way I can remove these reads?

ChIP-Seq next-gen-sequencing alignment • 9.0k views

ADD COMMENT • link updated 2.1 years ago by Ram 44k • written 9.7 years ago by wuzongze001 ▴ 10

Ram · Answer 1 · 2015-04-05

1

9.7 years ago

Devon Ryan 104k

Given that this is a ChIPseq experiment, over-represented sequences are expected. If you were to completely remove these, then you'd end up removing at least some of the real peaks.

ADD COMMENT • link 9.7 years ago by Devon Ryan 104k

0

Thank you very much! I suddenly realize that.

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.7 years ago by wuzongze001 ▴ 10

Ram · Answer 2 · 2015-04-05

0

9.7 years ago

Adrian Pelin ★ 2.6k

rRNA is a common over-represented sequence; in some of the RNA-Seq reads I'm working with, contamination ranges from 15% to 22%. The easiest thing to do is to BLAST your over-represented sequence. Does FastQC say "No Hit" next to it?

ADD COMMENT • link 9.7 years ago by Adrian Pelin ★ 2.6k

0

I BLASTed it, and the top two hits are both rRNA. The top five over-represented sequences in my FastQC result account for 33%, 4.5%, 3.6%, 1.9%, and 0.6%. I do not really understand "Does FastQC say No Hit next to it". Could you explain this?

ADD REPLY • link 9.7 years ago by wuzongze001 ▴ 10

0

FastQC has a limited database of things that it checks against, so it won't be as thorough as BLAST.

ADD REPLY • link 9.7 years ago by Devon Ryan 104k

0

Devon has a point: having the entire rRNA database in the package would make the software large and difficult to download. Although I must say, I haven't figured out how to easily copy and paste the over-represented sequences; I usually have to save the report and then open the HTML to copy the sequence.
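One way around the copy-and-paste problem might be to read the sequences from the plain-text fastqc_data.txt file that FastQC writes alongside the HTML report, and turn them into FASTA for BLAST. The sketch below is only an illustration: it assumes the "Overrepresented sequences" module in that file is a tab-separated table ending at ">>END_MODULE", and the function names and the "overrep_N" FASTA headers are made up for this example.

```python
def parse_overrepresented(fastqc_data_text):
    """Collect (sequence, count, percentage, source) rows from the
    'Overrepresented sequences' module of a fastqc_data.txt file."""
    records = []
    in_module = False
    for line in fastqc_data_text.splitlines():
        if line.startswith(">>Overrepresented sequences"):
            in_module = True
            continue
        if not in_module:
            continue
        if line.startswith(">>END_MODULE"):
            break
        # Skip the '#Sequence ...' column header and any blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # not a data row in the assumed 4-column layout
        seq, count, pct, source = fields[:4]
        records.append((seq, int(float(count)), float(pct), source))
    return records


def to_fasta(records):
    """Format the parsed rows as FASTA text (hypothetical header naming)."""
    lines = []
    for i, (seq, count, pct, _source) in enumerate(records, start=1):
        lines.append(f">overrep_{i} count={count} pct={pct:.2f}")
        lines.append(seq)
    return "\n".join(lines) + "\n" if lines else ""


# Example use (the path is an assumption about where the report was unzipped):
# with open("SRR1029258_fastqc/fastqc_data.txt") as handle:
#     print(to_fasta(parse_overrepresented(handle.read())), end="")
```

The resulting FASTA can be pasted into the BLAST web form or passed to a local BLAST search in one go, rather than copying sequences one at a time from the HTML.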

ADD REPLY • link 9.7 years ago by Adrian Pelin ★ 2.6k

0

Hi! I ran FastQC and got a bunch of over-represented sequences with "No Hit" next to them. I'm not sure whether these are adaptors or important biological information that should stay in. I guess what I'm really asking is: what does "No Hit" mean in this context? I'm going to try removing them (using cutadapt) and see if the per-base quality improves (I was getting really low quality at the ends of reads). Any advice anyone could give me on this would be really appreciated. Cheers!

ADD REPLY • link updated 2.1 years ago by Ram 44k • written 9.4 years ago by lawrence.mckechnie • 0

0

"The easiest thing to do is to blast your overrepresented sequence"

ADD REPLY • link 9.4 years ago by Adrian Pelin ★ 2.6k

Ram · Answer 3 · 2015-04-05

0

9.7 years ago

Gary ▴ 480

If the percentage of unique reads is too low, it could be caused by PCR bias during ChIP-Seq library preparation.
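To get a rough feel for how low the unique-read percentage is before alignment, the reads can be counted by exact sequence. The following is a minimal sketch, not FastQC's method and not a substitute for position-based duplicate marking after alignment: it assumes a plain four-line-per-record FASTQ, treats two reads as duplicates only if their sequences are identical, and holds every distinct sequence in memory. The function names are hypothetical.

```python
from collections import Counter


def read_sequences(fastq_lines):
    """Yield the sequence line (2nd of every 4) from FASTQ lines."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:
            yield line.strip()


def duplication_summary(sequences, top_n=5):
    """Return (fraction of distinct sequences, top_n most frequent sequences).

    Each entry of the second item is (sequence, count, percent of all reads).
    """
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0, []
    unique_fraction = len(counts) / total
    top = [(seq, n, 100.0 * n / total) for seq, n in counts.most_common(top_n)]
    return unique_fraction, top


# Example use (file name taken from the question above):
# with open("trim_SRR1029258.fastq") as handle:
#     fraction, top = duplication_summary(read_sequences(handle))
#     print(f"distinct sequences / reads = {fraction:.3f}")
#     for seq, n, pct in top:
#         print(f"{pct:6.2f}%  {n}  {seq}")
```

A very low fraction together with a few dominant sequences would be consistent with PCR duplication or a contaminant such as rRNA, but in ChIP-seq some of that enrichment is expected signal, so this number alone does not say which reads to discard.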

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.7 years ago by Gary ▴ 480