remove over-represented sequences
9.7 years ago
wuzongze001 ▴ 10

I downloaded SRR1029258.sra (GSM1263454) from GEO and ran fastq-dump to convert it to FASTQ, then ran FastQC for quality control. The sequence quality was not high enough, and the most over-represented sequence accounted for about 29.9% of reads. So I ran:

fastq_quality_trimmer -t 10 -l 40 -i SRR1029258.fastq -o trim_SRR1029258.fastq

The per-base sequence quality is better, but the most over-represented sequence now accounts for 33.1% of reads.

From what I found on Google, over-represented sequences may be due to contamination and may lead to wrong conclusions. Is there any way I can remove these reads?

ChIP-Seq next-gen-sequencing alignment • 9.0k views
9.7 years ago

Given that this is a ChIPseq experiment, over-represented sequences are expected. If you were to completely remove these, then you'd end up removing at least some of the real peaks.


Thank you very much! I suddenly realize that.

9.7 years ago
Adrian Pelin ★ 2.6k

rRNA is a common over-represented sequence; in some of the RNA-Seq reads I'm working with, I have contamination ranging from 15% to 22%. The easiest thing to do is to BLAST your over-represented sequence. Does FastQC say "No Hit" next to it?
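If you would rather pull the candidate sequences straight from the FASTQ file than from the report, here is a minimal Python sketch. It assumes an uncompressed, four-line-per-record FASTQ; the function names `top_sequences` and `to_fasta` are made up for illustration. Note that it counts whole reads, so the percentages will not necessarily match what FastQC reports.

```python
from collections import Counter
from itertools import islice


def top_sequences(fastq_lines, n=5):
    """Return the n most frequent read sequences as (sequence, count, percent)."""
    # In a four-line FASTQ record the sequence is the second line,
    # so take every 4th line starting at index 1.
    counts = Counter(line.strip() for line in islice(fastq_lines, 1, None, 4))
    total = sum(counts.values())
    return [(seq, c, 100.0 * c / total) for seq, c in counts.most_common(n)]


def to_fasta(top):
    """Format the output of top_sequences as FASTA text that can be pasted into BLAST."""
    return "".join(
        f">overrep_{i}_{pct:.1f}pct\n{seq}\n"
        for i, (seq, _count, pct) in enumerate(top, 1)
    )
```

Usage would be something like `with open("SRR1029258.fastq") as fh: print(to_fasta(top_sequences(fh)))`, then paste the output into the BLAST web form.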


I BLASTed it, and the top two hits are both rRNA. The top five over-represented sequences in my FastQC result account for 33%, 4.5%, 3.6%, 1.9%, and 0.6% of reads. I do not really understand "Does FastQC say No Hit next to it"; could you explain this?
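For the rRNA case specifically, a rough way to drop the matching reads is a plain substring filter. This is only a sketch under a few assumptions: uncompressed four-line FASTQ, exact matches only (no mismatches, no reverse complement), and `drop_reads_matching` is a hypothetical helper name. As pointed out above, for ChIP-seq data you would not want to remove every over-represented sequence this way, only ones you have confirmed to be contamination.

```python
def drop_reads_matching(fastq_lines, contaminants):
    """Yield FASTQ records (lists of 4 lines) whose sequence contains no contaminant.

    `contaminants` is an iterable of exact sequences to screen for,
    e.g. the over-represented sequence that BLAST identified as rRNA.
    """
    contaminants = list(contaminants)
    record = []
    for line in fastq_lines:
        record.append(line)
        if len(record) == 4:  # one complete FASTQ record collected
            seq = record[1].strip()
            if not any(c in seq for c in contaminants):
                yield record
            record = []
```

To write the kept reads back out: `out.writelines(line for rec in drop_reads_matching(fh, seqs) for line in rec)`.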


FastQC has a limited database of things that it checks against, so it won't be as thorough as BLAST.


Devon has a point: having the entire rRNA database in the package would make the software large and difficult to download. Although I must say, I haven't figured out how to easily copy and paste the over-represented sequences; I usually have to save the report and then open the HTML to copy the sequence.
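One possible way around the copy-and-paste problem is to read the plain-text `fastqc_data.txt` that FastQC writes inside its zipped output, rather than the HTML. A small sketch, assuming the usual tab-separated layout where the module starts with `>>Overrepresented sequences` and ends with `>>END_MODULE`; the function name is hypothetical:

```python
def overrepresented_from_fastqc(data_text):
    """Extract (sequence, percentage, possible_source) rows from fastqc_data.txt text."""
    rows, inside = [], False
    for line in data_text.splitlines():
        if line.startswith(">>Overrepresented sequences"):
            inside = True           # entering the module of interest
        elif line.startswith(">>END_MODULE"):
            inside = False          # any module ends here
        elif inside and line and not line.startswith("#"):
            fields = line.split("\t")
            if len(fields) >= 4:    # Sequence, Count, Percentage, Possible Source
                rows.append((fields[0], float(fields[2]), fields[3]))
    return rows
```

Rows whose possible source is "No Hit" are the ones worth sending to BLAST.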


Hi! I ran FastQC and got a bunch of over-represented sequences with "No Hit" next to them. I'm not sure if these are adaptors or important biological information that should stay in. I guess what I'm really asking is: what does "No Hit" mean in this context? I'm going to try removing them (using cutadapt) and see if the per-base quality improves (I was getting really low quality at the ends of reads). Any advice on this issue would be really appreciated. Cheers!


"The easiest thing to do is to blast your overrepresented sequence"

9.7 years ago
Gary ▴ 480

If the percentage of unique reads is too low, it could be caused by PCR bias during ChIP-Seq library preparation.
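A quick, rough way to look at this before alignment is to count how many distinct read sequences the FASTQ contains. The sketch below assumes an uncompressed four-line FASTQ, and `duplication_summary` is a made-up name. Sequence-level uniqueness is only a proxy; duplicates in ChIP-Seq are more commonly assessed after alignment, by mapping position.

```python
from collections import Counter


def duplication_summary(fastq_lines):
    """Summarise how many read sequences are distinct in a FASTQ stream."""
    # The sequence is the second line of every four-line record.
    seqs = [line.strip() for i, line in enumerate(fastq_lines) if i % 4 == 1]
    counts = Counter(seqs)
    total = len(seqs)
    distinct = len(counts)
    return {
        "total": total,
        "distinct": distinct,
        "distinct_fraction": distinct / total if total else 0.0,
    }
```

A low `distinct_fraction` would be consistent with the PCR bias mentioned above, but it does not prove it on its own.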
