I downloaded SRR1029258.sra (GSM1263454) from GEO and ran fastq-dump to convert it to FASTQ. I then ran FastQC for quality control. The sequence quality was not high enough, and the most over-represented sequence accounted for about 29.9% of the reads. So I ran this command:
fastq_quality_trimmer -t 10 -l 40 -i SRR1029258.fastq -o trim_SRR1029258.fastq
The per-base sequence quality is better, but the percentage of the most over-represented sequence rose to 33.1%.
According to Google, over-represented sequences may be due to contamination and may lead to wrong conclusions. Is there any way I can remove these reads?
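To make the question concrete, here is a minimal sketch in Python of the kind of filtering I have in mind. It assumes the FASTQ file has exactly four lines per record (no wrapped sequence lines) and that the unwanted sequences are copied by hand from the FastQC "Overrepresented sequences" table. The function names `read_fastq` and `filter_reads` are my own and do not come from any existing tool. It only drops reads that contain an unwanted sequence exactly, so reads with mismatches or partial matches would stay in the file.

```python
def read_fastq(handle):
    """Yield (header, sequence, plus, quality) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the iteration
            break
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def filter_reads(in_path, out_path, unwanted):
    """Write reads that contain none of the `unwanted` sequences to out_path.

    `unwanted` is a list of sequence strings (assumption: taken manually
    from the FastQC report). Returns a (kept, dropped) tuple of read counts.
    """
    kept = dropped = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for header, seq, plus, qual in read_fastq(fin):
            # Exact substring match only; no mismatches are tolerated.
            if any(u in seq for u in unwanted):
                dropped += 1
                continue
            fout.write(f"{header}\n{seq}\n{plus}\n{qual}\n")
            kept += 1
    return kept, dropped
```

I would call it as `filter_reads("trim_SRR1029258.fastq", "filtered_SRR1029258.fastq", sequences)`, where `sequences` is the list copied from the report and the output file name is just a placeholder. Whether dropping these reads is appropriate probably depends on what the sequence actually is (an adapter or contaminant versus a genuinely abundant transcript), which is part of what I am unsure about.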
Thank you very much! I suddenly realized that.