Hello,
I am processing a ChIP-seq experiment I downloaded from GEO (link). The SRA files are massive (39M sequences), so processing them took a while. Briefly, I converted the SRA files to FASTQ with fastq-dump, concatenated the two FASTQ files with cat, and ran FastQC a first time. I discovered that the reads contained an adapter.
I ran Trim Galore! to remove the adapter.
Adapter 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGTCGGATATCTCGTAT', length 50, was trimmed 13590580 times.
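As a rough cross-check of that number, the reads carrying the adapter can be counted directly. The sketch below is only an approximation: it looks for an exact match to the first bases of the adapter, whereas Trim Galore (via Cutadapt) also accepts mismatches and partial matches at the 3' end, so the counts will not agree exactly. The function names and the `seed_len` parameter are made up for illustration.

```python
# Adapter sequence as reported in the Trim Galore output above.
ADAPTER = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGTCGGATATCTCGTAT"


def read_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def count_adapter_reads(sequences, adapter=ADAPTER, seed_len=12):
    """Return (reads containing the adapter seed, total reads).

    Only an exact match to the first `seed_len` adapter bases is counted
    (an assumption made to keep the sketch simple).
    """
    seed = adapter[:seed_len]
    hits = 0
    total = 0
    for seq in sequences:
        total += 1
        if seed in seq:
            hits += 1
    return hits, total
```

With an open file handle `fh` on the untrimmed FASTQ file, `count_adapter_reads(read_sequences(fh))` gives the two counts.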
Then I ran FastQC on the trimmed FASTQ file and obtained the following results.
As you can see the results are not ideal.
The "per base sequence content" and "per base GC content" plots look odd. I am thinking of trimming the ends of the reads. Any comments on that are welcome.
The per sequence GC content shows a mixed distribution. Has anybody encountered that? Could it come from the two FASTQ files I concatenated (a kind of batch effect)?
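One way to test the batch-effect idea is to compute the per-read GC distribution for each of the two FASTQ files separately, before concatenation, and compare the histograms. A minimal sketch follows; the function names and the 5% bin width are illustrative choices, not taken from FastQC.

```python
from collections import Counter


def gc_percent(seq):
    """GC percentage over called bases (A, C, G, T); None if there are none."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return None
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(sequences, bin_width=5):
    """Count reads per GC bin; each key is the lower edge of a bin in percent."""
    hist = Counter()
    for seq in sequences:
        gc = gc_percent(seq)
        if gc is None:
            continue  # skip reads made only of N
        # Reads at exactly 100% GC go into the last bin.
        lower = min(int(gc // bin_width) * bin_width, 100 - bin_width)
        hist[lower] += 1
    return dict(sorted(hist.items()))


def fastq_sequences(path):
    """Yield sequence lines from an uncompressed FASTQ file."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:
                yield line.strip()
```

Running `gc_histogram(fastq_sequences(path))` once per file and comparing the two results shows whether each file is unimodal on its own. If each file already shows two peaks, the concatenation is unlikely to be the cause.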
Lastly, because we are dealing with ChIP-seq reads here, they are not completely random and contain over-represented motifs. So I assume the warnings in "Sequence duplication levels" and "Kmer content" are not relevant. Is this assumption correct?
Thanks for the hints. I will look into quality trimming. In our pipeline we keep duplicate reads but keep only uniquely mapping reads at the alignment step.
Mmm, sounds good - that's actually my preferred solution as well.
I found out that quality trimming is performed by default. I am re-running it with a higher threshold.
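To see what a higher threshold does to a read, here is a small sketch of 3' quality trimming in the BWA style, which, as far as I understand, is the approach Cutadapt (and therefore Trim Galore) uses. It assumes Phred+33 encoded qualities; the function names are made up for illustration and this is not the tools' actual code.

```python
def quality_trim_index(qualities, cutoff, base=33):
    """Return the position at which to cut the 3' end of a read.

    Walking from the 3' end, accumulate (cutoff - quality) and cut at the
    position where this running sum is largest; stop once it turns negative.
    """
    running = 0
    best = 0
    stop = len(qualities)
    for i in reversed(range(len(qualities))):
        running += cutoff - (ord(qualities[i]) - base)
        if running < 0:
            break
        if running > best:
            best = running
            stop = i
    return stop


def trim_read(seq, qualities, cutoff):
    """Trim a sequence and its quality string at the computed position."""
    stop = quality_trim_index(qualities, cutoff)
    return seq[:stop], qualities[:stop]
```

For a quality string such as `IIII5555` (four bases at Q40 followed by four at Q20), a cutoff of 20 keeps all eight bases, while a cutoff of 30 keeps only the first four.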