Question

Interpreting FastQC Results For ChIP-seq Data


12.0 years ago

David ▴ 740

Hello,

I am processing a ChIP-seq experiment that I downloaded from GEO (link). The SRA files are massive (39M sequences), so processing them took a while. Briefly, I converted the SRA files to FASTQ with fastq-dump, concatenated the two FASTQ files with cat, and ran FastQC a first time. I discovered that the reads contained an adapter.

I ran Trim Galore! to remove the adapter.

Adapter 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGTCGGATATCTCGTAT', length 50, was trimmed 13590580 times.

Then I ran FastQC on the trimmed FASTQ file and obtained the following results.

As you can see the results are not ideal.

The "per base sequence content" and "per base GC content" plots look odd. I am thinking of trimming the ends of the reads. Any comments on that are welcome.

The per sequence GC content shows a mixture distribution. Has anybody encountered that? Could it come from the two FASTQ files I concatenated (a kind of batch effect)?

Lastly, because these are ChIP-seq reads, they are not completely random and contain over-represented motifs. So I assume the warnings for "sequence duplication levels" and "Kmer content" are not relevant. Is this assumption correct?

chipseq chip-seq fastqc sequencing • 8.3k views

ADD COMMENT • link updated 12.0 years ago by Ido Tamir 5.2k • written 12.0 years ago by David ▴ 740

score 1 · Answer 1 · 2013-05-17


12.0 years ago

Jelena Aleksic ▴ 920

I agree with your conclusions. I would trim the last 10-20 or so bases: it looks a bit as if the end of an adapter is still there and hasn't been trimmed. The reads are pretty long, so you should still have plenty of data left after this. I'd also perform quality trimming to get rid of some of the lower-quality sequence (Trim Galore can do this).
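For illustration, hard-trimming a fixed number of bases from the 3' end can be sketched with plain awk. This is only a rough sketch: the function name `trim_3prime` and the demo read are made up, it assumes a strict four-lines-per-record FASTQ, and dedicated trimmers have their own options for this, so check the documentation for your version.

```shell
# Remove the last n bases (and the matching quality values) from every read.
# Usage: trim_3prime <fastq> <n>
# Assumes 4 lines per record; reads of length <= n end up empty
# and would need to be filtered out afterwards.
trim_3prime() {
    awk -v n="$2" '
        NR % 4 == 2 || NR % 4 == 0 { print substr($0, 1, length($0) - n); next }
        { print }
    ' "$1"
}

# Tiny demo record (hypothetical read, not from the GEO data set).
trim_demo=$(mktemp)
printf '@r1\nACGTACGTAC\n+\nIIIIIHHHHH\n' > "$trim_demo"
trim_3prime "$trim_demo" 4
```

The sequence and quality lines are cut by the same amount so that they stay the same length, which FASTQ parsers require.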

The sequence duplication levels actually look completely fine to me, especially for ChIP-seq data. Some people do remove duplicates, though, so you could consider that if you're worried about it. I don't know if it's bad etiquette to point people to a different help forum, but there was quite a long and in-depth discussion of duplicate removal in this thread at SEQanswers.

ADD COMMENT • link 12.0 years ago by Jelena Aleksic ▴ 920


Thanks for the hints. I will look into quality trimming. In our pipeline we keep duplicate reads but allow only unique matching at the alignment step.

ADD REPLY • link 12.0 years ago by David ▴ 740


Mmm, sounds good - that's actually my preferred solution as well.

ADD REPLY • link 12.0 years ago by Jelena Aleksic ▴ 920


I found out that the quality trimming is performed by default. I am re-running it with a higher threshold.

ADD REPLY • link 12.0 years ago by David ▴ 740

score 1 · Answer 2 · 2013-05-17


12.0 years ago

Ido Tamir 5.2k

It seems like you pasted the two read files together side by side rather than concatenating them with "cat" as you said, or the SRA conversion messed something up. Otherwise you would not get reads with a length of 250. You can also see the drop and subsequent rise in qualities at 130 (125).

I would go back to square one and start again.
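A quick way to check the read lengths without re-running FastQC is to tabulate them straight from the FASTQ file. A minimal sketch, assuming four lines per record (no wrapped sequence lines); the function name `read_length_hist` and the demo reads are made up for illustration:

```shell
# Print "count length" for each distinct read length in a FASTQ file.
# Assumes 4 lines per record, with the sequence on line 2 of each record.
read_length_hist() {
    awk 'NR % 4 == 2 { print length($0) }' "$1" \
        | sort -n \
        | uniq -c \
        | awk '{ print $1, $2 }'
}

# Tiny demo file (hypothetical reads, not from the GEO data set).
len_demo=$(mktemp)
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGT\n+\nIIII\n@r3\nACGTACGT\n+\nIIIIIIII\n' > "$len_demo"
read_length_hist "$len_demo"
```

Running this on each input file separately, before combining them, would show whether the unexpected lengths are already present after the SRA conversion or only appear in the merged file.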

ADD COMMENT • link 12.0 years ago by Ido Tamir 5.2k


Yes, the GEO page says that they used an Illumina Genome Analyzer, so the reads shouldn't be 250 bp.

ADD REPLY • link 12.0 years ago by Mikael Huss 4.8k


Thanks guys, I will look into that. How can using

cat file1.fq file2.fq > file.fq

paste reads together?

ADD REPLY • link 12.0 years ago by David ▴ 740


Maybe it was

  paste file1.fq file2.fq > file.fq

The read names would also be concatenated in this case.
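The difference between the two commands can be seen on a pair of toy files. This is just a sketch with made-up single-record FASTQ files written to a temporary directory:

```shell
# Two tiny single-record FASTQ files (hypothetical reads for illustration).
cmp_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$cmp_dir/file1.fq"
printf '@read2\nTTGG\n+\nHHHH\n' > "$cmp_dir/file2.fq"

# cat appends file2 after file1: two records, eight lines in total.
cat "$cmp_dir/file1.fq" "$cmp_dir/file2.fq" > "$cmp_dir/cat.fq"

# paste joins corresponding lines side by side (tab-separated by default):
# still four lines, each holding one line from file1 and one from file2.
paste "$cmp_dir/file1.fq" "$cmp_dir/file2.fq" > "$cmp_dir/paste.fq"

wc -l < "$cmp_dir/cat.fq"
wc -l < "$cmp_dir/paste.fq"
head -n 2 "$cmp_dir/paste.fq"
```

In the pasted file the header line holds both read names and the sequence line holds both reads, so the number of records stays the same while each line gets longer. With cat, the number of records doubles and the read lengths are unchanged.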

ADD REPLY • link 12.0 years ago by Ido Tamir 5.2k