Weird 1kb-periodic artifact in ChIP-Seq data (including input)
0
1
Entering edit mode
8.2 years ago
Ryan Thompson ★ 3.6k

I am doing a ChIP-Seq analysis of this GEO data set. I was the data analyst on the original publication for that data set, which was done on hg19. I have now re-aligned the samples to hg38, and I'm using the csaw Bioconductor package to analyze the data. (Edit: These are single-end sequecing samples.) I've used the csaw functions findMaxima and profileSites to generate an average profile of the coverage around windows that represent local maxima in coverage, and noticed an odd pattern in the resulting profile plots:

https://www.dropbox.com/s/0fb7i7ata5d5dly/site-profile-plots.pdf?dl=0

You can view the code that I used to generate these plots here. This code mostly follows the csaw user's guide with regard to usage of findMaxima and profileSites.

In some, but not all, samples, including the input samples, there is a periodic spike in the coverage every 1000 bp. These spikes all appear roughly the same width as the main spike at 0. Since this is histone ChIP-Seq data, periodicity resulting from adjacent histones is not unexpected, but this does not explain a period of exactly 1000 bp. I'm not sure whether this artifact is the result of a software error or a problem with my data. The perfect regularity of the interval seems to suggest something software-related, but the inconsistency of the height of the spikes, and the fact that they are not present in all samples, suggest that the spikes are a property of the data.

When I generated the same plots based on the original alignment to hg19, I did not see these periodic spikes at all (see here). I did see some irregular spikes that were not consistent between samples, and most of these spikes disappeared when I filtered out reads overlapping the published excludable "blacklist" regions. These tracks are not available for hg38, so I used the liftOver tool to convert them to hg38 coordinates. I found that these were not sufficient after lifting over, so I supplemented them with my own "grey list" based on excessively high-coverage regions in the input samples from my data, using this code. The plots linked above were made after filtering out all reads in either the lifted-over blacklist or my own grey list.

So, does anyone have an explanation for this artifact, or has anyone encountered a similar result before?

ChIP-Seq • 2.4k views
ADD COMMENT
1
Entering edit mode

Do you see this if you look at the raw data in IGV? My guess is that this is some weird edge-case bug in the code. If you write the maxima sites to a BED file then try computeMatrix followed by plotProfile from deepTools to see if a different tool shows the same thing.

ADD REPLY
0
Entering edit mode

Are these pair-end reads?

ADD REPLY
1
Entering edit mode

These are single-end reads.

ADD REPLY
1
Entering edit mode

Were the profiles generated strictly from the read mappings? Did you shift the reads according to mean fragment length or something?

ADD REPLY
0
Entering edit mode

The reads were counted as the 5-prime ends of 147-bp fragments (i.e. the length of 1 nucleosome worth of DNA).

ADD REPLY
0
Entering edit mode

Having never used csaw, but looking at the findMaxima function, I see that there is a range parameter and it is set to "1000" for the call. It seems awfully conincidental that your peaks in the flanking regions are also spaced at 1Kb intervals... Have you tried varying the window and range parameters to see if the same artifact shows up but with different spacing?

As an aside, it is curious that your input has the same meta-profile as the IP's.

ADD REPLY
0
Entering edit mode

I'm running findMaxima with range=5000 and profileSites with range=10000.

ADD REPLY

Login before adding your answer.

Traffic: 2102 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6