This might have more to do with assembly parameters but by visual inspection of my chip-seq data it seems like a lot reads mapping near centromere and telomeric regions. I was wondering if there was a way to remove these reads before calling peaks because I am getting a lot of false peaks at these regions.
I am using bwa to align (default settings)
Thanks
Here is a picture of what it looks like:
full image
The top two tracks are input,
The next 3 tracks are chip-seq results, red is microsattlite, next 3 are MACS bed output
Are you using a input DNA (or other) control set of reads in your peak calling tool? Quite often such regions appear in both samples, so negating each other.
I am using input DNA for peak calling but they still call the peaks, some peaks with very very high input are ignored so it seems to be working but other peaks w/high input still are called.
First, the centromeres are not included in the reference human genome assembly. Te are unclonable and unsequencable by traditional genome sequencing methods. This is why you have a blank patch of no reads mapping in the center chromosome 7 (a metacentric -- "middle center" -- chromosome). The centromere is a physical gap in the assembly and is represented as a long stretch of NNNN's.
What you are actually seeing is reads preferentially mapping to the peri-centormeric region. In many species, the pericentromeric region is enriched in transposable elements and simple sequence repeats, because these regions are gene-poor and/or have low recombination rates. Therefore, what is probably going on here is that you have reads mapping to repetitive sequences in the peri-centromeric region. Why this would happening with BWA on defaults, which should map repetitive reads randomly to one of the repeats, is a bit curious. Neverlethess, to see if this solves your problem, try filtering to remove all repetitive reads on the XT tag of the BWA output.
I added a grep -v "XT:A:R" to my pipe and it took out a bit of the noise around pericentromeric regions. removing these reads seems to do better than samtools rmdup. Using samtools -q 30 since it is low coverage i pretty much lose all my peaks but the pericentromeric region is also all gone. Hope this info helps someone too
I am expanding a little bit Caseys explanation.
You see the enrichment there because the official assembly contains them just once - where they are still divergent enough - but in the actual sequenced genome its present multiple times. Then all of these get mapped to the same location uniquely (or just once randomly - but better check your settings again), and you see this accumulation.
Since its just 23 chromosomes I would create a catalogue of regions where you see extreme enrichment in the input manually (or with IGB you can create a track above a certain treshhold) and filter the reads first by these regions before I put them into a peak finder (e.g. bedtools: intersectBed -a reads.bed -b regions.bed -v). This will ensure that reads from these regions don't screw up the statistics (e.g. background distribution, total number of reads, "highest signals").
What is strange is that your input tracks seem to have higher enrichment than your IPs. Something is wrong there (experimentally).
This is not enough. I think you have multiple technical issues there:
a) The ENCODE input looks a bit strange. But this can be just visual - from the aggregation method and display or from the mapping (repeats on/off). One would have to zoom in in some regions to check if you have amplification artifacts there. There are some enrichments that seem to get higher with Dex - which should not happen with an "input sample". But chromosome scale is difficult to interpret.
b) Did you check a positive control in your data yet? Do you see an enrichment in a region you know there should be something? Is this TF or histone? I don't really see enrichment appropriate for most TFs that I know off at this scale.
Could you elaborate more on why the ENCODE looks strange? The ENCODE track is A439 cell line w/DEX and ETOH. Since this is a low coverage preliminary run, none of the positive controls look like they have peaks but we are in the process of validating some of the peaks right now. We ran chip-seq on a non-DNA binding TF. Removing the reads with XT:A:R tag removes some of regions with high input so it is possible that the ENCODE data did not filter out the high input regions but I have not gone back to check this yet.
I am repeating myself: as pointed out, one would have to zoom in to really check how these enrichments look like, but it seems as if some of them are responding to DEX, which an input track should not. An input track can have enrichments at repetitive elements and some other regions, but here it looks like the enrichment correlates with gene dense regions. Your own IP track looks like the ideal input track, rather flat with some occasional peaks (on this scale). I think you should invest in your own input track (it does not have to be +/- DEX).
I don't know your current depht, but the first Chip-Seq papers were published with 4-5 million reads, far from saturation as we now know, but still good enough to detect many peaks. You should a) check your samples with direct PCR if you see an enrichment b) think about wheter you really expect to see an enrichment if you sequence the same samples deeper or if you rather prepare them again.
i didnt see that you replied, there are enrichment differences in sample between DEX and non-DEX treatment from encode and that is a bit troubling but I dont know what to make of it (there should be no differences) the chip-seq data i am showing is preliminary just to see if the antibody worked so low # of reads (mid-3M reads off the sequencer and low-2M map after QC and processing) we are in the process of checking some of the reads w/chip-qpcr and pending the sucess of that we plan on running more reads (30-40M range)
How do you know the peaks are false? Have you tried playing with the -n parameter of bwa samse? (i.e. choose -n 1)
Are you using a input DNA (or other) control set of reads in your peak calling tool? Quite often such regions appear in both samples, so negating each other.
I am using input DNA for peak calling but they still call the peaks, some peaks with very very high input are ignored so it seems to be working but other peaks w/high input still are called.