I'm trying to recreate an analysis of ChIP-seq data from SRP051788.
I used bowtie2 for mapping and then I used HOMER to generate peak files and am using these to annotate using the hg19 genome using the following command:
annotatePeaks.pl peaks.txt hg19 > annotated_peaks.txt
It runs fine but when I look at the file there are no annotations and no gene IDs and I can't figure out why.
Here is the first couple lines of a peak file I used for annotation:
# HOMER Peaks
# Peak finding parameters:
# tag directory = Bcatenin_WNT3a/
#
# total peaks = 14292
# peak size = 200
# peaks found using tags on both strands
# minimum distance between peaks = 400
# fragment length = 156
# genome size = 2000000000
# Total tags = 15622676.0
# Total tags in peaks = 1025145.0
# Approximate IP efficiency = 6.56%
# tags per bp = 0.007238
# expected tags per peak = 1.448
# maximum tags considered per bp = 1.0
# effective number of tags used for normalization = 10000000.0
# Peaks have been centered at maximum tag pile-up
# number of putative peaks = 14620
#
# input tag directory = H1_Input/
# Fold over input required = 4.00
# Poisson p-value over input required = 1.00e-04
# Putative peaks filtered by input = 214
#
# size of region used for local filtering = 10000
# Fold over local region required = 4.00
# Poisson p-value over local region required = 1.00e-04
# Putative peaks filtered by local signal = 114
#
# Maximum fold under expected unique positions for tags = 2.00
# Putative peaks filtered for being too clonal = 0
#
# cmd = findPeaks Bcatenin_WNT3a/ -style factor -size 200 -tagThreshold 30 -o auto -i H1_Input/
#
# Column Headers:
#PeakID chr start end strand Normalized Tag Count focus ratio findPeaks Score Total Tags (normalized to Control Experiment) Control Tags Fold Change vs Control p-value vs Control Fold Change vs Local p-value vs Local Clonal Fold Change
GL000220.1-1 GL000220.1 134369 134569 + 2451.6 0.854 374.000000 1985.5 44.0 45.12 0.00e+00 19.91 0.00e+00 0.54
GL000220.1-2 GL000220.1 124783 124983 + 905.7 0.760 370.000000 733.5 44.0 16.67 0.00e+00 9.49 0.00e+00 0.54
2-1 2 171571590 171571790 + 338.0 0.860 275.000000 273.7 0.5 547.43 0.00e+00 26.05 0.00e+00 0.67
1-2 1 33202423 33202623 + 275.2 0.806 246.000000 222.9 0.5 445.82 0.00e+00 83.94 0.00e+00 0.71
4-1 4 140545690 140545890 + 274.0 0.864 246.000000 221.9 0.5 443.75 0.00e+00 50.41 0.00e+00 0.70
I've tried adding 'chr' to the start of the chrom start column since it looked like that was the way the annotation file was set up but that didn't fix the problem. I've also tried to delete the header since I read that might fix the problem. a problem with peak annotation after ChIP-seq
I'll try some other annotation programs and to get peak files from MACS2 but since I am new to doing ChIP-seq analysis I was hoping to recreate some other researchers' pipelines before I get my own data back for analysis, so I'd like to try to get HOMER to work also.
reran findPeaks without specifying -size or -tagThreshold 30
Then I reran annotatePeaks and still got no annotation/genes:
It's this study, right? - https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE64758
Just to check: are you sure that you're not processing one of the GRO-seq samples with the ChIP pipeline? I can see that you were following the Supplementary Methods, too.
GL000220.1 is just a relatively short human contig, by the way, relating to ribosomal RNA I believe. I don't know why it comes up here.
You may have to paste your full pipeline in order for someone to pick up what may have gone wrong here.
Yes that's correct, that is the study I'm looking at!
No I'm not using the GRO-seq samples. I've only downloaded the following TF ChIP-seq fastq files for analysis, just trying to recreate what was already done (and input):
SRR2037028 SRR2037029 SRR1745491 SRR1745492 SRR1745492 SRR1745493 SRR174594
Sorry, I'm since I'm pretty new to this type of analysis could you please clarify by what you mean by my full pipeline? This would be my guess:
I've just used the supplemental methods as my guide basically. Bowtie2 for mapping, HOMER findPeaks to call peaks with default -style factor parameters except with 200 bp peaks and peaks with 30 or more reads. Then I tried to use annotatePeaks but haven't been able to get any annotations or genes in my annotated peak file.
Oh, my apologies, by pipeline, I just meant to paste the actual commands that you have used from the very start up to the
annotatePeaks
command. Nothing new, by the way, the following the HOMER procedure can be tricky because it's such a comprehensive tool.Edit: that includes how you install cache for the reference genome of interest, if you have that (hg19)
I made the index for hg19 awhile ago so I don't really remember how I did that. Do you think that could be the problem?
I forgot to mention that I put the bam file into IGV and the peaks look the same as the bigwig files that the authors posted.
Oh, hang on, I am not sure if this will resolve the problem but can you add the following steps after your alignment step:
The PCR duplicate steps are most likely not necessary just for testing; however, sorting your SAM file, indexing it, and saving as BAM cannot do any harm and may resolve the issue.
You should probably do some FASTQ trimming prior to alignment, too, but it's not entirely necessary. It's possible that the FASTQ files given by the authors are already QC'd.
Thanks for the quick reply. Just to clarify: after sorting and removing PCR duplicates I should use the sorted bam files to make the tag directories to call peaks against?
Yep, that's right. These kind of intermediary steps are not typically mentioned in manuscript methods. I'm not sure if HOMER requires the sorted BAM but it could just be what is causing the problem. Your alignment metrics otherwise look great.
Also just to confirm: you should process all samples (including the input controls) in this way, unfortunately.
For now, the duplicates step really isn't necessary just to figure out if this is the issue, so, you could easily just do:
I've just tried creating tag directories with the sorted and indexed bam files and using these to call peaks using findPeaks, then used the peaks file to annotate using annotatePeaks but have the same problem, none of the peaks are being annotated.
Okay, the only other glaring omission is the indication in the methods by the authors to only use 'uniquely mapped' reads. Bowtie2 does not have an implicit parameter that allows for the inclusion of just "uniquely mapped" reads; however, you can virtually ensure unique mapping by setting the MAPQ threshold high, as also recommended by Devon here: How to extract unique mapped results from Bowtie2 bam results? (see also here: Mapping quality: higher = more unique)
If there are still issues, you may consider contacting the author, whose email is on the GEO page (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE64758)
I tried to set the MAPQ thereshold high for the mapping as you recommended and go through the same pipeline but again there were no annotations.
I went back again and tried to change the chromosome column from the Homer peak file (column 2) from just "numbers" to chr"numbers" as it is in the annotation files and it worked. I thought I had tried this before and it didn't work but I must have done something differently this time. Thanks for all your help.
In case anyone else runs into this problem I wanted to include the commands I used to change my Homer peak file (adapted from https://unix.stackexchange.com/questions/148114/how-to-add-words-to-an-existing-column):
lol - okay, that would have been the first thing that I'd have asked (adding 'chr'), but you mentioned how you had tried it. Either way, glad that you got it sorted out.
The same happens to me! Just realized the bed file chromosomes do not have "chr". Thanks for figuring this out!