Hello everyone,
I am trying to do the same experiment on a sample firstly, wholly and then on a specific chromosome location of the sample.
My peak calling code is;
macs2 callpeak -t input.bam -c control.bam -g hs -f BAM --bdg --outdir /directory -n peakcall_samplename
I run it on my regular .bam files. Thank I run this code on the same .bam files(both sample and input);
First, index;
samtools index era_nt_filtered_sorted.bam.bam
Second, extract chromosome region;
samtools view -b -h input.bam chr1:125000000-249250621 > input_1q.bam
And then again;
macs2 callpeak -t input_1q.bam -c control_1q.bam -g hs -f BAM --bdg --outdir /directory -n peakcall_samplename_1q
However, when I count the peaks(wc -l .narrowPeak), I see that I lost almost 25% of the peaks in the file that has less peaks and 45% of the peaks in the file that has 40x more peaks than the first one, normally. Why could be the reason for that?
Thank you so much in advance.
Your sentence about the file that has fewer peaks or more peaks is hard to make sense of. However, something to be aware of is that in the original MACS paper (describing MACS version 1), MACS does a global background estimate using the number of reads per chromosome from the input sample. If you're only giving reads for a portion of a given chromosome, it seems to me this estimate would be way off. Have you tried comparing the full analysis to an analysis with ALL of chromosome 1?
Thank you for the reply. So you mean that If I extract only 1 chomosome which has the portion I want to analyze, it could be healthier?
But how could I then use those files in downstream analyses?
Thank you again.
What is the rationale for limiting what portion of your data set MACS2 can use to calculate peaks? I can't think of a good reason to obscure your view of a data set in this way (except maybe brute force masking some awful portion of a novel genome?). Anyway, I would recommend giving MACS the entire data set, and if you want to move forward with peaks from only a certain region, use or learn tools for that. For instance, the .narrowPeak file is just a BED file, you can extract peaks in your region of interest using bedtools. Or you could use the GRanges framework in R to perform all your genomic interval manipulations.