MACS2 does not giving same result before and after chromosome extraction from bam file
0
0
Entering edit mode
3.5 years ago
buffealo ▴ 130

Hello everyone,

I am trying to do the same experiment on a sample firstly, wholly and then on a specific chromosome location of the sample.

My peak calling code is;

macs2 callpeak -t input.bam -c control.bam -g hs -f BAM --bdg --outdir /directory -n peakcall_samplename

I run it on my regular .bam files. Thank I run this code on the same .bam files(both sample and input);

First, index;

samtools index era_nt_filtered_sorted.bam.bam

Second, extract chromosome region;

samtools view -b -h input.bam chr1:125000000-249250621 > input_1q.bam

And then again;

macs2 callpeak -t input_1q.bam -c control_1q.bam -g hs -f BAM --bdg --outdir /directory -n peakcall_samplename_1q

However, when I count the peaks(wc -l .narrowPeak), I see that I lost almost 25% of the peaks in the file that has less peaks and 45% of the peaks in the file that has 40x more peaks than the first one, normally. Why could be the reason for that?

Thank you so much in advance.

chromosome samtools extract peakcalling macs2 • 1.2k views
ADD COMMENT
2
Entering edit mode

Your sentence about the file that has fewer peaks or more peaks is hard to make sense of. However, something to be aware of is that in the original MACS paper (describing MACS version 1), MACS does a global background estimate using the number of reads per chromosome from the input sample. If you're only giving reads for a portion of a given chromosome, it seems to me this estimate would be way off. Have you tried comparing the full analysis to an analysis with ALL of chromosome 1?

ADD REPLY
0
Entering edit mode

Thank you for the reply. So you mean that If I extract only 1 chomosome which has the portion I want to analyze, it could be healthier?

But how could I then use those files in downstream analyses?

Thank you again.

ADD REPLY
0
Entering edit mode

What is the rationale for limiting what portion of your data set MACS2 can use to calculate peaks? I can't think of a good reason to obscure your view of a data set in this way (except maybe brute force masking some awful portion of a novel genome?). Anyway, I would recommend giving MACS the entire data set, and if you want to move forward with peaks from only a certain region, use or learn tools for that. For instance, the .narrowPeak file is just a BED file, you can extract peaks in your region of interest using bedtools. Or you could use the GRanges framework in R to perform all your genomic interval manipulations.

ADD REPLY

Login before adding your answer.

Traffic: 1954 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6