I would like to know which is a good criteria to filter MACS called peaks to keep the most significant ones. I have aprox. 40000 peaks and I already filtered by a %FDR of 0.05 so I've been trying to filter by enrichment. I have tried to keep peaks with an enrichment e, 5 < e < 50, as I read those were the recommended values, though I still have what might be considered as too many peaks for my experiment. Is there another criteria I could use to filter peaks? or, can I be a little more restrictive with the filtering by enrichment?
If this is whole genome sequencing..... 40K peaks is not that much tbh. Visualise your peaks in IGV and you'll see they'll be pretty sparse. Ive recently used macs2 using a q value threshold of 0.001 - some samples still have 100K peaks.
Recommended values might not work for you as it worked for others. MACS2 is not out as a paper so optimal settings that are proposed is still a trial work. Benchmarking papers are out using it but mostly use default settings. If you have a particular setting of experimental design it should be dependent on your data. For q value or FDR MACS2 advised as 0.01 and the FE column you get in the output can be another threshold to use for. But all these are arbitrary. If you are only using MACS2 and not using other tools then best will be to filter the peaks with 0.01 FDR and among them plot the distribution for the FE (fold enrichment) and then select the ones that are above certain thresholds. Let's say the 1st quartile maybe if you get enough peaks with this kind of filter.
The idea of blacklist region is also advisable if it's there for your genome. It also depends on what your downstream analysis are concerned with. Good luck!
That's exactly the point. That might be a typical number of peaks for several experiments, but not for mine (at least I think so, according to the literature I have read).
I had already plotted the FE distribution, though I had a hard time deciding which cut off to apply. Now, I think I will use Ian's suggestion about known binding controls.
Thanks!
post removal of blacklist region and calling peaks with q value 0.01 if you have too many peaks. You can plot the dist. of FE> As you did. Now be it above the median or the top quartile or above the Q1 you will be able to know the number of peaks falling in that. You can decide based on that. Just plot for each cut-off number of peaks you receive and it will give you the score. Remember the FE in macs2 is log10 if am not wrong and to select higher the value in the MACS2 output refers to more enriched peaks so you should apparently select the ones having the highest FE here. I am taking into consideration these FE comes from peak calling of treated vs control and not differential peaks. Then the criterion will change. Hope its clear now! :)
Have you considered that there are many binding regions for the protein you are investigating, some factors are ubiquitous. The FDR cut off sounds reasonable, although MACS2 suggests a q-value of 0.01 initially. I usually use FE >=10 as a cut off, but it is arbitrary. The best way is to observe some known binding controls to see if they fall above or below your thresholds. You can also take your filtered results and filter against blacklist regions (if available for your genome). You might also consider the conservation of sequence under the region, presence of repetitive sequence, for example. Initially there is no harm in being very conservative with you cut offs (high FDR/FE), especially if your next step is motif discovery. You can then expand the set later.
I would suggest moving over to MACS2 if you haven't already.
I like the idea of using controls, I will try that to decide which threshold is appropiate. As for blacklisted regions, even though I made a QC with ChIPQC and only 3-4% of peaks where in blacklisted regions I filtered them out as well.
Thank you for your suggestions, they have been very helpful!
If this is whole genome sequencing..... 40K peaks is not that much tbh. Visualise your peaks in IGV and you'll see they'll be pretty sparse. Ive recently used macs2 using a q value threshold of 0.001 - some samples still have 100K peaks.
Recommended values might not work for you as it worked for others. MACS2 is not out as a paper so optimal settings that are proposed is still a trial work. Benchmarking papers are out using it but mostly use default settings. If you have a particular setting of experimental design it should be dependent on your data. For q value or FDR MACS2 advised as 0.01 and the FE column you get in the output can be another threshold to use for. But all these are arbitrary. If you are only using MACS2 and not using other tools then best will be to filter the peaks with 0.01 FDR and among them plot the distribution for the FE (fold enrichment) and then select the ones that are above certain thresholds. Let's say the 1st quartile maybe if you get enough peaks with this kind of filter.
The idea of blacklist region is also advisable if it's there for your genome. It also depends on what your downstream analysis are concerned with. Good luck!
That's exactly the point. That might be a typical number of peaks for several experiments, but not for mine (at least I think so, according to the literature I have read). I had already plotted the FE distribution, though I had a hard time deciding which cut off to apply. Now, I think I will use Ian's suggestion about known binding controls. Thanks!
post removal of blacklist region and calling peaks with q value 0.01 if you have too many peaks. You can plot the dist. of FE> As you did. Now be it above the median or the top quartile or above the Q1 you will be able to know the number of peaks falling in that. You can decide based on that. Just plot for each cut-off number of peaks you receive and it will give you the score. Remember the FE in macs2 is log10 if am not wrong and to select higher the value in the MACS2 output refers to more enriched peaks so you should apparently select the ones having the highest FE here. I am taking into consideration these FE comes from peak calling of treated vs control and not differential peaks. Then the criterion will change. Hope its clear now! :)
Yes, FE come from peak calling. Thank you very much!