Can anybody guide me if there is a standard threshold for DNAse-seq accessibility (ON and OFF regions) based on peak tag density?
I am analyzing Roadmap's DNAse-seq data and for the final annotation of genes I have to differentiate genes between accessible and not-accessible. In processed DNAse-seq data from Roadmap the columns with peak tag density is mentioned but nowhere in the publication and supporting material, I found any information about the cutoff.
The short answer is: there is no standard threshold.
Because these types of experiments are performed on a population of cells at one time point, the data represents a single snapshot result of a very complex system. In theory, chromatin should only exist in two states: open or closed. But in practice, there are lots of sites which are half-open. This could mean that they are open in only half the cells in the population, or that they are bound by a transcription factors with different binding affinities, or a whole host of other unknown factors.
I've personally found that the best way to validate a set of regions is to do it experimentally. For example, generate a list of DNAse HS sites, sort by descending tag-counts and experimentally test points on the scale until I find the threshold where validation begins to fail.
The quick and dirty way would be to rank by tag-counts and set an arbitrary cut-off value that pleases you. This is often a sliding scale which is context-dependent. It's a widely accepted practice.
Thanks for the answer, can you point me out towards a specific study using some cutoff for this purpose?
Secondly, "sort by descending tag-counts and experimentally test points on the scale until I find the threshold where validation begins to fail" in this case how would you experimentally test the points? by relevant Chip-seq data??
*For DNase-seq data... MACSv2.0.10 was also used to call narrow peaks using the same settings specified above for the histone mark narrow peak calling.
Narrow peaks and broad domains were also generated for the unconsolidated, 36-bp mappability filtered histone mark ChIP-seq and DNase-seq Release 9 data sets using MACSv2.0.10 with the same settings as specified above.
Peak calling. For the histone ChIP-seq data, the MACSv2.0.10 peak caller was used to compare ChIP-seq signal to a corresponding whole-cell extract (WCE) sequenced control to identify narrow regions of enrichment (peaks) that pass a Poisson P value threshold 0.01, broad domains that pass a broad-peak Poisson P value of 0.1 and gapped peaks which are broad domains (P < 0.1) that include at least one narrow peak (P < 0.01) (https://github.com/taoliu/MACS/)32. Fragment lengths for each data set were pre-estimated using strand cross-correlation analysis and the SPP peak caller package (https://code.google.com/p/phantompeakqualtools/)37 and these fragment length estimates were explicitly used as parameters in the MACS2 program (–shift-size = fragment_length/2).*
Here's a paper where the authors attempted to validate DNAse HS sites by RT-PCR:
Many thanks for replying. Sorry for my naivety. I have read this roadmap paper but I cant understand where they are mentioning that we selected a threshold (let's say tag density > 50) for defining DNAse-accessible regions !! I am considering narrowPeaks from MACS 2.0 proccessed DNAse-seq (available here) and I have different peaks with different tag density, now what threshold I select to call those regions as accessible or not accessible, based on the peak tag density.
Secondly, in cases where you get more that 1 mappings to a same gene with different tag counts, in such cases do people take average of all or just keep the one with highest tag density?
In that case, they haven't used tag-density per se as the cutoff. They've used a P-Value cutoff. This calculation is related to tag density but includes other factors.. I'd suggest you follow the P-Value cutoff if you're going to use an arbitrary cutoff score, rather than tag density.
The (-log10) P-values are in column 8 of the narrow peak file.
Secondly, in cases where you get more that 1 mappings to a same gene with different tag counts (or p-values), in such cases do people take average of all or just keep the one with highest tag density (or lowest p-value)?
The column 8 you mentioned for -log10 p-value, in one of the cases contains values ranging from 197.2 to 2.0, I am not getting what does it means. Is it that all these peaks are below threshold of 0.01 or it is something else I have to do?
In that peak file there are 4 mappings (different nearby regions annotated against same gene) against one gene with different p-values. Which one should I consider? one with the lowest p-value or I should take a mean of all p-values?
I have total ~56000 peaks having p-values ranging from 197.2 to 2.0, How will I set the threshold of FDR 0.01 for differentiating between accessible and un-accessible genes?
Thanks for the answer, can you point me out towards a specific study using some cutoff for this purpose? Secondly, "sort by descending tag-counts and experimentally test points on the scale until I find the threshold where validation begins to fail" in this case how would you experimentally test the points? by relevant Chip-seq data??
Here's an example from the methods section of a RoadMap paper (http://www.nature.com/nature/journal/v518/n7539/full/nature14248.html#methods) with arbitrary cut-off values.
*For DNase-seq data... MACSv2.0.10 was also used to call narrow peaks using the same settings specified above for the histone mark narrow peak calling.
Narrow peaks and broad domains were also generated for the unconsolidated, 36-bp mappability filtered histone mark ChIP-seq and DNase-seq Release 9 data sets using MACSv2.0.10 with the same settings as specified above.
Peak calling. For the histone ChIP-seq data, the MACSv2.0.10 peak caller was used to compare ChIP-seq signal to a corresponding whole-cell extract (WCE) sequenced control to identify narrow regions of enrichment (peaks) that pass a Poisson P value threshold 0.01, broad domains that pass a broad-peak Poisson P value of 0.1 and gapped peaks which are broad domains (P < 0.1) that include at least one narrow peak (P < 0.01) (https://github.com/taoliu/MACS/)32. Fragment lengths for each data set were pre-estimated using strand cross-correlation analysis and the SPP peak caller package (https://code.google.com/p/phantompeakqualtools/)37 and these fragment length estimates were explicitly used as parameters in the MACS2 program (–shift-size = fragment_length/2).*
Here's a paper where the authors attempted to validate DNAse HS sites by RT-PCR:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1356136/
There are probably newer, better methods out there.
Many thanks for replying. Sorry for my naivety. I have read this roadmap paper but I cant understand where they are mentioning that we selected a threshold (let's say tag density > 50) for defining DNAse-accessible regions !! I am considering narrowPeaks from MACS 2.0 proccessed DNAse-seq (available here) and I have different peaks with different tag density, now what threshold I select to call those regions as accessible or not accessible, based on the peak tag density. Secondly, in cases where you get more that 1 mappings to a same gene with different tag counts, in such cases do people take average of all or just keep the one with highest tag density?
In that case, they haven't used tag-density per se as the cutoff. They've used a P-Value cutoff. This calculation is related to tag density but includes other factors.. I'd suggest you follow the P-Value cutoff if you're going to use an arbitrary cutoff score, rather than tag density.
The (-log10) P-values are in column 8 of the narrow peak file.
Secondly, in cases where you get more that 1 mappings to a same gene with different tag counts (or p-values), in such cases do people take average of all or just keep the one with highest tag density (or lowest p-value)?
The column 8 you mentioned for -log10 p-value, in one of the cases contains values ranging from 197.2 to 2.0, I am not getting what does it means. Is it that all these peaks are below threshold of 0.01 or it is something else I have to do?
I'm not quite sure what you mean by more than 1 mapping. Could you elaborate?
That value means 10^(-197.2).