When I'm working with mouse ChIP-seq data, I normally remove mapped reads that overlap the ENCODE blacklist regions. Previously there was no blacklist for the mm10 assembly, so I would lift over the coordinates from the mm9 assembly (as described in the F1000 csaw article), which gives 3,010 regions. However, ENCODE recently released a blacklist for the mm10 assembly, and it contains only 164 regions. I contacted ENCODE to ask why, and this was their response:
"LiftOver is not a good strategy for transferring blacklists across assemblies. Note that the blacklists are regions that show artifacts due to deficiencies in the genome assembly (e.g. unannotated repeats). So with a better assembly a region that was previously a blacklist won't be one any more. GRCh38 and mm10 have fewer detectable artifacts compared to GRCh37 and mm9 respectively because they are better and more complete assemblies, e.g. repeats near centromeres and telomeres are better annotated. Hence the fewer regions. This blacklist release is also a first pass for mm10 and GRCh38 with minimal manual curation. We will be releasing additional refined versions in the future that may capture additional regions."
This made me doubt the advice given in the csaw paper, and my usual processing steps. I've always followed Heng Li's advice to map to an unmasked genome, then remove the lifted-over ENCODE blacklist regions. Should I instead use ENCODE's official mm10 blacklist, and also remove predicted repeat regions from the UCSC genome annotation?
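As a side note on mechanics: the blacklist filter itself is just an interval-overlap test against a BED-style region set. Below is a minimal Python sketch of that test with toy coordinates (my own helper names; in practice I'd use `bedtools intersect -v` or csaw's `readParam(discard=...)` rather than rolling my own):

```python
# Toy sketch: drop reads whose alignment overlaps any blacklist interval.
# Coordinates are 0-based, half-open, as in BED files.
from bisect import bisect_right

def build_index(blacklist):
    """Group (chrom, start, end) intervals by chromosome and sort by start."""
    index = {}
    for chrom, start, end in blacklist:
        index.setdefault(chrom, []).append((start, end))
    for chrom in index:
        index[chrom].sort()
    return index

def overlaps_blacklist(index, chrom, start, end):
    """True if the read interval [start, end) overlaps any blacklisted region."""
    ivals = index.get(chrom, [])
    # Only intervals starting before `end` can possibly overlap the read.
    i = bisect_right(ivals, (end, float("inf")))
    return any(s < end and e > start for s, e in ivals[:i])

blacklist = [("chr1", 100, 200), ("chr2", 50, 80)]
idx = build_index(blacklist)
print(overlaps_blacklist(idx, "chr1", 150, 160))  # True
print(overlaps_blacklist(idx, "chr1", 300, 400))  # False
```

A read is kept only when `overlaps_blacklist` returns False; the same test applies whether you filter reads or, as discussed below, filter called peaks instead.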
Is there any reason you keep the mapped reads instead of filtering them out at the earliest opportunity? According to the article by Carroll et al., removing blacklist regions can help improve cross-correlation analysis, etc.
It's faster to remove the peaks than to remove the reads. Our QC programs are blacklist-aware, so we don't need to bother removing the reads for that, either. As an aside, we find strand cross-correlation largely worthless.
Thank you for your replies. To your point, I usually use cross-correlation to infer the fragment size, then use the calculated value for peak calling and for deepTools analyses that accept a fragment-size setting. In your opinion, is there a more accurate method to infer fragment size, or does it have so little effect on the results that it's not important to get exactly right?
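For anyone following along, the idea behind the cross-correlation estimate is that the 5' ends of plus- and minus-strand reads flank each binding site, so the shift that best aligns the two strand profiles approximates the fragment length. A toy sketch of that idea (synthetic positions and my own function name; real tools such as phantompeakqualtools compute this genome-wide from coverage):

```python
# Hedged sketch of strand cross-correlation for fragment-size inference.
def best_shift(plus_pos, minus_pos, max_shift):
    """Return the shift (0..max_shift) with the most plus/minus 5'-end matches."""
    minus = set(minus_pos)
    scores = []
    for shift in range(max_shift + 1):
        # Count plus-strand 5' ends that land on a minus-strand 5' end
        # after shifting downstream by `shift`.
        score = sum(1 for p in plus_pos if p + shift in minus)
        scores.append((score, shift))
    # Highest score wins; ties resolved toward the smaller shift.
    return max(scores, key=lambda t: (t[0], -t[1]))[1]

# Simulate reads from 200 bp fragments centred on two binding sites.
frag_len = 200
sites = [1000, 5000]
plus_pos = [s - frag_len // 2 for s in sites]    # fragment 5' ends (+ strand)
minus_pos = [s + frag_len // 2 for s in sites]   # fragment 3' ends (- strand)
print(best_shift(plus_pos, minus_pos, 400))  # 200
```

On real data the correlation profile is noisy (with the well-known "phantom peak" at the read length), which is presumably why the estimate is only approximate, and why an exact value matters little for peak calling.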
I wouldn't worry about it being exactly correct; you could just use your Bioanalyzer output.
For data generated by collaborators, that's my preferred method. Unfortunately, the Bioanalyzer output isn't always provided with public data or the accompanying article.
Can you share the DKFZ list you mentioned? I've looked at the initial mm10 release and the mm9 release (lifted over to mm10), and there are quite a lot of high-signal regions not covered by the initial mm10 release. Right now I'm just using both the mm9 and mm10 releases, but this blacklists ~1,000 regions.
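In case it's useful to anyone doing the same, combining the two releases is just an interval union: concatenate the BED entries, sort, and merge anything that overlaps or touches. A minimal sketch (my own code, not an ENCODE tool; equivalent to `cat a.bed b.bed | sort -k1,1 -k2,2n | bedtools merge`):

```python
# Merge two or more BED-style blacklists (0-based, half-open intervals).
def merge_blacklists(*lists):
    """Union of (chrom, start, end) intervals, merged per chromosome."""
    all_ivals = sorted(iv for lst in lists for iv in lst)
    out = []
    for chrom, start, end in all_ivals:
        if out and out[-1][0] == chrom and start <= out[-1][2]:
            # Overlapping or adjacent: extend the previous interval.
            prev = out[-1]
            out[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            out.append((chrom, start, end))
    return out

mm10 = [("chr1", 100, 300)]
mm9_lifted = [("chr1", 250, 400), ("chr2", 0, 50)]
print(merge_blacklists(mm10, mm9_lifted))
# [('chr1', 100, 400), ('chr2', 0, 50)]
```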
Let me first double-check that that's what people are still using internally. If so, I guess I'll email it to you, since I haven't a clue whether they want it spread around or not.
I can email it to you from the office tomorrow (not sure whether the DKFZ folks want it distributed widely or not). What's a good email address to use for you (I assume firstname.lastname@ed.ac.uk would work)?
Great, many thanks (actually s1437643@sms.ed.ac.uk would be better)
Would you mind sharing the DKFZ blacklist with me as well? Many thanks. myusername[at]ymail.com
Public versions now exist.