Question

DNase data for GRCh38

3

8.5 years ago

Adam Przedniczek ▴ 30

The question below turned out to be completely faulty. I don't have to do anything with DNase data for GRCh38. I asked it because of the file count difference between hg38 and hg19 (GRCh37), which I thought was too big. For hg38 there are 95 *Peak.txt.gz files. For hg19 there are 236 *narrowPeak.gz files, but after merging the PkRep1 & PkRep2 pairs (probably FASTQ (SE/PE) replicates) we get only 123 files. So this difference (123 vs. 95) no longer seems big, and with hg38 the situation is even cleaner, without PkRep1 & PkRep2.
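The replicate merging described above can be sketched in a few lines of Python. This is only an illustration: the file names are hypothetical examples that follow the UW naming pattern, not the real directory listing, and `group_replicates` is a helper name I made up.

```python
import re
from collections import defaultdict

# Hypothetical file names in the UW style; NOT the real directory listing.
files = [
    "wgEncodeUwDnaseA549PkRep1.narrowPeak.gz",
    "wgEncodeUwDnaseA549PkRep2.narrowPeak.gz",
    "wgEncodeUwDnaseGm12878PkRep1.narrowPeak.gz",
    "wgEncodeUwDnaseGm12878PkRep2.narrowPeak.gz",
    "wgEncodeUwDnaseHepg2PkRep1.narrowPeak.gz",
]

# Matches the replicate tag (PkRep1, PkRep2, ...) just before the suffix.
REPLICATE = re.compile(r"PkRep\d+(?=\.narrowPeak\.gz$)")


def group_replicates(names):
    """Group file names by sample, ignoring the PkRepN replicate tag."""
    groups = defaultdict(list)
    for name in names:
        sample = REPLICATE.sub("", name)
        groups[sample].append(name)
    return dict(groups)


groups = group_replicates(files)
print(len(files), "files ->", len(groups), "samples")
```

Run against the real listing, the number of groups is the figure to compare with the hg38 file count.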

Once again: there's no problem with the DNase data for the GRCh38 assembly; only my question was misleading. I'd like to apologise for the confusion I introduced.

I'm interested in transcriptional activity, so I want to use DNase hypersensitivity sites to detect regions where transcription factors are able to bind. With the previous genome assembly, GRCh37 / hg19, I used narrow peak files (suffix .narrowPeak.gz) from these two sources (University of Washington and Duke University, respectively):

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeUwDnase/

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeOpenChromDnase/

For the most recent assembly, GRCh38, there are also some annotations available (files ending in Peak.txt.gz): http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/

And here four related questions arise:

Consider only the datasets that come from the University of Washington. For GRCh37 / hg19 I counted 236 narrow peak files, whereas for the newer GRCh38 there are only 95 files. How can this difference be explained? Do the datasets represent exactly the same coverage, but with much lower granularity / precision (datasets from several tissue lines merged into fewer files)?

With GRCh37 / hg19 we have both narrow peaks and broad peaks, whereas GRCh38 comes with only one type of file, *Peak.txt.gz. Does this mean that with the newest version we have only narrow peaks? Are the broad peaks hidden somewhere else?

With GRCh37 / hg19 we have two separate sources of DNase data: UofW and Duke. For GRCh38, it seems that only UofW datasets are available. Is any other source of DNase data available, perhaps stored separately (Duke or another lab)?

Suppose you were in my place and wanted to determine cis-regulatory regions. What type of data could be used to do so? Maybe DNase datasets from another source, or even a completely different type of data (NOT DNase)?

~~Thank you in advance for your answer.~~

dnase open chromatin narrow peak broad peak • 2.3k views

1

8.5 years ago

Denise CS ★ 5.2k

Ensembl has mapped the DNase1 data from ENCODE (in addition to Blueprint and Roadmap epigenomics) to GRCh38 as part of the Ensembl Regulatory Build. The data is available from the Ensembl FTP and Perl API.

score 2 · Accepted Answer · 2016-06-03

There's an awful lot of ENCODE data so I'm not 100% sure of this answer but I'll have a crack.

The first phase of ENCODE finished around 2012, so all the data were mapped onto GRCh37/hg19 (2009). I believe most of the first-wave data were generated in human cell lines.

GRCh38 was released in 2013, so I'm guessing that the second wave of ENCODE data (primary tissue), currently in progress, is being mapped onto the 2013 (GRCh38) release.

That would mean that the bulk of the samples on GRCh37 and GRCh38 are not the same samples.

It is entirely possible to convert data currently on GRCh37 to GRCh38, but I don't know if ENCODE is doing this. That would involve remapping onto GRCh38 instead of GRCh37 and re-running the analyses.

If you want to convert peak files, you can always use a liftover tool to put all datasets onto the build of your choice.
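As a rough sketch of the liftover route, the Python below reduces a narrowPeak file to plain BED and builds a call to the UCSC `liftOver` command-line tool. It assumes you have the `liftOver` binary on your PATH and a chain file such as UCSC's `hg19ToHg38.over.chain.gz`. The helper names and output file names are my own, not part of any library.

```python
import gzip
import subprocess


def narrowpeak_to_bed4(lines):
    """Yield 'chrom<TAB>start<TAB>end<TAB>name' lines from narrowPeak lines."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and optional track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name = line.split("\t")[:4]
        yield f"{chrom}\t{start}\t{end}\t{name}"


def liftover_command(bed_in, chain, bed_out, unmapped):
    """Build the argument list: liftOver oldFile map.chain newFile unMapped."""
    return ["liftOver", bed_in, chain, bed_out, unmapped]


def lift_peaks(narrowpeak_gz, chain, prefix):
    """Write a BED4 copy of the peaks and lift it with the external tool."""
    bed_in = f"{prefix}.hg19.bed"  # hypothetical output naming
    with gzip.open(narrowpeak_gz, "rt") as src, open(bed_in, "w") as dst:
        for bed_line in narrowpeak_to_bed4(src):
            dst.write(bed_line + "\n")
    cmd = liftover_command(
        bed_in, chain, f"{prefix}.hg38.bed", f"{prefix}.unmapped.bed"
    )
    # Requires the liftOver binary; raises CalledProcessError on failure.
    subprocess.run(cmd, check=True)
```

Something like `lift_peaks("sample.narrowPeak.gz", "hg19ToHg38.over.chain.gz", "sample")` would then write the lifted peaks plus a file of intervals that could not be mapped. Dropping the signal columns keeps the sketch simple; keep them if you need the scores downstream.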

Just make sure that the samples aren't duplicates.