Question

How to download all the CpG island data for hg38 or hg19 from UCSC?

6

8.3 years ago

winjorchen ▴ 60

Hi friends: How can I download all the CpG island data for hg38 or hg19 from UCSC? Is there a CpG island database? Thanks.

genome alignment sequence next-gen • 16k views

updated 3.5 years ago by arturtjaro ▴ 50 • written 8.3 years ago by winjorchen ▴ 60

score 10 · Answer 1 · 2017-02-11

10

8.3 years ago

Alex Reynolds 36k

For hg19, you can grab the cpgIslandExt table from UCSC's goldenpath service, and use BEDOPS sort-bed to build a sorted BED4+ file:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/cpgIslandExt.txt.gz \
   | gunzip -c \
   | awk 'BEGIN{ OFS="\t"; }{ print $2, $3, $4, $5$6, substr($0, index($0, $7)); }' \
   | sort-bed - \
   > cpgIslandExt.hg19.bed

Derived from the table schema for this file, the first four columns are the island's genomic interval and name. The remaining columns are island length, number of CpGs in the island, the number of C and G in the island, the percentage of island that is CpG, the percentage of island that is C or G, and the ratio of observed(cpgNum) to expected(numC*numG/length) CpG in island.
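As a rough illustration of that column layout, here is a small sketch that names each column of one output row and recomputes the C-or-G percentage from the C+G count and the island length. The row values are illustrative only, and the variable names inside awk are my own labels, not anything defined by UCSC or BEDOPS. It assumes the 10-column, tab-separated layout the pipeline is meant to produce (the name loses its space because $5 and $6 are concatenated).

```shell
# One illustrative row in the assumed 10-column layout:
# chrom, start, end, name, length, cpgNum, gcNum, perCpg, perGc, obsExp
bed_row=$(printf 'chr1\t28735\t29810\tCpG:116\t1075\t116\t787\t21.6\t73.2\t0.83')

# Label the columns, then recompute the percentage of the island that is
# C or G as 100 * gcNum / length and print it next to the reported value.
summary=$(printf '%s\n' "$bed_row" | awk -F'\t' '{
    chrom = $1; start = $2; stop = $3; name = $4
    len = $5; cpgNum = $6; gcNum = $7; perCpg = $8; perGc = $9; obsExp = $10
    printf "%s:%d-%d %s perGc_reported=%s perGc_recomputed=%.1f\n",
        chrom, start, stop, name, perGc, 100 * gcNum / len
}')
echo "$summary"
```

If the reported and recomputed percentages disagree for a row in your own file, that is a hint the columns have shifted.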

You can do the same thing for hg38, with a slight tweak to the URL:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/cpgIslandExt.txt.gz \
   | gunzip -c \
   | awk 'BEGIN{ OFS="\t"; }{ print $2, $3, $4, $5$6, substr($0, index($0, $7)); }' \
   | sort-bed - \
   > cpgIslandExt.hg38.bed

The schema is the same between builds, but you can take a look at it here.
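Since only the assembly name changes in the URL, the two commands could be folded into one helper. This is just a sketch: the function names cpg_url and fetch_cpg_islands are hypothetical, and the awk step prints the fields explicitly ($7 through $12) rather than using substr/index. Running the fetch needs network access plus wget, gunzip, awk, and BEDOPS sort-bed.

```shell
# Build the download URL for an assembly, following the URL pattern above.
cpg_url() {
    printf 'http://hgdownload.cse.ucsc.edu/goldenpath/%s/database/cpgIslandExt.txt.gz\n' "$1"
}

# Download, reshape, and sort the CpG island table for one assembly.
# (Hypothetical helper; writes cpgIslandExt.<build>.bed in the current directory.)
fetch_cpg_islands() {
    build=$1
    wget -qO- "$(cpg_url "$build")" \
        | gunzip -c \
        | awk 'BEGIN{ OFS="\t" }{ print $2, $3, $4, $5$6, $7, $8, $9, $10, $11, $12 }' \
        | sort-bed - \
        > "cpgIslandExt.$build.bed"
}

# Usage (uncomment to run; needs network access):
# for build in hg19 hg38; do fetch_cpg_islands "$build"; done
```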

1

Thanks for the answer! Unfortunately, it has an error. When you call awk 'BEGIN{ OFS="\t"; }{ print $2, $3, $4, $5$6, substr($0, index($0, $7)); }', you print a substring starting with the first occurrence of the string found in field 7. So, if the string found in field 7 also occurs earlier in the row, then it'll print from that point instead of field 7. Indeed, on the 11th line of the supplied file, you suddenly have 13 columns instead of 11.

Here's a longer but correct snippet: awk 'BEGIN{ OFS="\t"; }{ print $2, $3, $4, $5$6, $7, $8, $9, $10, $11, $12 }'

And the whole code block:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/cpgIslandExt.txt.gz \
   | gunzip -c \
   | awk 'BEGIN{ OFS="\t"; }{ print $2, $3, $4, $5$6, $7, $8, $9, $10, $11, $12 }' \
   | sort-bed - \
   > cpgIslandExt.hg38.bed
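To see the difference without downloading anything, here is a minimal sketch that runs both awk variants on a single made-up row. The values are invented for illustration: the island length (1075) also appears at the start of chromStart (10750), which is the situation that trips up index(). The two variants then produce a different number of tab-separated columns.

```shell
# Made-up row in the cpgIslandExt.txt layout:
# bin, chrom, chromStart, chromEnd, name, length, cpgNum, gcNum, perCpg, perGc, obsExp
row=$(printf '585\tchr1\t10750\t11825\tCpG: 100\t1075\t100\t700\t18.6\t65.1\t0.9')

# Original variant: index($0, $7) finds the FIRST occurrence of the string
# in field 7, which here is inside chromStart, not at field 7 itself.
buggy=$(printf '%s\n' "$row" \
    | awk 'BEGIN{ OFS="\t" }{ print $2, $3, $4, $5$6, substr($0, index($0, $7)) }')

# Corrected variant: print the remaining fields explicitly.
fixed=$(printf '%s\n' "$row" \
    | awk 'BEGIN{ OFS="\t" }{ print $2, $3, $4, $5$6, $7, $8, $9, $10, $11, $12 }')

# Count tab-separated columns in each result.
buggy_cols=$(printf '%s\n' "$buggy" | awk -F'\t' '{ print NF }')
fixed_cols=$(printf '%s\n' "$fixed" | awk -F'\t' '{ print NF }')
echo "substr/index version:   $buggy_cols columns"
echo "explicit-field version: $fixed_cols columns"
```

The same column count check (awk -F'\t' '{ print NF }' piped through sort and uniq -c) can be run on the generated BED file to confirm that every row has the same number of columns.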

3.5 years ago by arturtjaro ▴ 50

0

Thanks, it is helpful!

8.2 years ago by winjorchen ▴ 60

score 4 · Answer 2 · 2017-02-11

4

8.3 years ago

EagleEye 7.6k

You can use the UCSC Table Browser.

0

Thanks! It is an easy way to get it; I had never found this approach before!

8.2 years ago by winjorchen ▴ 60