Hi friends, how can I download all the CpG island data for hg38 or hg19 from UCSC? Is there a CpG island database? Thanks.
For hg19, you can grab the cpgIslandExt table from UCSC's goldenpath service and use BEDOPS sort-bed to build a sorted BED4+ file:
$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/cpgIslandExt.txt.gz \
| gunzip -c \
| awk 'BEGIN{ OFS="\t"; }{ print $2, $3, $4, $5$6, substr($0, index($0, $7)); }' \
| sort-bed - \
> cpgIslandExt.hg19.bed
Going by the table schema for this file, the first four columns are the island's genomic interval and name. The remaining columns are the island length, the number of CpGs in the island, the number of C and G in the island, the percentage of the island that is CpG, the percentage of the island that is C or G, and the ratio of observed (cpgNum) to expected (numC*numG/length) CpG in the island.
You can do the same thing for hg38, with a slight tweak to the URL:
$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/cpgIslandExt.txt.gz \
| gunzip -c \
| awk 'BEGIN{ OFS="\t"; }{ print $2, $3, $4, $5$6, substr($0, index($0, $7)); }' \
| sort-bed - \
> cpgIslandExt.hg38.bed
The schema is the same between builds, but you can take a look at it here.
Thanks for the answer! Unfortunately, it has an error. When you call
awk 'BEGIN{ OFS="\t"; }{ print $2, $3, $4, $5$6, substr($0, index($0, $7)); }'
you print a substring of the row starting at the first occurrence of the string found in field 7. So if that string also occurs earlier in the row, the output starts from that earlier position instead of from field 7. Indeed, on the 11th line of the supplied file, you suddenly have 13 columns instead of 10. Here's a longer but correct snippet:
awk 'BEGIN{ OFS="\t"; }{ print $2, $3, $4, $5$6, $7, $8, $9, $10, $11, $12 }'
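To see the difference on a single row, here is a small check on a made-up record (not taken from the UCSC file) whose length value, 1075, also appears inside its start coordinate, 10750. It just counts the tab-separated columns each version produces:

```shell
# Made-up row in the cpgIslandExt layout (bin, chrom, start, end, name, length, ...).
# The length field (1075) is also a substring of the start coordinate (10750).
row=$(printf '585\tchr1\t10750\t11825\tCpG: 116\t1075\t116\t787\t21.6\t73.2\t0.83')

# Original version: substr/index starts at the first match of $7 anywhere in the row.
buggy=$(printf '%s\n' "$row" \
  | awk 'BEGIN{ OFS="\t"; }{ print $2, $3, $4, $5$6, substr($0, index($0, $7)); }' \
  | awk -F'\t' '{ print NF }')

# Corrected version: every field is named explicitly.
fixed=$(printf '%s\n' "$row" \
  | awk 'BEGIN{ OFS="\t"; }{ print $2, $3, $4, $5$6, $7, $8, $9, $10, $11, $12 }' \
  | awk -F'\t' '{ print NF }')

echo "substr/index version:   $buggy columns"
echo "explicit-field version: $fixed columns"
```

Running something like `awk -F'\t' '{ print NF }' cpgIslandExt.hg19.bed | sort | uniq -c` on the finished file is a quick way to confirm that every row has the same number of columns.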
And the whole code block:
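Putting the corrected awk call back into the download pipeline from the answer gives roughly the following. This is a sketch, not a tested release: the function names `cpg_to_bed` and `fetch_cpg_islands` are made up for illustration, and it assumes wget, gunzip and BEDOPS sort-bed are on your PATH.

```shell
# Convert cpgIslandExt rows (read from stdin) to tab-delimited BED4+.
# Field numbers rely on awk's default whitespace splitting, which breaks
# the name "CpG: 116" into $5 and $6; they are rejoined as "CpG:116".
cpg_to_bed() {
    awk 'BEGIN{ OFS="\t"; }{ print $2, $3, $4, $5$6, $7, $8, $9, $10, $11, $12 }'
}

# Download, convert and sort the table for one build.
# fetch_cpg_islands is a hypothetical helper name, not part of any tool.
fetch_cpg_islands() {
    build="$1"   # e.g. hg19 or hg38
    wget -qO- "http://hgdownload.cse.ucsc.edu/goldenpath/${build}/database/cpgIslandExt.txt.gz" \
        | gunzip -c \
        | cpg_to_bed \
        | sort-bed - \
        > "cpgIslandExt.${build}.bed"
}

# Usage:
#   fetch_cpg_islands hg19
#   fetch_cpg_islands hg38
```

Keeping the awk step in its own function makes it easy to try on a few lines of the table before running the full download.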
Thanks, it is helpful!