And another solution using UCSC tools and bedtools/samtools. I did this on a university cluster which has some form of linux and slurm
, so this worked easily for me using the available modules in module avail
and loading them via module load
How to generate .bed's for CpG Islands, Shores (2kb flanking the CpG Island), and Shelves (2kb flanking the shore of the CpG Island)
First, generate CpG Island .bed
To do this, use the UCSC tutorial with programs/tutorial here: http://genome.ucsc.edu/cgi-bin/hgTrackUi?g=cpgIslandExt and also use the following tutorial to generate the needed 2bit files for the previous tutorial: https://genome.ucsc.edu/goldenPath/help/twoBit.html
now, some of the tools below from UCSC (binaries) may need to be download via wget and placed into the home directory bin, and (then for easy use) the bin folder should be added to your path, here is how you could download these, make and make them executable:
cd
cd bin/
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faToTwoBit
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/maskOutFa
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/cpg_lh
ls
chmod 744 cpg_lh
chmod 744 faToTwoBit
chmod 744 maskOutFa
chmod 744 twoBitToFa
ensure that the location where you downloaded all these binaries your home directory's bin (~/bin
) is in your $PATH in ~/.bashrc this is pretty much something that added so when you simply type faToTwoBit
and hit enter
in terminal
, terminal will know to look in your ~/bin
folder for the executable faToTwoBit
(and the others that are there!)
if you do not know how to add things to your PATH
, you should please learn, It has helped me numerus times and here too : ) it's good to know so that if you need to install miscellaneous things quick you can. Google will lead to you stackoverflow links, which should be very helpful!
you should do the rest of this in another folder (not the same as your ~/bin) maybe a folder named genomes
or cpgs
or something else
Download genome.fa file and gunzip
wget https://ftp.ensembl.org/pub/release-108/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
gunzip Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
then:
faToTwoBit Mus_musculus.GRCm39.dna.primary_assembly.fa Mus_musculus.GRCm39.dna.primary_assembly_fa.2bit
now, generate CpGIsland.bed (from UCSC tutorial)
twoBitToFa Mus_musculus.GRCm39.dna.primary_assembly_fa.2bit stdout | maskOutFa stdin hard stdout \
| cpg_lh /dev/stdin 2> cpg_lh.err \
| awk '{$2 = $2 - 1; width = $3 - $2; printf("%s\t%d\t%s\t%s %s\t%s\t%s\t%0.0f\t%0.1f\t%s\t%s\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \
| sort -k1,1 -k2,2n > cpgIsland.bed
some doubts:
only doubt with the above stuff, is that, I used the Ensembl genome assembly 108, which is the most recently updated [2022] (compared to the UCSC [2020])the doubt arises with UCSC's use of the 0-base convention, where they use 0-base starts (for I think everythin) where everyone else uses a 1-base start convention...
and so the tools above are all from UCSC, and again, the genome assembly, I used, is Ensembl, so I am unsure if the 0 or 1 base is mismatched slightly in some way that I don't know right now!I can figure it out, but for the sake of time (and hypothesis generation) I don't think 0 or 1 base will make or break a new hypothesis being formed from the results
therefore, I will not prioritize that detail, for now***, to focus on what's to be found. (I'll check this and, if needed, fix it after, not too critical right now)
the CpGIsland.bed looks good (checked the Ensembl browser for coordinates that Ensembl's version of CPgisland.bed [generated above] had that UCSC didn't,
and just visually, there were CGs present on the forward strand, so I drew the conclusion that the CpGIsland.bed generated from Ensembl (again above) was generated correctly without mistakes or just things screwing up)
now, we will follow this tutorial: http://genomespot.blogspot.com/2015/04/what-is-cpg-shore-and-how-to-i-get-them.html (from How to find CpG island shore? by mark.ziemann ) to generate a .bed file of CpG shores,
and later adapt the tools (by reading some the manual 'man' pages from some of the tools from bedtools) to generate .bed of CpG shelves
samtools faidx Mus_musculus.GRCm39.dna.primary_assembly.fa
cut -f-2 Mus_musculus.GRCm39.dna.primary_assembly.fa.fai > Mus_musculus.GRCm39.dna.primary_assembly.fa.g
bedtools slop -i cpgIsland.bed -g Mus_musculus.GRCm39.dna.primary_assembly.fa.g -b 2000 | bedtools merge \
| bedtools subtract -a - -b cpgIsland.bed > cpgShores.bed
now for CpGshelves.bed
bedtools slop -i cpgIsland.bed -g Mus_musculus.GRCm39.dna.primary_assembly.fa.g -b 4000 | bedtools merge \
| bedtools subtract -a - -b cpgIsland.bed | bedtools subtract -a - -b cpgShores.bed > cpgShelves.bed
done
Sorry for the sloppiness, hopefully, this is useful to someone (who doesn't normally use Perl and has used samtools, bedtools, and installed some UCSC tools for misc. purposes).
I tried to write a perl code to subtract 2000bp and add 2000 to the CpG island region to get CpG shore regions (55436 CpG shores) [Note: I removed chrUn, random and hap regions from hg19 genomes]. and then use bedtools subtract to remove CpG shores overlapped with CpG island (48790) and then merge the CpG shores which are overlapped within CpG shore regions (46817 CpG shore).
Eventually, 46817 CpG shore region were obtained. and you can download this annotation: hg19.cpgshoreExt.txt
You can find the code in Perl Script for Bed file preparation for CpGI, CpG Shore, CpG shelf regions