How to find CpG island shore?

0

Entering edit mode

10.9 years ago

achamess ▴ 90

Hi,

I'm interested in analyzing the methylation of the TRPV1 gene. I need to find CpG islands as well as CpG island shores, since I've read these these shores can also be differentially methylated, even when the CpG islands aren't.

Could anyone point me to some web based or desktop tools to find CpG island shores? Or, if there aren't any tools off-the-shelf, could you describe a method for finding these shore regions?

Thanks

methylation • 9.1k views

ADD COMMENT • link updated 2.4 years ago by Pratik ★ 1.1k • written 10.9 years ago by achamess ▴ 90

1

Entering edit mode

I tried to write a perl code to subtract 2000bp and add 2000 to the CpG island region to get CpG shore regions (55436 CpG shores) [Note: I removed chrUn, random and hap regions from hg19 genomes]. and then use bedtools subtract to remove CpG shores overlapped with CpG island (48790) and then merge the CpG shores which are overlapped within CpG shore regions (46817 CpG shore).

Eventually, 46817 CpG shore region were obtained. and you can download this annotation: hg19.cpgshoreExt.txt

You can find the code in Perl Script for Bed file preparation for CpGI, CpG Shore, CpG shelf regions

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 9.7 years ago by Shicheng Guo ★ 9.6k

4

Entering edit mode

10.9 years ago

Chirag Nepal ★ 2.4k

https://www.ncbi.nlm.nih.gov/pubmed/19151715

Here we show that most methylation alterations in colon cancer occur not in promoters, and also not in CpG islands, but in sequences up to 2 kb distant, which we term 'CpG island shores'.

It it is so, it is very st forward to get shores, using one liner in awk.

ADD COMMENT • link updated 5.3 years ago by Ram 45k • written 10.9 years ago by Chirag Nepal ★ 2.4k

0

Entering edit mode

Now there is CpG "ravines", inaddition to CoG island and CpG shores

http://www.epigeneticsandchromatin.com/content/7/1/28/abstract

ADD REPLY • link 10.5 years ago by Chirag Nepal ★ 2.4k

0

Entering edit mode

10.0 years ago

mark.ziemann ★ 2.0k

You can find CpG shores by looking firstly at where the CpG islands are on your gene at the UCSC browser under "regulation" and then work in a 2 kbp window.

If you need to identify them across the whole genome, just follow my blog post.

http://genomespot.blogspot.com.au/2015/04/what-is-cpg-shore-and-how-to-i-get-them.html

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 10.0 years ago by mark.ziemann ★ 2.0k

0

Entering edit mode

8.1 years ago

EagleEye 7.6k

Here is one more solution,

	#!/usr/bin/perl
	$/=undef;
	## Run script: perl CpG_feature_extraction.pl CpG_islands.bed shores/shelves (This gives warnings for 'rm' command, just ignore it)
	# Reference for definition of shores and shelves from:
	# http://www.mdpi.com/2073-4425/5/2/347/htm
	# http://www.mdpi.com/genes/genes-05-00347/article_deploy/html/images/genes-05-00347-g001-1024.png
	extend_coordinates($ARGV[0] ,$ARGV[1]);
	sub extend_coordinates
	{
	$type=$ARGV[1];
	open(FILE, ">>${ARGV[0]}_${type}.bed");
	open(IN,$ARGV[0]) or die $!;
	$file=<IN>;
	@myfile=split("\n",$file);
	foreach $myfile(@myfile)
	{
	($chr,$start,$stop,$name)=split("\t",$myfile);
	if($type eq "shores")
	{
	$up_start=$start-2000;
	$up_end=$start;
	$dn_start=$stop;
	$dn_end=$stop+2000;
	}
	if($type eq "shelves")
	{
	$up_start=$start-4000;
	$up_end=$start-2000;
	$dn_start=$stop+2000;
	$dn_end=$stop+4000;
	}
	print FILE "$chr\t$up_start\t$up_end\t${name}_up\n$chr\t$dn_start\t$dn_end\t${name}_dn\n";
	}
	}
	close(FILE);
	system( "rm ${ARGV[0]}_${ARGV[1]}.bed ${ARGV[0]}_${ARGV[1]}_noOverlap.bed" );
	&extend_coordinates;
	system( "sort -k1,1 -k2,2n ${ARGV[0]}_${ARGV[1]}.bed \| awk '{if($1!=l\|\|$2>e){if(l)print l,b,e,$4;l=$1;b=$2;e=$3}else e=$3}END{print l,b,e,$4}' \| sed 's/ /\t/g' > ${ARGV[0]}_${ARGV[1]}_noOverlap.bed " );
	system( "rm ${ARGV[0]}_${ARGV[1]}.bed" );

view raw CpG_feature_extraction.pl hosted with ❤ by GitHub

ADD COMMENT • link updated 5.3 years ago by Ram 45k • written 8.1 years ago by EagleEye 7.6k

0

Entering edit mode

2.4 years ago

Pratik ★ 1.1k

And another solution using UCSC tools and bedtools/samtools. I did this on a university cluster which has some form of linux and slurm, so this worked easily for me using the available modules in module avail and loading them via module load

How to generate .bed's for CpG Islands, Shores (2kb flanking the CpG Island), and Shelves (2kb flanking the shore of the CpG Island)

First, generate CpG Island .bed

To do this, use the UCSC tutorial with programs/tutorial here: http://genome.ucsc.edu/cgi-bin/hgTrackUi?g=cpgIslandExt and also use the following tutorial to generate the needed 2bit files for the previous tutorial: https://genome.ucsc.edu/goldenPath/help/twoBit.html

now, some of the tools below from UCSC (binaries) may need to be download via wget and placed into the home directory bin, and (then for easy use) the bin folder should be added to your path, here is how you could download these, make and make them executable:

cd
cd bin/
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faToTwoBit
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/maskOutFa
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/cpg_lh
ls
chmod 744 cpg_lh 
chmod 744 faToTwoBit 
chmod 744 maskOutFa 
chmod 744 twoBitToFa

ensure that the location where you downloaded all these binaries your home directory's bin (~/bin) is in your $PATH in ~/.bashrc this is pretty much something that added so when you simply type faToTwoBit and hit enter in terminal, terminal will know to look in your ~/bin folder for the executable faToTwoBit (and the others that are there!)

if you do not know how to add things to your PATH, you should please learn, It has helped me numerus times and here too : ) it's good to know so that if you need to install miscellaneous things quick you can. Google will lead to you stackoverflow links, which should be very helpful!

you should do the rest of this in another folder (not the same as your ~/bin) maybe a folder named genomes or cpgs or something else Download genome.fa file and gunzip

wget https://ftp.ensembl.org/pub/release-108/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz

gunzip Mus_musculus.GRCm39.dna.primary_assembly.fa.gz

then:

faToTwoBit Mus_musculus.GRCm39.dna.primary_assembly.fa Mus_musculus.GRCm39.dna.primary_assembly_fa.2bit

now, generate CpGIsland.bed (from UCSC tutorial)

twoBitToFa Mus_musculus.GRCm39.dna.primary_assembly_fa.2bit stdout | maskOutFa stdin hard stdout \
  | cpg_lh /dev/stdin 2> cpg_lh.err \
    |  awk '{$2 = $2 - 1; width = $3 - $2;  printf("%s\t%d\t%s\t%s %s\t%s\t%s\t%0.0f\t%0.1f\t%s\t%s\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \
     | sort -k1,1 -k2,2n > cpgIsland.bed

some doubts: only doubt with the above stuff, is that, I used the Ensembl genome assembly 108, which is the most recently updated [2022] (compared to the UCSC [2020])the doubt arises with UCSC's use of the 0-base convention, where they use 0-base starts (for I think everythin) where everyone else uses a 1-base start convention... and so the tools above are all from UCSC, and again, the genome assembly, I used, is Ensembl, so I am unsure if the 0 or 1 base is mismatched slightly in some way that I don't know right now!I can figure it out, but for the sake of time (and hypothesis generation) I don't think 0 or 1 base will make or break a new hypothesis being formed from the results therefore, I will not prioritize that detail, for now***, to focus on what's to be found. (I'll check this and, if needed, fix it after, not too critical right now)

the CpGIsland.bed looks good (checked the Ensembl browser for coordinates that Ensembl's version of CPgisland.bed [generated above] had that UCSC didn't, and just visually, there were CGs present on the forward strand, so I drew the conclusion that the CpGIsland.bed generated from Ensembl (again above) was generated correctly without mistakes or just things screwing up)

now, we will follow this tutorial: http://genomespot.blogspot.com/2015/04/what-is-cpg-shore-and-how-to-i-get-them.html (from How to find CpG island shore? by mark.ziemann ) to generate a .bed file of CpG shores,
and later adapt the tools (by reading some the manual 'man' pages from some of the tools from bedtools) to generate .bed of CpG shelves

samtools faidx Mus_musculus.GRCm39.dna.primary_assembly.fa


cut -f-2 Mus_musculus.GRCm39.dna.primary_assembly.fa.fai > Mus_musculus.GRCm39.dna.primary_assembly.fa.g


bedtools slop -i cpgIsland.bed -g Mus_musculus.GRCm39.dna.primary_assembly.fa.g -b 2000 | bedtools merge \
| bedtools subtract -a - -b cpgIsland.bed > cpgShores.bed

now for CpGshelves.bed

bedtools slop -i cpgIsland.bed -g Mus_musculus.GRCm39.dna.primary_assembly.fa.g -b 4000 | bedtools merge \
| bedtools subtract -a - -b cpgIsland.bed | bedtools subtract -a - -b cpgShores.bed > cpgShelves.bed

done

Sorry for the sloppiness, hopefully, this is useful to someone (who doesn't normally use Perl and has used samtools, bedtools, and installed some UCSC tools for misc. purposes).

ADD COMMENT • link 2.4 years ago by Pratik ★ 1.1k

Login before adding your answer.