Extract Cg Dinucleotides Coordinates From (Human) Genome
2
2
Entering edit mode
11.6 years ago
Philippe ★ 1.9k

Hi everybody,

to perform some analyses I need to extract the coordinates of all CG dinucleotides from the human genome (hg19). It would not be extremely difficult to write such a script but, for several reasons, I would prefer to rely on already existing tools or processed datasets.

However, I was surprised that I found nothing on Google or on data repository such as UCSC,...

Is there to your knowledge such a tool or a downloadable dataset?

Thanks in advance for your answers.

genome • 6.3k views
ADD COMMENT
3
Entering edit mode
11.6 years ago

something like this ?

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

#define SEQNAME_MAX BUFSIZ

int main(int argc,char** argv)
    {
    int c;
    int pos=0;
    int prev_c=-1;
    char name[SEQNAME_MAX];
    name[0]=0;

    for(;;)
            {
            switch((c=fgetc(stdin)))
                {
        case EOF: return EXIT_SUCCESS;
        case '>':
            {
            int space=0;
            int name_length=0;
            name[0]=0;
            pos=0;
            while((c=fgetc(stdin))!=EOF && c!='\n')
                {    
                if(space) continue;
                if(isspace(c)) { space=1; continue;}
                name[name_length++]=c;
                }
            name[name_length]=0;
                    prev_c=-1;
            break;
            }
        case '\n':case '\r':case ' ':break;
        case 'g': case 'G': 
            {
            if(prev_c=='C')
                {
                printf("%s\t%d\n",name,pos);
                }
            prev_c=c;
            ++pos;
            break;
            }
        case 'c': c='C';/* continue */
                default:prev_c=c; ++pos;break;
                }
        }
    return EXIT_SUCCESS;
    }

.

$ gcc -O3 -Wall input.c
$ curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chrM.fa.gz" |  gunzip -c | ./a.out  | head
chrM    33
chrM    61
chrM    78
chrM    80
chrM    91
chrM    96
chrM    105
chrM    120
chrM    162
chrM    170
ADD COMMENT
0
Entering edit mode

Thanks for the code Pierre. I coded something similar (in python) but I was more interested in published solution as a comparison.

ADD REPLY
0
Entering edit mode

I can upload the code on figshare if needed :-P

ADD REPLY
0
Entering edit mode

Hi Philippe and Pierre,

Did you ever find a solution to this? I am very interested in finding all the CpG sites in the human genome also?

-Diviya

ADD REPLY
0
Entering edit mode

Great script, Pierre.

ADD REPLY
2
Entering edit mode
7.8 years ago
Shicheng Guo ★ 9.5k

Anyone want to validate my perl script. save this script as: cgpositionFinder.pl

#!/usr/bin/perl
# search the postions for the CpGs in human genome
# input are chr1.fa, chr2.fa and so on
# output will be chr1.CpG.positions.txt and so on.
# Finally, merge all the CpG.positions.txt and bedtools sort.  

my $fa=shift @ARGV;
open F,$fa;
my $seq;
my($chr,undef)=split/.fa/,$fa;
my $out="$chr.CpG.positions.txt";
open F,$fa;
while(<F>){
next if />/;
chomp;
$seq .=$_;
}
close F;

sub match_positions {
    my ($regex, $string) = @_;
    return if not $string =~ /($regex)/;
    return (pos($string), pos($string) + length $1);
}
sub all_match_positions {
    my ($regex, $string) = @_;
    my @ret;
    while ($string =~ /($regex)/ig) {
        push @ret, [(pos($string)-length $1),pos($string)-1];
    }
    return @ret
}

my $regex='CG';
my $string=$seq;
my $cgap=3;
my @pos=all_match_positions($regex,$string);
my @hgcg;
open OUT,">$out";
foreach my $pos(@pos){
    push @hgcg,@$pos[1];
}
foreach my $i(0..($#hgcg)){
my $pos=$hgcg[$i]-1;  # transfer to 0-based coordinate
print OUT "$chr\t$pos\n";
}
close OUT;

usage:

perl cgpositionFinder.pl chr1.fa
perl cgpositionFinder.pl chr2.fa
perl cgpositionFinder.pl chr3.fa
........

output

chr1.CpG.positions.txt
chr2.CpG.positions.txt
chr3.CpG.positions.txt
........

example:

wget http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz
tar xzvf chromFa.tar.gz
for i in `ls *fa`
do
perl ~/bin/cgPositionFinder.pl $i
done
ADD COMMENT
0
Entering edit mode

Hi Sheldon,

i tried your script for the Mouse mm10 genome. I think that nothing change since the fasta file is always the same. The output returned me an empty file. The fasta file name of the genome is: genome.fa No error was present while the script was running or at the end.. any idea?

ADD REPLY
1
Entering edit mode

Script have no problem. However, input are chr1.fa chr2.fa .... and output will be chr1.CpG.positions.txt ....

ADD REPLY
0
Entering edit mode

I dont know if it useful or not but I used your script and it is running. Thank you. I have a question;

What is the format that you got at the end of it. I had to put "End" manually but when I compared your chr,Start file format vs mine(chr start end gave the same thing.

Thanks.

ADD REPLY

Login before adding your answer.

Traffic: 1677 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6