How to find out total # of CpGs sites from a fasta file?
3
0
Entering edit mode
8.0 years ago
orlando.wong ▴ 60

Hello biostars,

I want to find all the # of CpG sites in Chr3.fa from UCSC.

Opening the chr3.fa in microsoft Word and using the CTRL + F, Find and Replace crashed the program. I tried to make a C++ script to count the CGs in a test file.

#include <istream>
#include <cstdlib>
#include <iostream>
#include <fstream>
using namespace std;

int main()
{

        //Open the file:
        ifstream inputfile;
        inputfile.open("/mnt/i/home/Desktop/test.txt");


        //Get length of the file:
        inputfile.seekg (0, inputfile.end);
        int length = inputfile.tellg();
        inputfile.seekg (0, inputfile.beg);

        // read each base in the file for C
        int i;
        int cgcounter;
        char firstbuffer[1]; // Check if that is C
        char secondbuffer[1]; // Check if this is G
        for (i=0; i<= length; i++)
                {

                        inputfile.seekg(i, ios::beg);
                        inputfile.read(firstbuffer,1);
                        if (firstbuffer[1] == 'C' || firstbuffer[1] == 'c') 
                                {

                                     inputfile.seekg(i+1, ios::beg);
                                     inputfile.read(secondbuffer,1); 
                                                if (secondbuffer[1] == 'G' || secondbuffer[1] == 'g') 
                                                        {
                                                                cgcounter++;
                                                                i++;
                                                        }


                                }
                }




        cout << "The number of CGs in your file is " << cgcounter << "\n" ;
        return 0;
}

The test.txt is a file with just AAAAAAAAAAACGAAAAAAAAAAAACGAAAAAAAAAAACG (three CGs), but when I run the script, I get different outputs each time:

 labyu@DESKTOP-U037B9F:/mnt/i/home/Workspac$ g++ cgcounter.cpp
labyu@DESKTOP-U037B9F:/mnt/i/home/Workspac$ ./a.out
The number of CGs in your file is 635585824
labyu@DESKTOP-U037B9F:/mnt/i/home/Workspac$ ./a.out
The number of CGs in your file is 711083296

The code compiles fine, but I am not sure where these random numbers are coming from. I am very new to C++ and C. Alternatively, is their an easier way to find the # of CGs in a fasta/txt file?

Thanks!

sequence Cpp fasta CpG • 4.3k views
ADD COMMENT
4
Entering edit mode
8.0 years ago

A simple C program:

#include <stdio.h>


int main(int argc,char** argv)
    {
    long N=0L;
    int prev=-1;
    for(;;)
        {
        int c=fgetc(stdin);
        switch(c)
            {
            case EOF: case '>': {
                if(N>0) {
                    printf("\t%ld\n",N);
                    }
                if(c==EOF) return 0;
                fputc('>',stdout);
                while((c=fgetc(stdin))!=EOF && c!='\n') {
                    fputc(c,stdout);
                    }
                N=0L;
                prev=-1;
                break;
                }
            case '\n':break;
            case 'g':case 'G': {
                if(prev=='C') N++;
                prev='G';
                break;
                }
            case 'c': case 'C': {
                if(prev=='G') N++;
                prev='C';
                break;
                }
            default: {
                prev=c;
                break;
                }
            }
        }
    return 0  ;
    }

e.g:

 wget -q -O - "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr3.fa.gz" | gunzip -c | ./a.out 
>chr3   9469074
ADD COMMENT
0
Entering edit mode

Thanks Pierre!

I am trying to learn C as well. This is easy to understand. Could you explain a few parts of the program? How come we can store a character 'C' or 'G' in the prev variable if we declared it as an integer earlier? Is there any significance to initializing it as -1.

ADD REPLY
1
Entering edit mode

How come we can store a character 'C' or 'G' in the prev variable if we declared it as an integer earlier?

the size of an integer is greater than the size of a char sizeof(char)<sizeof(int), so you can always store a char into a 'int'. Furthermore, fgetc returns a 'int', not a 'char' http://www.cplusplus.com/reference/cstdio/fgetc/

Is there any significance to initializing it as -1.

-1 is arbitrarty but I'm sure it is initialized as a non interesting value like 'A', 'T', '>', '\n' ...

ADD REPLY
3
Entering edit mode
8.0 years ago

Using locate (usage) command of SeqKit, which provides executable binary files for Windows/Linux... Just download, decompress and run.

Test data in FASTA format:

$ cat test.txt 
>test
AAAAAAAAAAAC
GAAAAAAAAAAA
ACGAAAAAAAAA
AACGcgc

Locating the CGs:

$ ./seqkit locate --ignore-case --only-positive-strand --pattern "(CG)+" test.txt 
seqID   patternName     pattern strand  start   end     matched
test    (CG)+   (CG)+   +       12      13      CG
test    (CG)+   (CG)+   +       26      27      CG
test    (CG)+   (CG)+   +       39      42      CGcg

Aligned result:

$ ./seqkit locate --ignore-case --only-positive-strand --pattern "(CG)+" test.txt | column -t 
seqID  patternName  pattern  strand  start  end  matched
test   (CG)+        (CG)+    +       12     13   CG
test   (CG)+        (CG)+    +       26     27   CG
test   (CG)+        (CG)+    +       39     42   CGcg

But, what's this for?

ADD COMMENT
0
Entering edit mode

Thanks shen wei! I am trying to compute coverage of CpG sites from some datasets from the Infinium Methylation 450k. It has about 450,000 CpG sites and I wanted to know how much of chr 3's CpG sites were covered.

ADD REPLY
3
Entering edit mode
8.0 years ago

I think this could be done with Linux utilities (assuming you are on a Unix system)

grep -v '>' chroms.fa | grep -o -i 'CG' | wc -l

Edit: Of course this answers only the question How many CpG in total, it doesn't tell you where they are.

It's curious that you are switching from a very basic solution (MS Word) to an overly engineered one (C++)!

For the future, maybe worth looking into Python for this sort of text processing.

ADD COMMENT
2
Entering edit mode

this will miss patterns like:

nnnnnnG
Cnnnnnn
ADD REPLY
0
Entering edit mode

Indeed! That's quite an embarrassing oversight!

ADD REPLY
2
Entering edit mode

or you can transfrom all the non GC to '\n'

grep -v "^>" chroms.fa  | tr -d "\n" | tr -c "[GCgc]" "\n" | grep -v '^$' | grep -o -i 'CG' | wc -l
ADD REPLY
0
Entering edit mode

Thanks so much! That command worked on a huge file so quickly. I will take your advice and learn some Linux utilities and Python for simplicity.

Also, will that count the lower case 'cg' or 'cG' or 'Gc'?

ADD REPLY
1
Entering edit mode

Take care that my command is buggy (see shenwei356's comment). Use Pierre's version if anything. But yes, grep -i is case insensitive and will match cg or CG. And yes, grep is quite fast.

ADD REPLY

Login before adding your answer.

Traffic: 1624 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6