Hello biostars,
I want to find all the # of CpG sites in Chr3.fa from UCSC.
Opening the chr3.fa in microsoft Word and using the CTRL + F, Find and Replace crashed the program. I tried to make a C++ script to count the CGs in a test file.
#include <istream>
#include <cstdlib>
#include <iostream>
#include <fstream>
using namespace std;
int main()
{
//Open the file:
ifstream inputfile;
inputfile.open("/mnt/i/home/Desktop/test.txt");
//Get length of the file:
inputfile.seekg (0, inputfile.end);
int length = inputfile.tellg();
inputfile.seekg (0, inputfile.beg);
// read each base in the file for C
int i;
int cgcounter;
char firstbuffer[1]; // Check if that is C
char secondbuffer[1]; // Check if this is G
for (i=0; i<= length; i++)
{
inputfile.seekg(i, ios::beg);
inputfile.read(firstbuffer,1);
if (firstbuffer[1] == 'C' || firstbuffer[1] == 'c')
{
inputfile.seekg(i+1, ios::beg);
inputfile.read(secondbuffer,1);
if (secondbuffer[1] == 'G' || secondbuffer[1] == 'g')
{
cgcounter++;
i++;
}
}
}
cout << "The number of CGs in your file is " << cgcounter << "\n" ;
return 0;
}
The test.txt is a file with just AAAAAAAAAAACGAAAAAAAAAAAACGAAAAAAAAAAACG (three CGs), but when I run the script, I get different outputs each time:
labyu@DESKTOP-U037B9F:/mnt/i/home/Workspac$ g++ cgcounter.cpp
labyu@DESKTOP-U037B9F:/mnt/i/home/Workspac$ ./a.out
The number of CGs in your file is 635585824
labyu@DESKTOP-U037B9F:/mnt/i/home/Workspac$ ./a.out
The number of CGs in your file is 711083296
The code compiles fine, but I am not sure where these random numbers are coming from. I am very new to C++ and C. Alternatively, is their an easier way to find the # of CGs in a fasta/txt file?
Thanks!
Thanks Pierre!
I am trying to learn C as well. This is easy to understand. Could you explain a few parts of the program? How come we can store a character 'C' or 'G' in the prev variable if we declared it as an integer earlier? Is there any significance to initializing it as -1.
the size of an integer is greater than the size of a char
sizeof(char)<sizeof(int)
, so you can always store a char into a 'int'. Furthermore, fgetc returns a 'int', not a 'char' http://www.cplusplus.com/reference/cstdio/fgetc/-1 is arbitrarty but I'm sure it is initialized as a non interesting value like 'A', 'T', '>', '\n' ...