Question

How many A's, T's, C's and G's are in the homo sapiens3 reads data set?

0

7.2 years ago

saranpons3 ▴ 70

Hello, I would like to know how many A's, T's, C's and G's there are in the homo sapiens3 reads data set (1.8 TB). Is there a tool to find this? In the human genome in general, which nucleotides are more abundant? Thanks.

genome • 3.2k views

ADD COMMENT • link 7.2 years ago by saranpons3 ▴ 70

0

Could you clarify what you want to do or find out? Are you asking how to count bases in a certain file type? Or is your question just a general one about AT/GC content in human?

ADD REPLY • link 7.2 years ago by Sean Davis 27k

0

I am asking in general: how many A's, T's, C's and G's are there in the human genome? Which nucleotides are more abundant in the human genome?

ADD REPLY • link 7.2 years ago by saranpons3 ▴ 70

0

Hello all. When I looked at the numbers in every experiment's output, I could see that A and T appear more frequently than C and G. So can I say that the human genome has more A's and T's than C's and G's?

ADD REPLY • link 7.2 years ago by saranpons3 ▴ 70

2

Yes, this is a rather well-known fact. Search for GC-content for more info. There are also Chargaff's rules, which state that the number of As ~= the number of Ts and the number of Gs ~= the number of Cs.
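Both ideas are easy to check on any sequence. The sketch below is only an illustration: the function names `gc_content` and `chargaff_ratios` are made up for this example, and the toy sequence is invented. GC content is computed over the unambiguous bases only, so N and other ambiguity codes do not dilute it.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    counts = Counter(seq.upper())
    acgt = sum(counts[base] for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (counts["G"] + counts["C"]) / acgt


def chargaff_ratios(seq):
    """Return (A/T, G/C) count ratios; values near 1 fit Chargaff's rules.

    Assumes the sequence contains at least one T and one C.
    """
    counts = Counter(seq.upper())
    return counts["A"] / counts["T"], counts["G"] / counts["C"]


toy = "ATGCATTAATGCTA"  # made-up sequence, for illustration only
print(f"GC content: {gc_content(toy):.2f}")
print("A/T and G/C ratios:", chargaff_ratios(toy))
```

On a short toy string the ratios can be far from 1; the approximate equality only shows up on long sequences such as whole chromosomes.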

ADD REPLY • link 7.2 years ago by Carlo Yague 8.9k

0

What about the masked N bases? Many of those are centromeric 'difficult to sequence' regions, and are extremely high in GC content.

The reference genome is always just our best representation of the genome. The stats comparing AT to GC from the reference build may therefore not be accurate (?)

ADD REPLY • link 7.2 years ago by Kevin Blighe 89k

0

While that may be true, we have to work with what is available now.

ADD REPLY • link 7.2 years ago by GenoMax 149k

0

May I know what you mean by "we have to work with what is available now"?

ADD REPLY • link 7.2 years ago by saranpons3 ▴ 70

0

@Kevin was indicating that many of the N's currently used as placeholders in centromeric/telomeric regions may turn out to be G's and C's in reality (current sequencing technologies are not able to sequence those regions fully). If and when we are able to sequence these regions, the proportion of AT/GC will change (as it will with each major genome build).

ADD REPLY • link 7.2 years ago by GenoMax 149k

score 4 · Answer 1 · 2017-11-27

4

7.2 years ago

Pierre Lindenbaum 165k

how many A's, T's, C's and G's are there in the human genome?

grep -v '^>' in.fasta | grep -o '.' | sort | uniq -c

(and wait...)
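For readers who prefer a script, here is a rough Python equivalent of the pipeline above. It is a sketch, not a tuned tool: the function name `count_symbols` is invented for this example. It makes one pass over the file and needs no sort step, because the tallies live in a dictionary. Like the `grep` version, it is case-sensitive, so soft-masked lowercase bases are counted separately from uppercase ones.

```python
from collections import Counter


def count_symbols(lines):
    """Tally every sequence character in FASTA-formatted lines, skipping '>' headers."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):  # header line, same filter as grep -v '^>'
            continue
        counts.update(line.rstrip("\r\n"))
    return counts


# Example use on a file (the path is whatever FASTA you have at hand):
#
#     with open("in.fasta") as handle:
#         for symbol, n in sorted(count_symbols(handle).items()):
#             print(f"{n:>12} {symbol}")
```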

ADD COMMENT • link 7.2 years ago by Pierre Lindenbaum 165k

2

(and wait...)

Until cows come home?

Edit: Took more like 30 minutes.

Output (GRCh38 from NCBI):

866420001 A
      2 B
598683433 C
600854940 G
      8 K
      8 M
165045996 N
     26 R
      4 S
868918077 T
     13 W
     33 Y

ADD REPLY • link 7.2 years ago by GenoMax 149k

1

I just mentioned today in class that some non-trivial percent of bases in the human genome are Ns. Now I can compute it. Let's see what it adds up to (surprisingly lengthy construct):

cat output | grep -v N | awk ' {print $1}'  | datamash sum 1 | xargs  echo 100 \* 165045996 / | bc -l

Looks like

5.62

close to 6% of bases are labeled as Ns
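The same arithmetic is shorter in a scripting language. The sketch below hard-codes the counts from the GRCh38 output posted earlier in the thread and computes the figure two ways. One detail worth noting: the pipeline above drops the N row (`grep -v N`) before summing, so 5.62 is the number of Ns relative to the non-N symbols. With the Ns included in the denominator the share comes out a little lower.

```python
# Counts copied from the GRCh38 output posted earlier in the thread.
counts = {
    "A": 866420001, "B": 2, "C": 598683433, "G": 600854940,
    "K": 8, "M": 8, "N": 165045996, "R": 26,
    "S": 4, "T": 868918077, "W": 13, "Y": 33,
}

total = sum(counts.values())
non_n = total - counts["N"]

pct_n_vs_non_n = 100 * counts["N"] / non_n  # what the shell pipeline computes
pct_n_of_total = 100 * counts["N"] / total  # share of all symbols that are N

# GC content among the unambiguous bases only.
acgt = sum(counts[base] for base in "ACGT")
pct_gc = 100 * (counts["C"] + counts["G"]) / acgt

print(f"N relative to non-N symbols: {pct_n_vs_non_n:.2f}%")  # 5.62%
print(f"N as a share of all symbols: {pct_n_of_total:.2f}%")  # 5.32%
print(f"GC content among A/C/G/T:    {pct_gc:.2f}%")          # 40.87%
```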

ADD REPLY • link 7.2 years ago by Istvan Albert 102k

0

surprisingly lengthy construct

Because bash is not the fastest/best way to handle (... sorting) 3E9 bases. A fast answer would use a C program with an array.

long count[UCHAR_MAX + 1];....
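The idea behind the array is one counter per possible byte value, so each base costs a single increment and nothing ever needs sorting. Below is a sketch of that data structure in Python rather than C. The function name `count_bytes` is invented for this example, and a pure Python loop will be far slower than compiled C; the point is the table, not the speed.

```python
def count_bytes(lines):
    """Tally sequence bytes from FASTA lines read in binary mode.

    Uses a fixed table with one slot per possible byte value (0..255).
    """
    counts = [0] * 256
    for line in lines:
        if line.startswith(b">"):  # FASTA header line
            continue
        for byte in line:          # iterating over bytes yields ints 0..255
            counts[byte] += 1
    # Line terminators are not sequence symbols.
    counts[ord("\n")] = 0
    counts[ord("\r")] = 0
    return {chr(value): n for value, n in enumerate(counts) if n}


# Example use on a file opened in binary mode:
#
#     with open("in.fasta", "rb") as handle:
#         print(count_bytes(handle))
```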

ADD REPLY • link 7.2 years ago by Pierre Lindenbaum 165k

0

EMBOSS compseq does it in under 1 min. But for some reason, the output doesn't contain bases in lower case. (Working on the code.)

$ time(compseq -word 1  hg38.fa result.cmp)
Calculate the composition of unique words in sequences

real    0m58.809s
user    0m58.172s
sys     0m0.620s

output:

#
# Output from 'compseq'
#
# The Expected frequencies are calculated on the (false) assumption that every
# word has equal frequency.
#
# The input sequences are:
#   chr1
#   chr10
#   chr11
#   chr11_KI270721v1_random
#   chr12
#   chr13
#   chr14
#   chr14_GL000009v2_random
#   chr14_GL000225v1_random
#   chr14_KI270722v1_random
# ... et al.


Word size   1
Total count 3209286105

#
# Word  Obs Count   Obs Frequency   Exp Frequency   Obs/Exp Frequency
#
A   898285419       0.2799019   0.2500000   1.1196078
C   623727342       0.1943508   0.2500000   0.7774032
G   626335137       0.1951634   0.2500000   0.7806535
T   900967885       0.2807378   0.2500000   1.1229512

Other   159970322       0.0498461   0.0000000   10000000000.0000000
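The columns of this table can be reproduced from the raw counts, which makes it easier to see what compseq did. The sketch below hard-codes the counts from the output above. The observed frequency is the count divided by the total, and the expected frequency for word size 1 is 1/4 under the equal-frequency assumption that the header mentions. Presumably the "Other" row is where N and the other ambiguity codes ended up, but that is an assumption, not something the output states.

```python
# Counts copied from the compseq output above.
observed = {
    "A": 898285419,
    "C": 623727342,
    "G": 626335137,
    "T": 900967885,
    "Other": 159970322,
}

total = sum(observed.values())  # should equal the reported "Total count"
expected = 1 / 4                # equal-frequency assumption for word size 1

for word in "ACGT":
    obs_freq = observed[word] / total
    print(f"{word}\t{observed[word]}\t{obs_freq:.7f}\t{expected:.7f}\t{obs_freq / expected:.7f}")

other_freq = observed["Other"] / total
acgt = total - observed["Other"]
gc_fraction = (observed["C"] + observed["G"]) / acgt

print(f"Other\t{observed['Other']}\t{other_freq:.7f}")
print(f"GC fraction among A/C/G/T: {gc_fraction:.4f}")
```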

ADD REPLY • link 7.2 years ago by cpad0112 21k

0

Amazingly fast indeed. Also exhibits that endemic and tragic shortsightedness that permeates every EMBOSS tool - an implementation with tacit flaws in just about every tool.

See how it ignores the Ns? As if those did not exist. But they do...

EMBOSS is a toolset that was invented too soon, too far ahead of its time - and with that comes the awkwardness of its interface that keeps it from being successful.

ADD REPLY • link 7.2 years ago by Istvan Albert 102k

0

I guess the EMBOSS team tried to emulate GCG (success) and then went nowhere from there.

ADD REPLY • link 7.2 years ago by cpad0112 21k

0

I meant "surprisingly lengthy construct" as a qualifier of my own solution for getting the 5.62 number. It felt silly long just to take the fraction relative to a sum. Getting the 6% turned out to be longer than computing all the rest.

ADD REPLY • link 7.2 years ago by Istvan Albert 102k

0

I know that A, T, C, G and N can appear in the reads of FASTQ and FASTA files. But why does your result have B, K, M, N, R, S, W and Y? Do reads from a reads data set contain letters other than A, T, C, G and N?

ADD REPLY • link 7.2 years ago by saranpons3 ▴ 70

0

But why does your result have B, K, M, N, R, S, W and Y?

It's a degenerate nucleotide alphabet (S = 'Strong' = C or G).

See also http://plindenbaum.blogspot.fr/2013/07/g1kv37-vs-hg19.html
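For reference, the degenerate alphabet is the IUPAC nucleotide ambiguity code, in which each letter stands for a set of possible bases. A small lookup table makes the letters in the count output easier to read. The helper name `expand` is made up for this sketch.

```python
# IUPAC nucleotide ambiguity codes: each symbol stands for a set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # puRine
    "Y": "CT",    # pYrimidine
    "S": "CG",    # Strong (three hydrogen bonds)
    "W": "AT",    # Weak (two hydrogen bonds)
    "K": "GT",    # Keto
    "M": "AC",    # aMino
    "B": "CGT",   # not A
    "D": "AGT",   # not C
    "H": "ACT",   # not G
    "V": "ACG",   # not T
    "N": "ACGT",  # any base
}


def expand(symbol):
    """Return the sorted list of bases that an IUPAC symbol can represent."""
    return sorted(IUPAC[symbol.upper()])


# The non-ACGT symbols that showed up in the GRCh38 count.
for symbol in "BKMNRSWY":
    print(symbol, "=", " or ".join(expand(symbol)))
```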

ADD REPLY • link 7.2 years ago by Pierre Lindenbaum 165k

0
