Hello, I would like to know how many A's, T's, C's and G's there are in the homo sapiens3 reads data set (1.8 TB). Is there a tool to find this? And in general, which nucleotide has the highest count in the human genome? Thanks.
How many A's, T's, C's and G's are there in the human genome?
grep -v '^>' in.fasta | grep -o '.' | sort | uniq -c
(and wait...)
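The waiting comes mostly from sorting one line per base. A single-pass count avoids the sort entirely. Below is a minimal sketch in Python, assuming a plain (uncompressed) FASTA file; the function name count_bases is made up for illustration and is not part of any tool mentioned in this thread.

```python
from collections import Counter
from typing import Iterable


def count_bases(lines: Iterable[str]) -> Counter:
    """Count every sequence character in FASTA lines, skipping header lines."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith(">"):
            continue  # header line, the same filter as grep -v '^>'
        counts.update(line.strip())  # strip the newline, count each character
    return counts


# Usage sketch (in.fasta is the placeholder file name from the pipeline above):
#
#     with open("in.fasta") as handle:
#         for base, n in sorted(count_bases(handle).items()):
#             print(n, base)
```

Like the grep pipeline, this keeps upper-case and lower-case letters apart, so soft-masked bases show up as their own rows.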
I just mentioned today in class that some non-trivial percentage of bases in the human genome are Ns. Now I can compute it. Let's see what it adds up to (surprisingly lengthy construct):
cat output | grep -v N | awk ' {print $1}' | datamash sum 1 | xargs echo 100 \* 165045996 / | bc -l
Looks like 5.62, so close to 6% of bases are labeled as Ns.
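As written, the pipeline above appears to divide the N count by the sum of the non-N rows (grep -v N drops the N row before summing), so the percentage is relative to the non-N bases rather than to all bases. Here is a hedged sketch that parses the same kind of `sort | uniq -c` output and divides by the full total instead; the function name percent_n and the assumed two-column "count symbol" layout are illustrative assumptions.

```python
def percent_n(uniq_c_output: str) -> float:
    """Percentage of N/n among all symbols in `sort | uniq -c` style output."""
    total = 0
    n_count = 0
    for line in uniq_c_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        count, symbol = int(fields[0]), fields[1]
        total += count
        if symbol in ("N", "n"):
            n_count += count
    if total == 0:
        raise ValueError("no counts found in input")
    return 100.0 * n_count / total
```

With the N row included in the denominator the figure comes out somewhat lower than with the non-N denominator.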
EMBOSS compseq does it in under 1 min. But for some reason, the output doesn't report bases in lower case separately. (Working on the code.)
$ time(compseq -word 1 hg38.fa result.cmp)
Calculate the composition of unique words in sequences
real 0m58.809s
user 0m58.172s
sys 0m0.620s
output:
#
# Output from 'compseq'
#
# The Expected frequencies are calculated on the (false) assumption that every
# word has equal frequency.
#
# The input sequences are:
# chr1
# chr10
# chr11
# chr11_KI270721v1_random
# chr12
# chr13
# chr14
# chr14_GL000009v2_random
# chr14_GL000225v1_random
# chr14_KI270722v1_random
# ... et al.
Word size 1
Total count 3209286105
#
# Word  Obs Count   Obs Frequency  Exp Frequency  Obs/Exp Frequency
#
A       898285419   0.2799019      0.2500000      1.1196078
C       623727342   0.1943508      0.2500000      0.7774032
G       626335137   0.1951634      0.2500000      0.7806535
T       900967885   0.2807378      0.2500000      1.1229512
Other   159970322   0.0498461      0.0000000      10000000000.0000000
Amazingly fast indeed. It also exhibits the endemic and tragic shortsightedness that permeates EMBOSS: tacit flaws in the implementation of just about every tool.
See how it ignores the Ns? As if those did not exist. But they do...
EMBOSS is a toolset that was invented too soon, too far ahead of its time, and with that comes the awkwardness of its interface that keeps it from being successful.
But why does your result have B, K, M, N, R, S, W and Y?
It's the degenerate (IUPAC) nucleotide alphabet (S = 'Strong' = C or G).
See also http://plindenbaum.blogspot.fr/2013/07/g1kv37-vs-hg19.html
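For reference, the degenerate codes can be written down as a small lookup table. This is a sketch of the standard IUPAC nucleotide codes; the names IUPAC and expand are just illustrative.

```python
# IUPAC nucleotide codes: each symbol maps to the bases it may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",   # puRine
    "Y": "CT",   # pYrimidine
    "S": "CG",   # Strong
    "W": "AT",   # Weak
    "K": "GT",   # Keto
    "M": "AC",   # aMino
    "B": "CGT",  # not A
    "D": "AGT",  # not C
    "H": "ACT",  # not G
    "V": "ACG",  # not T
    "N": "ACGT", # any base
}


def expand(symbol: str) -> str:
    """Return the bases a (possibly lower-case) IUPAC symbol can represent."""
    return IUPAC[symbol.upper()]
```

So a row such as S in a per-symbol count means "C or G at this position", not a fifth kind of base.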
See this answer:
time(zgrep -v ">" ../reference/hg38/hg38.fa.gz | fold -w1 | sort | uniq -c)
463840423 a
434444996 A
328257999 c
295469343 C
330651380 g
295683757 G
3144 n
159967178 N
465881183 t
435086702 T
real 52m44.104s
user 45m5.512s
sys 0m49.220s
Sum: 3209286105 bases, and 455 FASTA headers.
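The upper-case and lower-case rows can be folded together to cross-check against the compseq table posted earlier in this thread. A small sketch using only the counts listed above; reading lower case as soft-masked sequence is an assumption about this particular hg38.fa file.

```python
from collections import Counter

# Counts copied from the zgrep | fold | sort | uniq -c output above.
counts = {
    "a": 463840423, "A": 434444996,
    "c": 328257999, "C": 295469343,
    "g": 330651380, "G": 295683757,
    "n": 3144,      "N": 159967178,
    "t": 465881183, "T": 435086702,
}

# Fold lower case into upper case, as compseq appears to do.
folded = Counter()
for symbol, n in counts.items():
    folded[symbol.upper()] += n

total = sum(counts.values())
# Assumption: lower-case letters mark soft-masked sequence in this file.
lower_fraction = sum(n for s, n in counts.items() if s.islower()) / total

for symbol in sorted(folded):
    print(symbol, folded[symbol])
print("total", total)
print(f"lower-case fraction: {lower_fraction:.4f}")
```

The folded A, C, G and T totals agree with the compseq rows, and the folded N total equals its "Other" row.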
Could you clarify what you want to do or find out? Are you asking how to count bases in a certain file type? Or is your question just a general one about AT/GC content in human?
I am asking generally: how many A's, T's, C's and G's are there in the human genome? Which nucleotide is more common in the human genome?
Hello all, when I looked at the numbers in each experiment's output, I could see that A and T appear more frequently than C and G. So, can I say that the human genome has more A's and T's than C's and G's?
Yes, this is a rather well-known fact. Search for GC-content for more info. There are also Chargaff's rules, which state that the number of As ~= the number of Ts and the number of Gs ~= the number of Cs.
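Both points can be checked against the case-folded hg38 counts from the compseq output posted in this thread. A minimal sketch; the numbers are copied from that table, the variable names are illustrative, and Ns are left out of the denominator.

```python
# Case-folded hg38 counts from the compseq output in this thread.
A = 898285419
C = 623727342
G = 626335137
T = 900967885

# GC content among the called bases (Ns excluded).
gc_content = (G + C) / (A + C + G + T)

# Chargaff-style ratios; both should be close to 1.
a_over_t = A / T
g_over_c = G / C

print(f"GC content: {gc_content:.4f}")
print(f"A/T ratio:  {a_over_t:.4f}")
print(f"G/C ratio:  {g_over_c:.4f}")
```

For this build the GC content comes out at roughly 41%, and both ratios are within about half a percent of 1.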
What about the masked N bases? Many of those are centromeric 'difficult to sequence' regions, and are extremely high in GC content.
The reference genome is always just our best representation of the genome. The stats comparing AT to GC from the reference build may therefore not be accurate (?)
While that may be true, we have to work with what is available now.
May I know what you mean by "we have to work with what is available now"?
@Kevin was indicating that many of the N's currently used as placeholders in centromeric/telomeric regions may turn out to be G's and C's in reality (current sequencing technologies are not able to sequence those regions fully). If and when we are able to sequence these regions, the proportion of AT/GC will change (as it does with each major genome build).
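To get a feel for how much the placeholders could matter, here is a rough bounding calculation using the case-folded hg38 counts from the compseq output in this thread. It is only a sketch of the two extremes (every N turning out to be A/T, or every N turning out to be G/C), not a prediction; the function name gc_bounds is made up for illustration.

```python
# Case-folded hg38 totals from the compseq output in this thread.
AT = 898285419 + 900967885
GC = 623727342 + 626335137
N = 159970322  # reported as "Other" by compseq


def gc_bounds(at: int, gc: int, n: int) -> tuple:
    """Lowest and highest possible GC fraction once every N is resolved."""
    total = at + gc + n
    lowest = gc / total         # extreme case: every N is really A or T
    highest = (gc + n) / total  # extreme case: every N is really G or C
    return lowest, highest


low, high = gc_bounds(AT, GC, N)
print(f"GC fraction if every N were A/T: {low:.4f}")
print(f"GC fraction if every N were G/C: {high:.4f}")
```

Even in the extreme where every N resolves to G or C, the GC fraction under these counts stays below one half.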