I have the hg38 reference sequence. I wanted to calculate the percentage of each of the bases in the entire genome. So how do I find out the %A,%G,%C,%T in the hg38.fa file? Is this information already available? If not, can someone please guide me through the steps required to get this information?
Thanks!
Since it is a reference sequence, once you have the count for each base you can easily calculate the frequencies. See this fast solution from the SU forum (copied and pasted here):
Replace "file" with hg38.fa. Upvote the OP. Unfortunately, it is not case-insensitive, so uppercase and lowercase bases are counted separately. Example run time on an i5-6200 with 8 GB RAM:
Thanks for your reply. So if I want to calculate the %A, should I include "#a" in the calculation too, or just stick to the uppercase A? (Same goes for the other nucleotides.)
Include all of them. The lowercase bases may be soft-masked as repetitive regions, but they are still part of the genome.
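To make that concrete, here is a rough sketch of a count that treats "a" and "A" as the same base. It assumes a plain FASTA file where header lines start with ">", and the function name count_bases_ci is just something I made up for illustration. It loops over every character in awk, so expect it to be slow on a whole genome.

```shell
# Hypothetical helper (name invented for this example): tally every base in a
# FASTA file, treating lowercase (soft-masked) and uppercase letters alike.
count_bases_ci() {
    awk '
        /^>/ { next }                  # skip FASTA header lines
        {
            seq = toupper($0)          # fold soft-masked lowercase into uppercase
            n = length(seq)
            for (i = 1; i <= n; i++) count[substr(seq, i, 1)]++
        }
        END { for (b in count) printf "%s %.0f\n", b, count[b] }
    ' "$1" | sort                      # awk array order is undefined, so sort the output
}
```

Usage would be something like: count_bases_ci hg38.fa. Any N positions show up as their own line, so you can decide whether to keep them in the total.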
This is definitely not very fast and is RAM-intensive, but it is what I use:
grep -v ">" input.fa |grep -o . |sort |uniq -c
This gives counts, which can then be converted to percentages.
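For the conversion step, a minimal sketch: the helper below (counts_to_percent is a made-up name) reads "count base" lines, the format that uniq -c prints, from standard input and reports each base as a percentage of the total. Two assumptions on my part: lowercase and uppercase counts are merged, and everything that was counted (including N) goes into the denominator, so filter out N beforehand if you only want the share among A, C, G and T.

```shell
# Hypothetical helper (name invented for this example): turn "count base" lines,
# as printed by uniq -c, into percentages of the total count.
counts_to_percent() {
    awk '
        { count[toupper($2)] += $1; total += $1 }   # merge "a" with "A", etc.
        END {
            if (total == 0) exit                    # nothing counted, print nothing
            for (b in count) printf "%s\t%.2f\n", b, 100 * count[b] / total
        }
    ' | sort                                        # awk array order is undefined
}
```

It can be appended to the pipeline, for example: grep -v ">" input.fa | grep -o . | sort | uniq -c | counts_to_percent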