Super-easy with seqtk
:
## Get seqtk from conda or from Git:
conda install -c bioconda seqtk
## This will give you the output (see below) for each chromosome in hg38.fa
seqtk comp hg38.fa
Output format: chr, length, #A, #C, #G, #T, #2, #3, #4, #CpG, #tv, #ts, #CpG-ts
## Or for the entire fasta:
seqtk comp hg38.fa | awk 'OFS="\t" {sumA+=$3; sumC+=$4; sumG+=$5; sumT+=$6} END {print "A:"sumA,"C:"sumC,"G:"sumG,"T:"sumT}'
If you want fractions, just modify the snippet to divide by the sum of the four columns or the entire number of characters in hg38.fa.
By the way, the numbers for hg38, including the EBV decoy, including random and U contigs, excluding any ALT sequences are:
A:866386023 C:598632365 G:600803779 T:868882461
It is a reference sequence and once you get the numbers for each base, you can calculate the frequency easy. Refer to a fast solution in SU forum (copy/pasted here):
replace
file
with hg38.fa. Up vote OP. Unfortunately it is not case insensitive.Example run time on an i5-6200 with 8 gb ram:
Thanks for your reply. So if I want to calculate the %A, then should I include "#a" in the calculation too? Or just stick to the upper case A. (Same goes for other nucleotides)
All. They might be soft-masked as repetitive region nucleotides, but are still part of the genome.
Definitely not very fast and is ram intensive but this is what I use
grep -v ">" input.fa |grep -o . |sort |uniq -c
Gives counts which can then be converted to percentages