A perennial question on bioinformatics sites is how to compute the effective genome size for a genome. epic now includes a script called epic-effective to do just this. It can use multiple cores. So the next time someone asks about the effective genome size, you know where to point them.
endrebak@havpryd ~/c/epic> epic-effective -h
epic-effective
Compute the effective genome size from a fasta file.
(Visit github.com/endrebak/epic for examples and help.)
Usage:
epic-effective --read-length=LEN [--nb-cpu=CPU] FILE
epic-effective --help
Arguments:
FILE Fasta genome
-r LEN --read-length LEN length of reads
Options:
-h --help show this help message
-n CPU --nb-cpu CPU number of cores to use [default: 1]
endrebak@havpryd ~/c/epic> time epic-effective -r 35 -n 30 ~/genomes/hg19.fa
File analyzed: /local/home/endrebak/genomes/hg19.fa
Genome length: 3095693983
Number unique 35-mers: 2529802735
Effective genome size: 0.8172005207531522
3250.78user 32.13system 4:46.33elapsed 1146%CPU (0avgtext+0avgdata 100815072maxresident)k
6186643inputs+0outputs (0major+162990minor)pagefaults 0swaps
To install the epic-package, just use pip install bioepic
.
It is uses jellyfish under the hood so you need to install that too. If you use conda conda install jellyfish
works on linux64.
I have included how I compute the effective genome size below. If the formula is too simplistic, please tell me.
This is how I compute the EGS:
The effective genome size for a genome G and a read-length L is the number of unique L-mers in G divided by the length of G.
So for reads of length 2 the effective genome size of the geome CCCGNN is the following:
len(["CG"]) / len("CCCGNN")
or 1/6.
Edit: sorry about the bump. My link was wrong!