How To Interpret The Formula Of Estimating The Genome Size From Kmers
10.9 years ago
Sabiha • 0

This is the formula to estimate the genome size.

N = (M*L)/(L-K+1)

and

Genome_size = T/N,

where

N: sequencing depth, M: k-mer peak depth, K: k-mer size, L: average read length, T: total bases.
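
As a minimal sketch, the two formulas can be written out in Python. The function names and the input values are hypothetical, chosen only to illustrate the arithmetic; they are not taken from real data.

```python
def estimate_depth(kmer_peak, read_length, k):
    """Base-level depth: N = (M * L) / (L - K + 1)."""
    if k > read_length:
        raise ValueError("k must not exceed the read length")
    return kmer_peak * read_length / (read_length - k + 1)


def estimate_genome_size(total_bases, kmer_peak, read_length, k):
    """Genome size: T / N."""
    return total_bases / estimate_depth(kmer_peak, read_length, k)


# Hypothetical inputs: M = 20, L = 100, K = 21, T = 2.5 Gbp
depth = estimate_depth(kmer_peak=20, read_length=100, k=21)
size = estimate_genome_size(2_500_000_000, kmer_peak=20, read_length=100, k=21)
print(depth)  # 25.0
print(size)   # 100000000.0
```

With these made-up numbers, a k-mer peak of 20 corresponds to a base-level depth of 25, so 2.5 Gbp of reads would suggest a genome of about 100 Mbp.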

What does "total bases" mean here?

Is it the total number of bases in the reads used by an assembler (ABySS, SOAP)?

How can I find out the total number of bases?

genome kmer • 4.1k views

Hi Sabiha,

Have you ever come across a situation where different k values lead to very different genome size estimates? For example, with a read length of 300 bp, I tested my data with k=33 and got a single peak at 26, then with k=121 and got a single peak at 7. The depth N calculated from your formula is 29.1 and 11.7, respectively. Since T (total bases) is the same in both cases (it is the same data set), the two genome size estimates differ greatly. How would you decide which one is the better estimate of the genome size?

Thanks.
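
The arithmetic in this comment can be checked with a short sketch. It uses only the values quoted above (L = 300, a peak of 26 at k = 33 and a peak of 7 at k = 121); the helper name is hypothetical. Because T is the same for both runs, the ratio between the two genome size estimates depends only on the two depths.

```python
def depth(kmer_peak, read_length, k):
    """Base-level depth: N = (M * L) / (L - K + 1)."""
    return kmer_peak * read_length / (read_length - k + 1)


n_k33 = depth(26, 300, 33)
n_k121 = depth(7, 300, 121)

# Genome size is T / N, so size(k=121) / size(k=33) = N(k=33) / N(k=121).
size_ratio = n_k33 / n_k121

print(round(n_k33, 1), round(n_k121, 1))  # 29.1 11.7
print(round(size_ratio, 2))               # 2.49
```

So the estimate at k=121 comes out roughly 2.5 times larger than the one at k=33.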

9.8 years ago
edrezen ▴ 730

Total bases is the number of nucleotides in your data. For instance:

>seq1 len=15
ATCACACAGTTGTAC
>s2 len=21
ATAGATAGAATATGATAGATA

Here, T=15+21=36

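If your file is in FASTA format, you can use the following to get the number of bases. The tr step removes the line breaks, which wc -c would otherwise count as characters:

grep -v "^>" file.fasta | tr -d '\r\n' | wc -c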
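
The same count can be sketched in Python, assuming plain FASTA input where header lines start with ">". The function name is hypothetical, and the example records are the two sequences shown above.

```python
def count_fasta_bases(lines):
    """Sum the lengths of all sequence lines, skipping headers and blank lines."""
    total = 0
    for line in lines:
        line = line.strip()  # drop the newline and any surrounding whitespace
        if not line or line.startswith(">"):
            continue
        total += len(line)
    return total


example = [
    ">seq1 len=15",
    "ATCACACAGTTGTAC",
    ">s2 len=21",
    "ATAGATAGAATATGATAGATA",
]
print(count_fasta_bases(example))  # 36
```

For a file on disk, an open file handle can be passed directly, since it iterates line by line: count_fasta_bases(open("file.fasta")).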