Question

Simulating Read Depth

4

Entering edit mode

11.4 years ago

Zev.Kronenberg 12k

Greetings all,

I am trying to simulate a vector of NGS depth based on mean depth. I do NOT want to simulate reads, just depth.

I have tried simulating a random walk, starting at the mean depth. This captures the autocorrelation of depth across positions, but fails terribly at modeling average depth.

Any ideas would be great!

-Zev

reads depth-of-coverage simulation • 5.7k views

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 11.4 years ago by Zev.Kronenberg 12k

0

Entering edit mode

What would you use the data for? I think that might help narrow down what properties are important and which aren't.

ADD REPLY • link 11.4 years ago by KCC ★ 4.1k

Ram · Answer 1 · 2013-06-28

#These two functions show how the lander-waterman (1988) genome coverage model works.  Both functions take three parameters. The first function shows how increasing the number of reads increases the probability of a read covering a base (randomly chosen from the genome).  The second shows how increasing the read length affects the probability of a read read covering a base.
#http://en.wikipedia.org/wiki/DNA_sequencing_theory#Lander-Waterman_theory

#Give them a try 

# The probability a base is NOT covered:   e^-(read_length*numer_of_reads/genome_length)
# The probability a base is     covered: 1-e^-(read_length*numer_of_reads/genome_length)

lander.waterman.increasing.reads<-function(max_number_of_reads, read_length, genome_length){
    pdat<-curve(1-exp(1)^-(read_length*x/genome_length), from=1, to=max_number_of_reads)
    plot(pdat$x, pdat$y, xlab="Number of reads", ylab="Prob. of a base being covered", type="l", lwd=2)
}

#Example 1: lander.waterman.increasing.reads(max_number_of_reads=1e9, read_length=100, genome_length=3137161264)

lander.waterman.increasing.read.length<-function(number_of_reads, max_read_length, genome_length){
    pdat<-curve(1-exp(1)^-(x*number_of_reads/genome_length), from=1, to=max_read_length)
    plot(pdat$x, pdat$y, xlab="Read length", ylab="Prob. of a base being covered", type="l", lwd=2)
}

#Example 2: lander.waterman.increasing.read.length(number_of_reads=1e6, max_read_length=1000, genome_length=3137161264)

score 5 · Answer 2 · 2013-06-28

You could assume or determine a distribution of read depths (could be measured from empirical data). Then, when you try to determine your next random step you could associate a probability of going up and of going down based on your position in that distribution. This would make your coverages oscillate around the mean.

EDIT: Even though you specified you did not wanted to simulate reads, I was curious and assessed this option. I created an empty genome and put reads at random positions to reach a given coverage. Here is the code:

#!/usr/bin/python
"""Simulate depth of coverage on a per nucleotide basis given a genome size, a
read length, and an average depth

USAGE:
    python simulate_depth.py genome_size read_length average_coverage

genome_size: length of the genome in nucleotides
read_length: length of the reads in nucleotide
average_coverage: average sequencing coverage on the genome

"""

# Importing libraries
import sys
import random

# Main
if __name__ == '__main__':
    try:
        genome_size = int(sys.argv[1])
        read_length = int(sys.argv[2])
        average_coverage = int(sys.argv[3])
    except:
        print __doc__
        sys.exit(1)

    genome = [0] * genome_size
    num_reads = int(float(genome_size) * average_coverage / read_length)
    for read in xrange(num_reads):
        start = random.randint(0, genome_size - read_length)
        for position in xrange(start, start + read_length):
            genome[position] += 1
    print genome

What happens is that the process is really random and so the coverage ends up being pretty uniform throughout. Anybody who has mapped reads to a reference genome knows that such a distribution is not realistic. Coverage will likely pile up high at some positions and be equal to zero at others. Here is a figure from a run with a genome size of 20000 bp, read length of 10 bp and average coverage of 20:

enter image description here

score 5 · Answer 3 · 2013-06-28

5

Entering edit mode

11.4 years ago

Chris Miller 22k

A few thoughts:

If selection of reads to sequence is truly random, then it's a Poisson process.
Looking at real-world data, I can tell you that measuring depth in windows is better approximated by a negative binomial distribution, using an overdispersion of 2-3.
It's straightforward to have R generate a vector of window depths, but as you say, they'll be random, not correlated to each other.

Is there any reason why this has to be purely simulated? Why not just take a dozen or so genomes, average them together and come up with a good empirical solution?

ADD COMMENT • link 11.4 years ago by Chris Miller 22k

0

Entering edit mode

Chris Miller I was thinking empirical data could be used if I cannot get this model to work. The only problem is adding and subtracting (to change the mean depth) does not work so well in the cases of zero depth.

ADD REPLY • link 11.4 years ago by Zev.Kronenberg 12k

score 2 · Answer 4 · 2013-06-28

2

Entering edit mode

11.4 years ago

Lee Katz ★ 3.2k

You'll want to look at the Lander-Waterman equation. I am not sure exactly how to do it, but the way to do simulate depth will probably be derived from the equation and the stats behind it.

http://www.math.ucsd.edu/~gptesler/186/slides/shotgun_12-handout.pdf

ADD COMMENT • link 11.4 years ago by Lee Katz ★ 3.2k

0

Entering edit mode

Great resource. I put the [r] functions below. I have actually implemented these equations. They make really fun plots.

ADD REPLY • link 11.4 years ago by Zev.Kronenberg 12k

score 2 · Answer 5 · 2013-07-01

What if you used a negative binomial or Poisson distribution to generate the reads. Then, you could use a Cholesky decomposition to create a new sequence that had the correlation between values you were seeking. You could get realistic patterns of correlation between bases from empirical data. (This is what I am thinking.)

Now, to do this properly, one would need to construct a rather large matrix and that's probably not going to work at all. However, you could try constructing a smoothing filter based on the local correlation over a 100 bp window and see if the local correlation translates into a reasonable global pattern.

You should be able to scale the new sequence after the fact to get the mean and variance you want.

I am assuming from your question that you don't want to store the whole genome while you are doing this.