Simulating Read Depth
5
4
Entering edit mode
11.4 years ago

Greetings all,

I am trying to simulate a vector of NGS depth based on mean depth. I do NOT want to simulate reads, just depth.

I have tried simulating a random walk, starting at the mean depth. This captures the autocorrelation of depth across positions, but fails terribly at modeling average depth.

Any ideas would be great!

-Zev

reads depth-of-coverage simulation • 5.6k views
ADD COMMENT
0
Entering edit mode

What would you use the data for? I think that might help narrow down what properties are important and which aren't.

ADD REPLY
6
Entering edit mode
11.4 years ago
#These two functions show how the lander-waterman (1988) genome coverage model works.  Both functions take three parameters. The first function shows how increasing the number of reads increases the probability of a read covering a base (randomly chosen from the genome).  The second shows how increasing the read length affects the probability of a read read covering a base.
#http://en.wikipedia.org/wiki/DNA_sequencing_theory#Lander-Waterman_theory

#Give them a try 

# The probability a base is NOT covered:   e^-(read_length*numer_of_reads/genome_length)
# The probability a base is     covered: 1-e^-(read_length*numer_of_reads/genome_length)

lander.waterman.increasing.reads<-function(max_number_of_reads, read_length, genome_length){
    pdat<-curve(1-exp(1)^-(read_length*x/genome_length), from=1, to=max_number_of_reads)
    plot(pdat$x, pdat$y, xlab="Number of reads", ylab="Prob. of a base being covered", type="l", lwd=2)
}

#Example 1: lander.waterman.increasing.reads(max_number_of_reads=1e9, read_length=100, genome_length=3137161264)

lander.waterman.increasing.read.length<-function(number_of_reads, max_read_length, genome_length){
    pdat<-curve(1-exp(1)^-(x*number_of_reads/genome_length), from=1, to=max_read_length)
    plot(pdat$x, pdat$y, xlab="Read length", ylab="Prob. of a base being covered", type="l", lwd=2)
}

#Example 2: lander.waterman.increasing.read.length(number_of_reads=1e6, max_read_length=1000, genome_length=3137161264)
ADD COMMENT
0
Entering edit mode

Hi Zev,

I was wondering whether you came up with a way to implement your 'read depth simulator' in the end. I am also trying to create a 'read depth simulator' and I was hoping not to reinvent the wheel!

ADD REPLY
5
Entering edit mode
11.4 years ago

You could assume or determine a distribution of read depths (could be measured from empirical data). Then, when you try to determine your next random step you could associate a probability of going up and of going down based on your position in that distribution. This would make your coverages oscillate around the mean.

EDIT: Even though you specified you did not wanted to simulate reads, I was curious and assessed this option. I created an empty genome and put reads at random positions to reach a given coverage. Here is the code:

#!/usr/bin/python
"""Simulate depth of coverage on a per nucleotide basis given a genome size, a
read length, and an average depth

USAGE:
    python simulate_depth.py genome_size read_length average_coverage

genome_size: length of the genome in nucleotides
read_length: length of the reads in nucleotide
average_coverage: average sequencing coverage on the genome

"""

# Importing libraries
import sys
import random

# Main
if __name__ == '__main__':
    try:
        genome_size = int(sys.argv[1])
        read_length = int(sys.argv[2])
        average_coverage = int(sys.argv[3])
    except:
        print __doc__
        sys.exit(1)

    genome = [0] * genome_size
    num_reads = int(float(genome_size) * average_coverage / read_length)
    for read in xrange(num_reads):
        start = random.randint(0, genome_size - read_length)
        for position in xrange(start, start + read_length):
            genome[position] += 1
    print genome

What happens is that the process is really random and so the coverage ends up being pretty uniform throughout. Anybody who has mapped reads to a reference genome knows that such a distribution is not realistic. Coverage will likely pile up high at some positions and be equal to zero at others. Here is a figure from a run with a genome size of 20000 bp, read length of 10 bp and average coverage of 20:

enter image description here

ADD COMMENT
0
Entering edit mode

What if instead of drawing you start position from the random distribution, you did it from a poisson process?

ADD REPLY
0
Entering edit mode

I could probably think of 10 different implementations that would model the distribution better, but there are more interesting problems to explore :) The code is there however, so modify it like you want or write your own version and post your results!

ADD REPLY
5
Entering edit mode
11.4 years ago

A few thoughts:

  • If selection of reads to sequence is truly random, then it's a Poisson process.
  • Looking at real-world data, I can tell you that measuring depth in windows is better approximated by a negative binomial distribution, using an overdispersion of 2-3.
  • It's straightforward to have R generate a vector of window depths, but as you say, they'll be random, not correlated to each other.

Is there any reason why this has to be purely simulated? Why not just take a dozen or so genomes, average them together and come up with a good empirical solution?

ADD COMMENT
0
Entering edit mode

Chris Miller I was thinking empirical data could be used if I cannot get this model to work. The only problem is adding and subtracting (to change the mean depth) does not work so well in the cases of zero depth.

ADD REPLY
2
Entering edit mode
11.4 years ago
Lee Katz ★ 3.2k

You'll want to look at the Lander-Waterman equation. I am not sure exactly how to do it, but the way to do simulate depth will probably be derived from the equation and the stats behind it.

http://www.math.ucsd.edu/~gptesler/186/slides/shotgun_12-handout.pdf

ADD COMMENT
0
Entering edit mode

Great resource. I put the [r] functions below. I have actually implemented these equations. They make really fun plots.

ADD REPLY
2
Entering edit mode
11.4 years ago
KCC ★ 4.1k

What if you used a negative binomial or Poisson distribution to generate the reads. Then, you could use a Cholesky decomposition to create a new sequence that had the correlation between values you were seeking. You could get realistic patterns of correlation between bases from empirical data. (This is what I am thinking.)

Now, to do this properly, one would need to construct a rather large matrix and that's probably not going to work at all. However, you could try constructing a smoothing filter based on the local correlation over a 100 bp window and see if the local correlation translates into a reasonable global pattern.

You should be able to scale the new sequence after the fact to get the mean and variance you want.

I am assuming from your question that you don't want to store the whole genome while you are doing this.

ADD COMMENT
0
Entering edit mode

This sounds like a viable solution. I am looking into Cholesky decomposition. I really suck at linear algebra.

ADD REPLY

Login before adding your answer.

Traffic: 1920 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6