I am trying to simulate a vector of NGS depth based on mean depth. I do NOT want to simulate reads, just depth.
I have tried simulating a random walk, starting at the mean depth. This captures the autocorrelation of depth across positions, but fails terribly at modeling average depth.
#These two functions show how the lander-waterman (1988) genome coverage model works. Both functions take three parameters. The first function shows how increasing the number of reads increases the probability of a read covering a base (randomly chosen from the genome). The second shows how increasing the read length affects the probability of a read read covering a base.
#http://en.wikipedia.org/wiki/DNA_sequencing_theory#Lander-Waterman_theory
#Give them a try
# The probability a base is NOT covered: e^-(read_length*numer_of_reads/genome_length)
# The probability a base is covered: 1-e^-(read_length*numer_of_reads/genome_length)
lander.waterman.increasing.reads<-function(max_number_of_reads, read_length, genome_length){
pdat<-curve(1-exp(1)^-(read_length*x/genome_length), from=1, to=max_number_of_reads)
plot(pdat$x, pdat$y, xlab="Number of reads", ylab="Prob. of a base being covered", type="l", lwd=2)
}
#Example 1: lander.waterman.increasing.reads(max_number_of_reads=1e9, read_length=100, genome_length=3137161264)
lander.waterman.increasing.read.length<-function(number_of_reads, max_read_length, genome_length){
pdat<-curve(1-exp(1)^-(x*number_of_reads/genome_length), from=1, to=max_read_length)
plot(pdat$x, pdat$y, xlab="Read length", ylab="Prob. of a base being covered", type="l", lwd=2)
}
#Example 2: lander.waterman.increasing.read.length(number_of_reads=1e6, max_read_length=1000, genome_length=3137161264)
I was wondering whether you came up with a way to implement your 'read depth simulator' in the end. I am also trying to create a 'read depth simulator' and I was hoping not to reinvent the wheel!
ADD REPLY
• link
updated 3.0 years ago by
Ram
44k
•
written 10.3 years ago by
Dataman
▴
380
You could assume or determine a distribution of read depths (could be measured from empirical data). Then, when you try to determine your next random step you could associate a probability of going up and of going down based on your position in that distribution. This would make your coverages oscillate around the mean.
EDIT: Even though you specified you did not wanted to simulate reads, I was curious and assessed this option. I created an empty genome and put reads at random positions to reach a given coverage. Here is the code:
#!/usr/bin/python
"""Simulate depth of coverage on a per nucleotide basis given a genome size, a
read length, and an average depth
USAGE:
python simulate_depth.py genome_size read_length average_coverage
genome_size: length of the genome in nucleotides
read_length: length of the reads in nucleotide
average_coverage: average sequencing coverage on the genome
"""
# Importing libraries
import sys
import random
# Main
if __name__ == '__main__':
try:
genome_size = int(sys.argv[1])
read_length = int(sys.argv[2])
average_coverage = int(sys.argv[3])
except:
print __doc__
sys.exit(1)
genome = [0] * genome_size
num_reads = int(float(genome_size) * average_coverage / read_length)
for read in xrange(num_reads):
start = random.randint(0, genome_size - read_length)
for position in xrange(start, start + read_length):
genome[position] += 1
print genome
What happens is that the process is really random and so the coverage ends up being pretty uniform throughout. Anybody who has mapped reads to a reference genome knows that such a distribution is not realistic. Coverage will likely pile up high at some positions and be equal to zero at others. Here is a figure from a run with a genome size of 20000 bp, read length of 10 bp and average coverage of 20:
I could probably think of 10 different implementations that would model the distribution better, but there are more interesting problems to explore :) The code is there however, so modify it like you want or write your own version and post your results!
If selection of reads to sequence is truly random, then it's a Poisson process.
Looking at real-world data, I can tell you that measuring depth in windows is better approximated by a negative binomial distribution, using an overdispersion of 2-3.
It's straightforward to have R generate a vector of window depths, but as you say, they'll be random, not correlated to each other.
Is there any reason why this has to be purely simulated? Why not just take a dozen or so genomes, average them together and come up with a good empirical solution?
Chris Miller I was thinking empirical data could be used if I cannot get this model to work. The only problem is adding and subtracting (to change the mean depth) does not work so well in the cases of zero depth.
You'll want to look at the Lander-Waterman equation. I am not sure exactly how to do it, but the way to do simulate depth will probably be derived from the equation and the stats behind it.
What if you used a negative binomial or Poisson distribution to generate the reads. Then, you could use a Cholesky decomposition to create a new sequence that had the correlation between values you were seeking. You could get realistic patterns of correlation between bases from empirical data. (This is what I am thinking.)
Now, to do this properly, one would need to construct a rather large matrix and that's probably not going to work at all. However, you could try constructing a smoothing filter based on the local correlation over a 100 bp window and see if the local correlation translates into a reasonable global pattern.
You should be able to scale the new sequence after the fact to get the mean and variance you want.
I am assuming from your question that you don't want to store the whole genome while you are doing this.
What would you use the data for? I think that might help narrow down what properties are important and which aren't.