Hi, everyone!
I am evaluating the performance of my algorithm and need simulated data for it. I looked into the simulator used by the 1000 Genomes Project, but people there said it was outdated and suggested finding a newer one.
I have heard of a program called ART, but I cannot find it on the web.
I have also read some similar papers that used simulation, but none of them stated which existing software or data sets they used.
Besides software or data sets that can do the simulation out of the box, I would like to know more about the details of, and the principles behind, the simulation.
The settings are like this:
- First, construct the diploid genome of a human (considering only SNPs/indels, not other types of variation).
- Generate templates with Gaussian-distributed lengths, drawn with equal probability from the 4 DNA strands (+/- strands of the two homologous chromosomes), with an error rate similar to that of a sequencing machine such as the Illumina HiSeq 2000.
- Get 100 bp reads from each template.
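To make the template and read steps above concrete, here is a minimal Python sketch. It is only an illustration of the settings listed, not how ART or any other published simulator works. The function name `simulate_reads`, the default template length (mean 300, SD 30), and the uniform substitution-only error model are all assumptions; a real Illumina error profile depends on read position and base quality.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def simulate_reads(hap_a, hap_b, n_templates, mean_len=300, sd_len=30,
                   read_len=100, error_rate=0.01, rng=None):
    """Hypothetical helper: sample single-end reads from a diploid genome.

    hap_a, hap_b: the two haplotype sequences (same chromosome).
    error_rate:   per-base substitution probability (assumed uniform).
    """
    if rng is None:
        rng = random.Random()
    # The 4 strands: +/- of each homologous chromosome.
    strands = [hap_a, reverse_complement(hap_a),
               hap_b, reverse_complement(hap_b)]
    reads = []
    for _ in range(n_templates):
        source = rng.choice(strands)              # equal probability per strand
        # Gaussian template length, clamped to [read_len, len(source)].
        length = max(read_len, int(round(rng.gauss(mean_len, sd_len))))
        length = min(length, len(source))
        start = rng.randint(0, len(source) - length)
        template = source[start:start + length]
        read = list(template[:read_len])          # read from the 5' end
        for i, base in enumerate(read):
            if rng.random() < error_rate:         # substitution error
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(read))
    return reads
```

For paired-end data, the second read would be the reverse complement of the last `read_len` bases of the same template.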
The key is how to construct the diploid genome so that it best resembles a "typical" person in the population under study. Does anyone have any ideas? Should I randomly select each bp to differ from the reference with a probability equal to the mutation rate, say 1%? But the mutation rate should differ between regions, so how can that be simulated? Or, as long as the study does not depend on the distribution of the variants, can this be omitted?
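One naive way to handle region-specific rates is sketched below in Python, under loud assumptions: the function `make_diploid`, the `regions` format, and the `het_fraction` parameter are made up for illustration, the rates are placeholders rather than measured values, and only SNPs are placed (indels are left out to keep it short). It does not model allele frequencies or linkage, so it will not reproduce a realistic "typical" individual; sampling known variants from a population catalogue would be closer to that.

```python
import random

def make_diploid(ref, regions, het_fraction=0.6, rng=None):
    """Hypothetical helper: place random SNPs on two copies of a reference.

    ref:          reference sequence (string of A/C/G/T).
    regions:      list of (start, end, snp_rate); half-open, non-overlapping.
                  Positions outside all regions are left unchanged.
    het_fraction: assumed share of variant sites that are heterozygous.
    Returns (hap_a, hap_b, variants), where variants is a list of
    (position, ref_base, alt_base, genotype) tuples.
    """
    if rng is None:
        rng = random.Random()
    hap_a, hap_b = list(ref), list(ref)
    variants = []
    for start, end, rate in regions:
        for pos in range(start, min(end, len(ref))):
            if rng.random() >= rate:
                continue                           # no variant at this site
            ref_base = ref[pos]
            alt = rng.choice([b for b in "ACGT" if b != ref_base])
            if rng.random() < het_fraction:
                # Heterozygous: change one randomly chosen haplotype.
                target = rng.choice((hap_a, hap_b))
                target[pos] = alt
                genotype = "het"
            else:
                # Homozygous alternate: change both haplotypes.
                hap_a[pos] = alt
                hap_b[pos] = alt
                genotype = "hom"
            variants.append((pos, ref_base, alt, genotype))
    return "".join(hap_a), "".join(hap_b), variants
```

Keeping the returned `variants` list as the truth set is what makes the simulation useful for benchmarking: calls made on the simulated reads can be compared against it.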
Thank you very much!
Yi
Whoa! That's one gigantic list! Goes to show the richness in just any subdomain of bioinformatics.
5 years later, this is still a very impressive list. I have nothing to add but a (unfortunately not so recent) paper that might be of use as an additional reference:
https://dx.doi.org/10.1038%2Fnrg.2016.57
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5224698/
Originally posted by @Joseph Hughes
BBMap's RandomReads: Generates single-ended or paired Illumina reads, or PacBio reads, from a genome. Also has a metagenome mode.