I am benchmarking SV detection methods against a simulated genome.
What indel-to-SNP ratio should I use when simulating the variants?
I read in the Dindel paper that they used a 1:9 ratio (1 indel for every 9 SNPs), and in "Genetic Variation in an Individual Human Exome" that the ratio is 1:7 genome-wide (but 1:43 in coding regions).
For one individual sequenced with 100 bp reads, I believe the indel-to-SNP ratio is about 1:6. Read length certainly matters. The number of samples is likely to have an effect, too.
EDIT: Most indels are in long tandem repeats. If your reads are shorter than the repeat, there is no way to call the indel. The real ratio should be between 1:5 and 1:6 per individual if you could sequence the entire chromosome. Nonetheless, it is very hard to place simulated indels realistically. For your purpose, the exact ratio does not matter at all. The real question is where to put the indels.
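To make the ratio concrete, here is a minimal sketch of how variant types could be drawn at a chosen indel-to-SNP ratio. The function name `place_variants` and its parameters are hypothetical and not taken from any simulator. It places variants uniformly at random, which is a simplifying assumption: as noted above, real indels cluster in tandem repeats, so a realistic simulation would need a different placement step.

```python
import random

def place_variants(genome_length, n_variants, snps_per_indel=6, seed=0):
    """Return sorted (position, type) pairs with roughly 1 indel per
    `snps_per_indel` SNPs.

    Assumption: positions are uniform over the genome. This ignores the
    enrichment of real indels in tandem repeats.
    """
    rng = random.Random(seed)
    # A 1:6 indel-to-SNP ratio means 1 variant in 7 is an indel.
    p_indel = 1.0 / (1 + snps_per_indel)
    # Sample distinct 0-based positions so no two variants collide.
    positions = sorted(rng.sample(range(genome_length), n_variants))
    return [
        (pos, "INDEL" if rng.random() < p_indel else "SNP")
        for pos in positions
    ]

variants = place_variants(genome_length=1_000_000, n_variants=70_000)
n_indels = sum(1 for _, kind in variants if kind == "INDEL")
print(f"{n_indels} indels, {len(variants) - n_indels} SNPs")
```

Changing `snps_per_indel` to 7, 9, or 10 reproduces the other ratios mentioned in this thread, so you can check how sensitive your benchmark is to the choice.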
Thanks for your answer, but I don't understand why the read length matters. I am talking about the real SNPs/indels, not the ones detected by the sequencing analysis. Could you please clarify this point?
Take a look at the 1000 Genomes paper. As I recall, the ratio of variants they found was around 10:1 (SNPs to indels). I don't think anyone would complain if you chose that ratio.
Zam
PS: I agree, I think read length is irrelevant to your question.