Motivation
I would like to model the performance of a structural variant (SV) caller that is being developed in my lab for NGS technologies. One of its strengths is the ability to integrate all available alignment "signals" (e.g., discordant alignments, split-reads, etc.) when calling SVs, and as such, we observe big gains in sensitivity.
To demonstrate the utility of the software in the context of cancer genomics, we would like to demonstrate the ability to detect SVs at different frequencies (e.g. found in 0.1%, 1%, 2%, 5%, 20%, etc.) among the tumor cells and at different overall sequencing depths (e.g., 5X, 10X, 20X). For example, we want to assess our power to detect SVs at 5% frequency in the tumor with 20X coverage and compare this sensitivity to other tools.
The Question
Now, we have the ability to simulate a FASTA file with chromosomes having variants at different frequencies. For example, to simulate a variant present in 10% of the cells, one could create a FASTA file with 9 copies of the wild-type chromosome and 1 copy of the mutant chrom containing the SV.
My question is what read simulation tools can be used to guarantee that, if we ask for say 20X coverage, the 10 versions of the chromosome above are sampled from randomly. In essence we want to ensure that we don't have false negatives owing to a lack of data for the mutant chromosomes - we want to measure algorithmic false negatives, not data false negatives.
Is there a read simulation tool that can handle this? Will wgsim
work?
Also, if you can think of a better way to do the simulation or a tool that is specifically designed to do this, I would be grateful to know of it.
I see, so create two separate FASTA files - one for the mutant, one for the wild-type. Sample from each and pool? I like that - much simpler than what I was thinking. I was unaware of
dwgsim
- since I only want to model sensitivity for SVs, do you know if one sets -r to 0.0 it will inject 0 SNPs?Right, -r is the mutation rate so if it's zero it should add no SNPs and indels (at least according to my understanding of how it works). In that case I guess the -H flag wouldn't matter either, since I think that only affects whether it models heterozygous mutations or not.
I think the only issue that remains is that there is no guarantee that a read spanning the SV breakpoint will be sampled from the mutant chrom, especially at lower coverages. As such, are the coordinates from which the read was sampled in the name field for the resulting sample such that one can check to make sure the breakpoint was sampled (to rule out data FNs from alg. FNs)?
That's true, at low coverage you'll probably miss some breaks. The read names do have coordinates in the FASTA file they were generated from, and you can add a prefix to the read name with -P so that you can label which FASTA file they came from; I've been using this and it works pretty well. If you have lots of SVs it can be tricky to map the read generation coordinates back to a reference coordinate; that's the only hard part.