NGS data simulation: VarSim or BAMSurgeon?
2
7.8 years ago
user230613 ▴ 380

Hi there,

I want to generate NGS data to do some testing and benchmarking of both germline and somatic variant calling. I've read a lot of papers about different tools and different tool benchmarks, but I'd also like to hear your feedback. After reading the papers, I have narrowed it down to two tools: VarSim and BAMSurgeon.

  • BAMSurgeon uses pre-existing BAM files and adds new variants to them. It has been widely used in the DREAM challenges for testing variant calling algorithms, so I assume it works really well. The advantage of using pre-existing BAM files is that you can take real data and then introduce new variants for the benchmarking.
  • On the other hand, VarSim can generate read files, taking a reference genome and a set of variants as input. All the data here is purely simulated (well, the variants can be random or previously described ones), and the advantage is that you can control different types of error (like sequencing errors and so on). Also, since you get FASTQ files, it is possible to test a full alignment + variant calling pipeline.

In the end, what I would like to have is a set of tumor/normal paired FASTQ files together with a truth VCF, and then be able to play with and adjust different parameters like clonality, heterogeneity, contamination, sequencing error, etc.
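For concreteness, here is the kind of truth set I have in mind, sketched in Python (a minimal sketch only: it assumes pysam for FASTA access, and the file names, VAF model and parameters are all made up rather than tied to VarSim or BAMSurgeon input formats). Clonality and normal contamination get folded into a per-variant allele fraction:

    import random
    import pysam

    def write_truth_vcf(ref_fasta, out_vcf, n_snvs=500, clonal_fraction=0.8,
                        contamination=0.1, seed=42):
        """Write a toy somatic truth VCF with simulated allele fractions."""
        random.seed(seed)
        ref = pysam.FastaFile(ref_fasta)  # needs a .fai index (created if missing)
        bases = "ACGT"
        with open(out_vcf, "w") as out:
            out.write("##fileformat=VCFv4.2\n")
            out.write('##INFO=<ID=VAF,Number=1,Type=Float,'
                      'Description="Simulated somatic allele fraction">\n')
            out.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
            written = 0
            while written < n_snvs:
                chrom = random.choice(ref.references)
                pos = random.randint(1, ref.get_reference_length(chrom))  # 1-based
                ref_base = ref.fetch(chrom, pos - 1, pos).upper()
                if ref_base not in bases:
                    continue  # skip N/gap positions
                alt = random.choice([b for b in bases if b != ref_base])
                # Heterozygous SNV diluted by subclonality and normal contamination.
                ccf = 1.0 if random.random() < clonal_fraction else random.uniform(0.1, 0.5)
                vaf = 0.5 * ccf * (1.0 - contamination)
                out.write(f"{chrom}\t{pos}\t.\t{ref_base}\t{alt}\t.\tPASS\tVAF={vaf:.3f}\n")
                written += 1

    # write_truth_vcf("ref.fa", "truth.vcf", clonal_fraction=0.7, contamination=0.2)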

Sorry if the question is too open or broad. I'd like to hear suggestions and personal experiences about the best way to generate this kind of data. If it's specific to exome/targeted sequencing, even better.

Thank you in advance,

simulation varsim bamsurgeon • 7.1k views
3
7.8 years ago
d-cameron ★ 2.9k

For somatic SV simulation, I have yet to find a tool that can generate realistic data. The problem with simulating reads from the reference genome is that you present your variant caller with a much easier problem than actual data. Real data is much messier (especially for repetitive sequence), and by simulating reads from the reference you will overestimate your variant caller's performance.

BAMSurgeon probably comes the closest to realistic data since it uses existing sequencing data, but the types of SV events it can simulate are very limited and it does not handle some important classes of cancer driver mutations such as inter-chromosomal gene fusions. Additionally, the alignment-based event insertion approach taken by BAMSurgeon is not appropriate for repetitive regions, as it assumes that the reads originating from the region where the event is to be simulated are correctly mapped to that region.
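To make that assumption concrete, one thing you can do before spiking an event into a candidate region is check how confidently the existing reads there are mapped. A minimal sketch, assuming pysam and an indexed BAM (the coordinates and MAPQ cut-off below are purely illustrative):

    import pysam

    def fraction_confidently_mapped(bam_path, chrom, start, end, min_mapq=30):
        """Fraction of primary reads in [start, end) mapped at or above min_mapq."""
        total = confident = 0
        with pysam.AlignmentFile(bam_path, "rb") as bam:  # requires a .bai index
            for read in bam.fetch(chrom, start, end):
                if read.is_unmapped or read.is_secondary or read.is_supplementary:
                    continue
                total += 1
                if read.mapping_quality >= min_mapq:
                    confident += 1
        return confident / total if total else 0.0

    # A low value flags a region where the spike-in assumption is shaky:
    # print(fraction_confidently_mapped("normal.bam", "chr1", 1_000_000, 1_001_000))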

That said, I've used ART for SV simulation off hg19, but as you can see from my benchmarking results (http://shiny.wehi.edu.au/cameron.d/sv_benchmark/), ROC curves for the simulated variants are vastly better than the ROC curves for real data. The simulations are useful for determining best-case variant caller performance (e.g. the smallest event size detectable by SV caller X), but should not be taken as reflecting performance on actual data.
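The comparison behind those curves is essentially: match calls against the simulated truth set and sweep the caller's quality threshold. A toy sketch of that idea (the in-memory representation is made up for illustration; this is not the benchmark code itself):

    def roc_points(truth_sites, calls):
        """truth_sites: set of (chrom, pos); calls: list of (chrom, pos, qual)."""
        points = []
        for threshold in sorted({q for _, _, q in calls}, reverse=True):
            kept = [(c, p) for c, p, q in calls if q >= threshold]
            tp = sum(1 for site in kept if site in truth_sites)
            fp = len(kept) - tp
            sensitivity = tp / len(truth_sites) if truth_sites else 0.0
            precision = tp / len(kept) if kept else 1.0
            points.append((threshold, sensitivity, precision, fp))
        return points

    # truth = {("chr1", 1000), ("chr2", 5000)}
    # calls = [("chr1", 1000, 60.0), ("chr3", 42, 10.0)]
    # roc_points(truth, calls)  # one (threshold, sensitivity, precision, FP) point per cut-off

For SVs the matching would of course use breakpoint windows rather than exact positions.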

These issues may be less problematic for SNV and small indel variants.


Do you mean VarSim+ART when you say that you used ART?


Just ART from FASTA files. I created a script to generate the FASTA files, since VarSim only supports simple ins/del/dup/inv SVs.

Entire classes of somatic mutations (gene fusions, chromoplexy/chromothripsis/breakage-fusion-bridge, double minutes, ...) were missing from the simulators the last time I checked. By far the biggest issue I had with somatic simulations was the lack of aneuploidy and inter-chromosomal rearrangements. The majority of the cancers I've analysed were most definitely not simple diploid genomes with some SNVs and simple local rearrangements thrown in. 50+ copies of an unmutated oncogene is not unexpected in cancers showing signs of chromothripsis/breakage-fusion-bridge.
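To give an idea of what that FASTA-rewriting script does, here is a stripped-down sketch that throws in the two event classes above, an inter-chromosomal fusion and a crude amplification (it assumes pysam; the contig names, breakpoints and copy numbers are made up, and real chromothripsis/breakage-fusion-bridge rearrangements are far messier than this):

    import pysam

    def write_contig(out, name, seq, width=60):
        out.write(f">{name}\n")
        for i in range(0, len(seq), width):
            out.write(seq[i:i + width] + "\n")

    def add_somatic_events(ref_fasta, out_fasta):
        ref = pysam.FastaFile(ref_fasta)  # needs a .fai index
        with open(out_fasta, "w") as out:
            # Keep the unmodified chromosomes as the germline background.
            for name in ref.references:
                write_contig(out, name, ref.fetch(name))
            # Inter-chromosomal fusion: 5' end of chr1 joined to the 3' end of chr8.
            fusion = (ref.fetch("chr1", 0, 20_000_000) +
                      ref.fetch("chr8", 100_000_000, ref.get_reference_length("chr8")))
            write_contig(out, "fusion_chr1_chr8", fusion)
            # Double-minute-style amplification: 50 tandem copies of an interval.
            amplicon = ref.fetch("chr8", 127_000_000, 128_000_000)
            write_contig(out, "amplicon_chr8_x50", amplicon * 50)

    # add_somatic_events("hg19.fa", "tumour_derived.fa")  # then simulate reads with ART

Simulating reads from the derived FASTA with ART then produces fusion-spanning read pairs, and coverage over the amplified interval scales with the number of tandem copies.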


I'm wondering whether http://shiny.wehi.edu.au/cameron.d/sv_benchmark/ is still available? I'm not able to see the results.


Unfortunately not. We do have a benchmarking paper with more comprehensive results coming out soon.

2
7.8 years ago
Joseph Hughes ★ 3.0k

Here is a recent paper that reviews different NGS read simulators. I think the decision tree figure is useful.

I had a related question and ended up using ART.

