Dear all,
I am currently trying to test a pipeline I am working on and would like to check its sensitivity. The pipeline revolves around detecting large-scale deletions. To validate it, I would like to use an artificial/simulated data set with factors that I can control.
I would like to make some read data (preferably in fastq format) from a reference sequence with a single large deletion and even coverage (except in the deleted region). However, I would like to be able to control the ratio of deleted reads to wild-type reads, e.g. 50:50, 25:75, 10:90, etc. I would also like to be able to customise the length and error profiles of these reads while still having the same mutational profile.
I am unsure how to go about this, though. I have tried using DWGsim, but I have been unable to customise the number of deletion-carrying reads for a deletion that I simulated using an input VCF alongside the reference sequence. Are there any tools people would recommend (and, if so, how could I use them specifically for this?), or ways in which I could use a tool like DWGsim to achieve this goal?
Many thanks in advance!
Naively, you could just run DWGsim twice, couldn't you? Run it once on the reference genome and once on a "new" reference genome with the deletion (which you introduce prior to simulation). Then, if you want 10% of reads from the deletion, just subsample the WT reads at 90% and the deleted-genome reads at 10% and concatenate the results.
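For instance, here is a minimal (untested) Python sketch of that approach. The deletion coordinates, file names, and read counts are all hypothetical placeholders, and the dwgsim flags shown in the comments (-N for the number of read pairs, -1/-2 for read lengths) as well as the output file names should be checked against `dwgsim -h` for your version:

```python
import gzip
import random

def read_fasta(path):
    """Parse a (possibly gzipped) FASTA into {name: sequence}."""
    opener = gzip.open if path.endswith(".gz") else open
    seqs, name, chunks = {}, None, []
    with opener(path, "rt") as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if name is not None:
                    seqs[name] = "".join(chunks)
                name, chunks = line[1:].split()[0], []
            else:
                chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs

def write_deleted_reference(fasta_in, fasta_out, chrom, del_start, del_end):
    """Write a copy of the reference with [del_start, del_end) removed from
    `chrom` (0-based, half-open) -- the 'deleted' genome for simulation."""
    seqs = read_fasta(fasta_in)
    seqs[chrom] = seqs[chrom][:del_start] + seqs[chrom][del_end:]
    with open(fasta_out, "w") as out:
        for name, seq in seqs.items():
            out.write(f">{name}\n")
            for i in range(0, len(seq), 60):
                out.write(seq[i:i + 60] + "\n")

def subsample_fastq(fastq_in, fastq_out, fraction, seed=42):
    """Keep each 4-line FASTQ record with probability `fraction`.
    For paired-end data, reuse the same seed for the read1 and read2
    files of a sample so that mates stay in sync."""
    rng = random.Random(seed)
    opener = gzip.open if fastq_in.endswith(".gz") else open
    with opener(fastq_in, "rt") as fh, open(fastq_out, "w") as out:
        while True:
            record = [fh.readline() for _ in range(4)]
            if not record[0]:
                break
            if rng.random() < fraction:
                out.writelines(record)

# Hypothetical deletion for illustration: 50 kb removed from chr1.
write_deleted_reference("ref.fa", "ref_del.fa", "chr1", 1_000_000, 1_050_000)

# Then simulate reads from each genome with identical settings, e.g.:
#   dwgsim -N 1000000 -1 150 -2 150 ref.fa     wt
#   dwgsim -N 1000000 -1 150 -2 150 ref_del.fa del
# (verify flags and the exact output file names with your dwgsim version)

# Mix at 90:10 and concatenate, e.g.:
subsample_fastq("wt.bwa.read1.fastq", "wt_90.read1.fastq", 0.90, seed=1)
subsample_fastq("del.bwa.read1.fastq", "del_10.read1.fastq", 0.10, seed=2)
# cat wt_90.read1.fastq del_10.read1.fastq > mix_10pct_del.read1.fastq
```

Because the two references are identical everywhere except the deletion, the mixing fraction is effectively the allele fraction of the deletion, and you can re-run the simulations with different read lengths or error settings against the same pair of references, which keeps the mutational profile fixed across conditions.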