Simulating Illumina data with spiked-in variants
10.5 years ago
Richard ▴ 590

Hi all,

I am looking to simulate some paired-end Illumina data for a test. Here is what I want to do, in order of importance (most important first):

  1. Create FASTQ files.
  2. Specify the exact SNPs to be present in the data.
  3. Control the allelic fractions of the spiked-in SNPs.
  4. Have an appropriate error model of Illumina sequencing.
  5. Have controllable metrics such as duplicate rate and chastity-fail rate.

There seem to be a number of tools available for simulating Illumina data. Does anyone know of one that can handle my requirements?

simulated-data illumina • 2.6k views
10.5 years ago

Aside from #5, any of the common simulators (wgsim, Sherman, etc.) can do that. For the SNPs, just make a second genome containing them, simulate reads from that genome as well as from the original, and then mix the two read sets in whatever fraction you'd like.
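
To make the "second genome, then mix" idea concrete, here is a minimal Python sketch. It is only an illustration under some assumptions: the function names `apply_snps` and `mix_reads` are made up for this example, positions are 0-based, and each element of the read lists stands for one read pair already produced by whichever simulator you ran on each genome. The fraction of pairs taken from the variant genome is then the expected allelic fraction at the spiked-in sites.

```python
import random


def apply_snps(sequence, snps):
    """Return a copy of `sequence` with each (0-based pos, ref, alt) SNP applied."""
    bases = list(sequence)
    for pos, ref, alt in snps:
        # Guard against off-by-one mistakes in the variant coordinates.
        if bases[pos].upper() != ref.upper():
            raise ValueError(
                f"expected {ref} at position {pos}, found {bases[pos]}"
            )
        bases[pos] = alt
    return "".join(bases)


def mix_reads(ref_pairs, alt_pairs, alt_fraction, total, seed=0):
    """Draw `total` read pairs, round(total * alt_fraction) of them from `alt_pairs`.

    Sampling is without replacement, so each input list must hold at
    least as many pairs as are requested from it.
    """
    rng = random.Random(seed)
    n_alt = round(total * alt_fraction)
    n_ref = total - n_alt
    mixed = rng.sample(alt_pairs, n_alt) + rng.sample(ref_pairs, n_ref)
    rng.shuffle(mixed)  # avoid all variant pairs sitting at the front
    return mixed


if __name__ == "__main__":
    # Toy data: a real run would read the genome from FASTA and the
    # pairs from the simulator's FASTQ output.
    variant_genome = apply_snps("ACGTACGT", [(2, "G", "T")])
    print(variant_genome)
```

Keeping mates together as one list element matters for paired data: if R1 and R2 were subsampled independently, the two output files would no longer be in sync.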


I checked both wgsim and Sherman and I didn't see a way to spike in specific variants (a given base change at a given position). Am I missing something?


Read the entirety of my answer; I mentioned the variants explicitly.

7.3 years ago
Gabriel R. ★ 2.9k

For 1, 2 (maybe 3), and 4: we developed a sequencing simulator for ancient DNA, gargammel (grenaud.github.io/gargammel/), but it can also be used to simulate modern DNA. For the PhiX, you can add it as a "microbial contaminant". It automates handling of the fragment size distribution, sampling from a diploid genome, and generation of Illumina-like FASTQ files. Just give it a "diploid" genome represented by two FASTA files.
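
As a rough illustration of what a "diploid" genome as two FASTA files could look like, the Python sketch below builds two haplotype sequences from one reference. Everything named here is an assumption for the example: `make_haplotypes` and `fasta_record` are hypothetical helpers, positions are 0-based, and heterozygous variants are arbitrarily placed on the first haplotype. Any file naming or directory layout the simulator expects is not covered, so check its documentation for that.

```python
def make_haplotypes(sequence, variants):
    """Build two haplotypes from (0-based pos, alt, genotype) variants.

    genotype "het" puts the alt base on haplotype 1 only (expected
    allelic fraction about 0.5); "hom" puts it on both (about 1.0).
    """
    hap1 = list(sequence)
    hap2 = list(sequence)
    for pos, alt, genotype in variants:
        if genotype not in ("het", "hom"):
            raise ValueError(f"unknown genotype: {genotype}")
        hap1[pos] = alt
        if genotype == "hom":
            hap2[pos] = alt
    return "".join(hap1), "".join(hap2)


def fasta_record(name, sequence, width=60):
    """Format one FASTA record with lines wrapped at `width` bases."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    hap1, hap2 = make_haplotypes(
        "ACGTACGT", [(1, "T", "het"), (4, "G", "hom")]
    )
    # Print the two records; in practice each would be written to its own file.
    print(fasta_record("chr_toy_hap1", hap1), end="")
    print(fasta_record("chr_toy_hap2", hap2), end="")
```

With only two haplotypes, a spiked-in SNP can sit at roughly 50% or 100%; other allelic fractions would need a different approach, such as mixing reads from separate simulations.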
