How To Produce Simulated 'Synthetic' Sequences
4
24
Entering edit mode
14.4 years ago

Hi.

I would like to find a program that produces simulated (Illumina GAII) reads starting from a fasta file of the genome and a series of parameters so that I can then test a method I am developing. The features (in descending order) are:

  • Uses a fasta file as input (or something easily produced from a fasta file)
  • Outputs random reads (or simulate as much as possible any known bias of GAII)
  • Is written in (Bio)Perl, Python, ISO C/C++ or fully open source platforms.
  • Produces errors (like GAII) at known, ideally tunable, rate
  • Allows me to specify depth of coverage and length of reads
  • Produces qseq or similar as output
  • Allows to produce paired end reads

Often methods papers have some analysis of 'synthetic' or simulated data, but they usually don't bother to publish the program to produce such data. I guess I could write a "quick and dirty" program to do it, but I'd rather not reinvent the wheel if there is something available

next-gen sequencing simulation model • 12k views
ADD COMMENT
11
Entering edit mode
14.4 years ago

In addition to MetaSim, samtools has wgsim located in the misc directory:

http://samtools.sourceforge.net/

http://bioinformatics.bc.edu/chuanglab/wiki/index.php/How_to_use_SAMtools

It's open source C, and has tunable parameters for error rates, read pair distribution and number of reads generated.

ADD COMMENT
0
Entering edit mode

Thanks. That is just what I was looking for, and I already had it on my computer!

ADD REPLY
8
Entering edit mode
14.4 years ago
Michael 55k

Try MetaSim. That adresses most of your features (except it's written in Java, which is maybe a plus).

ADD COMMENT
7
Entering edit mode
14.4 years ago

The author of BFAST has a read simulation program in the dnaa toolkit. A few enhancements over wgsim from samtools.

ADD COMMENT
2
Entering edit mode
14.4 years ago
User 59 13k

You might want to look at shuffleseq in EMBOSS.

This will allow you to specify an input fasta file and maintain nucleotide composition at least. You can then shred this into simulated 'reads' which should be relatively straightforward.

You might also be able to use msbar to introduce mutations, but whether they're platform appropriate or not is another question.

EDIT:

Why can't you just use an actual data set? Looking at the Abyss paper, they used experimental data :

sequence data for the genome of an African male individual (HapMap DNA identifier NA18507) (International HapMap Consortium 2003, 2007) from the NCBI short read archive (accession no. SRA000271). The sequence was generated by Illumina, Inc. using their Genome Analyzer platform (Bentley et al. 2008).

as well as synthetic data generated like this:

The first synthetic data set represented all possible error-free 36-mer paired sequences, using a fixed fragment size of 200 bp. We generated simulated reads by sliding a 200 bp window, with a step size of 1 bp, along each chromosome of the reference genome and reporting the first 36 bp and the reverse complement of the last 36 bp. This process produced a data set of perfectly tiled 72-fold read coverage of the reference genome.

ADD COMMENT
0
Entering edit mode

I am afraid this is not what I am looking for. I need that from a series of fasta sequences (the human chromosomes) produces little bits (36-72 Bp long) as if it was "sequencing" the genome

ADD REPLY
0
Entering edit mode

In that case you don't want it to output 'random' reads :)

ADD REPLY
0
Entering edit mode

@Daniel. I can't use real data because I cannot change the underlying genome. I want to make, for instance, a duplication of a chromosome arm, then simulate "random reads" from that genome (with associated noise) and then use the program I am developing to see if I can pick up the duplication and with which resolution. getting every 200 bp is no good, because then it is trivial (and not realistic)

ADD REPLY
0
Entering edit mode

I found this tool which sounds similar to what you are looking for: bamsurgeon. Havent tried it yet though

ADD REPLY

Login before adding your answer.

Traffic: 2590 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6