What NGS Read Simulators Are Available For Paired-End Data?
14.3 years ago

Hi all, I need to create simulated paired-end sequence data with fixed read-lengths on each end (e.g., 75mers on each end of a 500bp DNA fragment, a la Illumina). Does anyone know of a reliable simulator that can generate paired-end sequences to a requested depth, with a requested insert size/variance and error rate, for a requested genome in a FASTA file? The output would preferably be two FASTQ files, one for each end.

I can write my own, but do not want to re-invent this boring (though useful) wheel. Any clues?

next-gen sequencing fastq simulation paired • 22k views

See also the following thread discussing read simulation with quality scores: http://bit.ly/kNePbA

14.3 years ago

samtools wgsim does most of what you request:

Usage:   wgsim [options] <in.ref.fa> <out.read1.fq> <out.read2.fq>

Options: -e FLOAT      base error rate [0.020]
         -d INT        outer distance between the two ends [500]
         -s INT        standard deviation [50]
         -N INT        number of read pairs [1000000]
         -1 INT        length of the first read [70]
         -2 INT        length of the second read [70]
         -r FLOAT      rate of mutations [0.0010]
         -R FLOAT      fraction of indels [0.10]
         -X FLOAT      probability an indel is extended [0.30]
         -c            generate reads in color space (SOLiD reads)
         -C            show mismatch info in comment rather than read name
         -h            haplotype mode

Note: For SOLiD reads, the first read is F3 and the second is R3.
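
For the request in the question (75-mers from each end of a roughly 500 bp fragment), an invocation built only from the options listed above might look like the sketch below. The file names ref.fa, read1.fq, and read2.fq are placeholders, and the guard around the call is an addition so the snippet exits quietly where wgsim or the reference is missing.

```shell
#!/bin/sh
# Placeholder file name; substitute your own reference FASTA.
ref=ref.fa
read_len=75      # length of each end (-1 and -2)
outer_dist=500   # outer distance between the two ends (-d)
sd=50            # standard deviation of that distance (-s)
pairs=1000000    # number of read pairs (-N)

if command -v wgsim >/dev/null 2>&1 && [ -f "$ref" ]; then
    # Two FASTQ files are written, one per end.
    wgsim -e 0.02 -d "$outer_dist" -s "$sd" -N "$pairs" \
          -1 "$read_len" -2 "$read_len" \
          "$ref" read1.fq read2.fq
else
    echo "wgsim or $ref not found; nothing simulated" >&2
fi
```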

Perfect. I hadn't looked in the misc/ directory in a while and I never saw documentation for this. Thanks Keith!

14.3 years ago

MetaSim may be a good option. It has platform-specific error modeling, which makes it suited to generating realistic input data rather than "perfectly" random reads.


Another solid choice, thank you.

14.1 years ago
Jorjial ▴ 300

You can also try dwgsim. This is a fork of the SAMtools wgsim and its creator is Nils Homer.

Usage:   dwgsim [options] <in.ref.fa> <out.bwa.read1.fq> <out.bwa.read2.fq> <out.bfast.fq>

Options: -e FLOAT      base error rate [0.020]
         -E FILE       base/color error rate file
         -d INT        outer distance between the two ends [500]
         -s INT        standard deviation [50]
         -N INT        number of read pairs [1000000]
         -1 INT        length of the first read [70]
         -2 INT        length of the second read [70]
         -r FLOAT      rate of mutations [0.0010]
         -R FLOAT      fraction of indels [0.10]
         -X FLOAT      probability an indel is extended [0.30]
         -n INT        maximum number of Ns allowed in a given read [0]
         -c            generate reads in color space (SOLiD reads)
         -h            haplotype mode

In my experience, dwgsim is much better than its predecessor wgsim. It has some nice features and seems to be maintained, whereas wgsim's last commit was years ago.

14.1 years ago
Ketil 4.1k

Note the difference between Illumina's paired ends (just reading from each end of a clone) and circularized clones (mate pairs), which give longer inserts but reads in different directions, and probably more artifacts such as chimeras.

(BTW, I've written a simulator for 454 data (flowsim); feel free to contact me if you're interested in seeing this extended to paired-end, or rather mate-paired, sequences.)

9.4 years ago
sacha ★ 2.4k

I don't understand how you set the depth with wgsim?


Via the read lengths, the number of read pairs (-N), and the length of the input sequence: depth is roughly N × (length of read 1 + length of read 2) / reference length.
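
As a rough sketch of that arithmetic, the snippet below solves for the number of read pairs that gives a target depth. The 5 Mbp reference length and the 30x depth are made-up example values, not figures from this thread.

```shell
#!/bin/sh
# Hypothetical inputs: a 5 Mbp reference and a 30x target depth.
# The total base count of a FASTA can be obtained with, for example:
#   grep -v '^>' ref.fa | tr -d '\n' | wc -c
genome_len=5000000
depth=30
len1=75   # length of the first read (-1)
len2=75   # length of the second read (-2)

# depth ~= pairs * (len1 + len2) / genome_len, solved for pairs
# (integer division, so the result is rounded down)
pairs=$(( depth * genome_len / (len1 + len2) ))
echo "$pairs"   # value to pass to wgsim -N; prints 1000000 here
```

This ignores mutations, indels, and any reads that overlap within a pair, so treat the result as an approximation of the mean depth.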

9.4 years ago

RandomReads, in the BBMap package, supports paired-ends. For example:

randomreads.sh ref=ref.fa out=reads.fq paired interleaved reads=100k length=150 mininsert=200 maxinsert=400 gaussian

I have started to have the feeling that everything is implemented in the BBMap package :-)


That's my ultimate goal... haven't quite reached it yet!


Hi Brian! Is it possible to generate reads in specific intervals? WES-like read simulation?


No, unfortunately not. You'd have to use something like bedtools to pull out the exome fasta using the genome fasta and the bait coordinates, and then use RandomReads on the result. I don't currently have anything to parse bed, but that does seem like a good addition to RandomReads.
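
The two-step workaround described above could be sketched as follows. The file names genome.fa, baits.bed, and targets.fa are placeholders, the RandomReads options are simply copied from the example earlier in this thread, and the guard is an addition so the snippet exits quietly where the tools or inputs are missing.

```shell
#!/bin/sh
# Placeholder file names: reference genome, capture intervals, extracted targets.
genome=genome.fa
baits=baits.bed
targets=targets.fa

if command -v bedtools >/dev/null 2>&1 \
   && command -v randomreads.sh >/dev/null 2>&1 \
   && [ -f "$genome" ] && [ -f "$baits" ]; then
    # Step 1: write the sequence of each BED interval as its own FASTA record.
    bedtools getfasta -fi "$genome" -bed "$baits" -fo "$targets"
    # Step 2: simulate read pairs from the extracted targets only.
    randomreads.sh ref="$targets" out=reads.fq paired interleaved \
        reads=100k length=150 mininsert=200 maxinsert=400 gaussian
else
    echo "bedtools, randomreads.sh, or input files not found; nothing simulated" >&2
fi
```

One caveat, stated as an assumption rather than tested behavior: because each interval becomes a separate reference record, simulated fragments presumably cannot extend past the interval ends, so short baits may need padding first (for example with bedtools slop) to resemble real capture data.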


Thank you for the fast answer. I'll try the bedtools pre-step. Another issue: I've realised that in PE mode the names of the output reads in the two files are not paired. Is there any option for this?


Yes - add the flag "illuminanames".


Is it possible to generate RNA-seq reads with BBMap?
