I like ART for NGS. It's comprehensive and one can easily generate custom profiles from a given fastq samples, e-g- MiSeq 300bp or brand new HiSeq 250bp
Default pbsim reads come with a little more errors and shorter reads than what in my experience you get with real-life pacbio data. For convenience I use the script below to directly generate PacBio reads from a reference FASTA with length and error distributions closer to what you get from current P4-C2/P5-C3 runs. (Just put the code in a file, e.g. pbsim-clr, next to the pbsim binary and make it executable)
#!/bin/bash
BIN="$(dirname "$(readlink -f "$0")")";
COV=10;
[[ $1 == '-c' ]] && { COV=$2; shift; shift; };
[[ $# -eq 0 || $1 == "-h"* ]] && { echo -e "Usage:\n pbsim-clr [-c cov] FASTA [FASTA ..]"; exit 0; }
for FILE in $@; do
PRE=`basename ${FILE%.*}-pb`;
$BIN/pbsim \
--data-type CLR \
--depth $COV \
--length-max 50000 \
--length-min 1000 \
--length-mean 7000 \
--difference-ratio 8:62:30 \
--accuracy-mean 0.83 \
--model_qc $BIN/data/model_qc_clr \
$FILE
rm sd_0001.ref
I=1;
OUT=$PRE-`printf %02d $I`
while [[ -e $OUT.fq ]]; do
I=$(( $I + 1 ));
OUT=$PRE-`printf %02d $I`;
done;
mv sd_0001.fastq $OUT.fq
mv sd_0001.maf $OUT.maf
done;
The DAZZLER simulater only produces random sequences with PacBio length profile.
[edit] Edited to prevent any confusion about dazzler - see comments.
here is how it was written in the wiki:
Whole genome simulation can be performed with dwgsim. dwgsim is based off of wgsim found in SAMtools written by Heng Li, and forked from DNAA. It was modified to handle ABI SOLiD and Ion Torrent data, as well as various assumptions about aligners and positions of indels. Many new features have been subsequently added.
RandomReads
in the BBMap package has a fairly good PacBio mode. For example:
randomreads.sh ref=reference.fa out=pb.fa reads=10000 minlength=200 maxlength=15000 pacbio=true pbmin=0.13 pbmax=0.17
...will generate 10k reads of length 200 to 15000, with error rates between 0.13 and 0.17, split between insertions, deletions, and substitutions in a ratio and with lengths in accordance with our PacBio lab data.
How can I control the coverage? or reads=10000
is that means that I will have 10k reads covering all the genome so I need to multiply it by the mean read "200+15000/2" then divided by the genome size and have the coverage? is it CLR or CCS?
Meanwhile I tried to use the program with chromosome 3 in rice like this:
bbmap6/randomreads.sh -Xmx10g \
ref=/data/genomes_reference/rice/Oryza_sativa.IRGSP-1.0.27.dna.chromosome.3.fa \
out=/data/genomes_artificial/pacbio/rice/rice_pacbio.fa \
reads=95826 minlength=200 maxlength=15000 \
pacbio=true pbmin=0.13 pbmax=0.17
in the info file:
#chrom scaffolds contigs length defined undefined startPad stopPad
1 1 12 36429819 36404619 25200 8000 8000
Why I have 12 scaffolds? also in the rice_pacbio.fa
file I have the reads by number not by their genomic origin?!!! and it is from 0 to 95824 "which is equal to the number of reads reads=95826"
reads=X
will output X reads (unless you are outputting paired reads, which you aren't).
So yes, the coverage is (reads*(minlength+maxlength)/2)/(genome size)
These reads will be similar to filtered subreads. If you want something closer to ccs, then decrease pbmin
and pbmx
, which control the bounds of the error rates for individual reads. 3-pass ccs reads might be closer to pbmin=0.4 pbmax=0.7
.
As for the info file, you are looking at the wrong column - there is 1 scaffold which has 12 contigs. The number of contigs is not really important.
The read names are a little bit unclear, but... here's a sample read name:
0_chr1_0_407422_407571_399422_ecoli
To break that down, it is:
A_B_C_D_E_F_G
where:
0
: this is the read numberchr1
: This is for the internal coordinate system and you can ignore it0
: This is the strand. 0
means it came from the plus strand, 1
means the minus strand.407422
: This is the internal start coordinate407571
: This is the internal stop coordinate399422
: This is scaffold-relative start coordinate, zero-based (corresponding to pos=399423
in sam format)ecoli
: This is the scaffold name.This is confusing so I will add a new naming mode that is just ID_strand_start_stop_name
, e.g. 0_+_399422_399571_ecoli
, which will be easier to use.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Is "The reads must not exceed a maximum length of 36 bp?" about DAZZLER I think it uses static parameter to generate reads
There is definitely no length limit on read length in the latest version ART-VanillaIceCream-03-11-2014. However, I think default Illumina profiles are only available for 36-150 bp. For longer profiles, you will need to have/download an appropriate read set and profile the FASTQ file first.
About DAZZLER, you are right, it's a rather simple simulator, but it found it to be sufficient for my purposes (testing pacbio alignments / pacbio correction)
About DAZZLER you do not use the reference genome am I right? you just use the size of it
Sorry for the confusion. Yes, dazzler is just random sequence. I edited my original answer. Thanks for making me aware