7.8 years ago
pmarijon
What's a good PacBio CLR read simulator?
We could not use:
- SiLiCO (source): doesn't generate quality values
- FastqSim (source): has a bug where it outputs spurious A's and C's after 6 kbp
- SimLoRD (source): outputs only CCS reads
- pbsim (source): does not compile
- readsim (source): no control over quality values; we tried to assemble 60x E. coli simulated reads and obtained very fragmented assemblies with Canu 1.4
We were able to use:
- LongISLND (source): works, but is quite heavyweight. It requires installing SMRTAnalysis (!) to generate model files. Model files aren't provided (this wasn't clear in the documentation), and it took over an hour to generate them myself. Helpful automated scripts were provided, though.
- BBMap's RandomReads (source): seems to work and is easy to install (thanks to Brian Bushnell)
- NPBSS (source; MATLAB/Octave): seems to work, but may not support multi-line FASTA.
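If a simulator really does choke on multi-line FASTA, one workaround is to unwrap the reference before feeding it in. The following is a minimal sketch of that preprocessing step, not part of NPBSS or any tool listed above; the function names `unwrap_fasta` and `unwrap_fasta_file` are made up for this illustration.

```python
from typing import Iterable, Iterator


def unwrap_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with each record's sequence joined onto one line.

    Blank lines are skipped; sequence lines that appear before the first
    header are dropped, since they belong to no record.
    """
    seq_parts = []
    have_record = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if have_record:
                # Flush the previous record's sequence as a single line.
                yield "".join(seq_parts)
            yield line
            seq_parts = []
            have_record = True
        elif have_record:
            seq_parts.append(line)
    if have_record:
        yield "".join(seq_parts)


def unwrap_fasta_file(src_path: str, dst_path: str) -> None:
    """Write a single-line-per-sequence copy of src_path to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in unwrap_fasta(src):
            dst.write(line + "\n")
```

Calling `unwrap_fasta_file("ref.fasta", "ref.unwrapped.fasta")` (hypothetical file names) would produce a copy in which every record is exactly two lines: header and sequence.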
Do you know of any other PacBio CLR read simulators?
Edit: added BBMap's RandomReads and NPBSS
I think you have a good list. In my case I go for pbsim; maybe you should post the error here so someone can help you.
As you can see in this compile log, it's a linking problem. I think the build system forgets some files.
Did you read this or try the suggestions?
You can give configure initial values for configuration parameters by setting variables in the command line or in the environment. Here is an example:
./configure CC=c99 CFLAGS=-g LIBS=-lposix
This issue explains why the build system is broken and gives an alternative way to build pbsim. Thanks.
http://www.nature.com/nrg/journal/v17/n8/full/nrg.2016.57.html
If you cannot access the paper, use sci-hub or gen-lib.
Thanks,
I read this publication. I didn't test EAGLE, but it has trouble with Boost versions newer than 1.56.
Not sure what the intended use case is, but do you want to simulate reads from a specific genome? Otherwise, enough original PacBio data is available now. PacBio makes several sets available here.
It is useful for machine learning. I am currently trying to find a single working CLR simulator so that I can control the variants when training a deep-learning-based variant caller. There is plenty of real CLR data available, but few places offer high-quality variant calls for public download to train on. Until there is a "truth" variant set like https://jimb.stanford.edu/giab for prokaryotic genomes, simulators are the next best thing.
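To make the "controlled variants" idea concrete, here is a minimal sketch of how one might plant known SNVs in a reference before running any read simulator on it, keeping the planted positions as the truth set. This is an assumption about the workflow, not the method of any tool in this thread; `inject_snvs`, its parameters, and the 0-based coordinate convention are all choices made for this illustration, and it only handles uppercase A/C/G/T substitutions (no indels).

```python
import random
from typing import List, Tuple

BASES = "ACGT"


def inject_snvs(
    reference: str, rate: float, seed: int = 0
) -> Tuple[str, List[Tuple[int, str, str]]]:
    """Return (mutated_sequence, truth), where truth lists (pos, ref, alt).

    Positions are 0-based. Each A/C/G/T base is substituted independently
    with probability `rate`; other characters (e.g. N) are left untouched.
    """
    rng = random.Random(seed)  # fixed seed keeps the truth set reproducible
    mutated = list(reference)
    truth = []
    for pos, ref_base in enumerate(reference):
        if ref_base not in BASES:
            continue
        if rng.random() < rate:
            # Pick an alternate base that differs from the reference base.
            alt = rng.choice([b for b in BASES if b != ref_base])
            mutated[pos] = alt
            truth.append((pos, ref_base, alt))
    return "".join(mutated), truth
```

The mutated sequence would then be written out as FASTA and passed to whichever simulator works, while the `truth` list serves as the labels for training or evaluating the variant caller.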
troysincomb: the Genome in a Bottle (LINK) project has several well-characterized datasets available. Some are PacBio, so check them out.