Question

Pacbio Quiver Consensus - How To Use?

1

11.6 years ago

darxsys ▴ 240

I'm working on a project in which I need to simulate mapping short reads to long reads of a genome. I came across this page: https://github.com/PacificBiosciences/GenomicConsensus/blob/master/doc/HowToQuiver.rst which offers software for the consensus phase of the mapping. However, since I'm doing this for the first time, I don't know how to use it. I see that the program wants a cmp.h5 file as input, but how can I generate a file like that? Which tools produce such files? I know this is a special format originating from PacBio, but how can I produce one?

For example, I have a whole E. coli genome. I simulate sequencing it with PBSim to produce very short (100 bp) and very long (10k bp) reads in FASTQ format. Now I would like to map the short reads to each long read, and I need consensus software for that. Actually, I don't even know which software to use for the first phase (before consensus), the one that would, I assume, output the cmp.h5 file needed by Quiver. Any help appreciated.

consensus • 13k views

ADD COMMENT • link updated 11.6 years ago by mchaisso ▴ 160 • written 11.6 years ago by darxsys ▴ 240

0

It would help if you explained the purpose of the exercise. Are you trying to error-correct the long reads, as is done with pre-assembly in the HGAP https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP or PacBioToCA http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=PacBioToCA pipelines?

ADD REPLY • link 11.6 years ago by lexnederbragt ★ 1.3k

score 1 · Answer 1 · 2013-06-02

1

11.6 years ago

Tky ★ 1.0k

Please take a look at Allora from the SMRT Analysis tools:

Allora, short for "a long read assembler," is PacBio's de novo assembly algorithm. Based on the open source assembly software package AMOS as well as other components tailored to PacBio’s long reads and error profile, Allora uses an overlap-layout-consensus approach to iteratively assemble raw reads into contigs and then outputs them as Fasta sequence and cmp.h5 files.

ADD COMMENT • link 11.6 years ago by Tky ★ 1.0k

0

Thanks for the help. However, I can't seem to find any link to download Allora or the SMRT Analysis tools.

ADD REPLY • link 11.6 years ago by darxsys ▴ 240

0

Take a look at the download section of the following page: http://pacbiodevnet.com/

ADD REPLY • link 11.6 years ago by Tky ★ 1.0k

0

The Quiver documentation says that the input should be aligned reads in .cmp.h5 or .bam format. Aligned to what? I had some trouble running pbalign. Do you think it would be possible to use a different mapper?

ADD REPLY • link 8.8 years ago by kamiljaron ▴ 230

0

Maybe it is too late to add a reply here, but I also had a lot of trouble understanding this, so if someone is still struggling, maybe they will find some hope in my answer!

Before using Quiver, you should produce a cmp.h5 file, which corresponds to an alignment of your PacBio reads against a reference (your genome assembly, for example).

Here is something you can try: pbalign --forQuiver your_movie.bas.h5 your_reference.fasta out.cmp.h5

I think that if you have several bas.h5 files, you can provide a fofn ("file of file names") instead, which is a text file containing the paths of all your bas.h5 files.
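A fofn is just a plain text file with one path per line, so it can be generated with a few lines of Python. This is only a minimal sketch: the helper name write_fofn, the directory name, and the output file name are made up for illustration, and whether pbalign accepts the result in place of a single bas.h5 should be checked against your installed version.

```python
from pathlib import Path


def write_fofn(input_dir, fofn_path, pattern="*.bas.h5"):
    """Write a fofn: one absolute path per line for each matching file.

    write_fofn is a hypothetical helper, not part of any PacBio tool.
    """
    paths = sorted(Path(input_dir).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {input_dir}")
    lines = [str(p.resolve()) for p in paths]
    # One path per line, with a trailing newline at the end of the file.
    Path(fofn_path).write_text("\n".join(lines) + "\n")
    return lines
```

For example, write_fofn("raw_data", "input.fofn") would list every *.bas.h5 file found in the (assumed) raw_data directory.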

I think that other mappers won't be able to read the PacBio-specific bax.h5/bas.h5 format. I heard that they want to get rid of this strange format: https://github.com/PacificBiosciences/GenomicConsensus/blob/master/doc/HowTo.rst

At this link they say:

(...) This is inefficient and users attempting to do this have run into many problems with the instability of the HDF5 library (which PacBio is moving away from, in favor of BAM.)

Maybe one day things are going to be easier! But for now you have to start with pbalign, and then you can use Quiver!
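The two steps, pbalign and then Quiver, could be scripted roughly as follows. This is a sketch under assumptions: the pbalign arguments are the ones given in this reply, while the quiver options (-j, -r, -o) are taken from memory of the GenomicConsensus HowTo and should be confirmed with quiver --help. The function name build_pipeline and all file names are placeholders.

```python
import shlex


def build_pipeline(reads, reference, cmp_h5="out.cmp.h5",
                   variants="variants.gff", consensus="consensus.fasta",
                   threads=8):
    """Return the two commands as argument lists (hypothetical helper)."""
    # Step 1: align the raw reads to the reference, producing a cmp.h5.
    align = ["pbalign", "--forQuiver", reads, reference, cmp_h5]
    # Step 2: call the consensus from the alignment.
    # Assumed flags: -j workers, -r reference, -o output (repeatable).
    polish = ["quiver", "-j", str(threads), cmp_h5,
              "-r", reference, "-o", variants, "-o", consensus]
    return [align, polish]


def format_commands(commands):
    """Render the commands as shell-quoted strings for inspection."""
    return [shlex.join(cmd) for cmd in commands]
```

Printing format_commands(build_pipeline("your_movie.bas.h5", "your_reference.fasta")) shows what would be run; each list can then be passed to subprocess.run(cmd, check=True) once the tools are installed.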

ADD REPLY • link 8.4 years ago by Rox ★ 1.4k

score 1 · Answer 2 · 2013-06-04

Some notes: Quiver is typically used at the end of an assembly, after overlaying the reads back on the assembly with an alignment. If you are looking into ways to do hybrid assembly, consider PacBioToCA http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=PacBioToCA. If you are only doing bacterial assembly, I wouldn't pursue this too much, as most prokaryote genomes assemble into one or a few contigs without short-read error correction: http://www.cbcb.umd.edu/software/PBcR/closure/report.log.krona.html .

Also, Quiver uses all of the quality values (InsertionQV, DeletionQV, SubstitutionQV, and MergeQV) stored in the bas.h5 files in order to have optimal consensus calling. PBSim only generates FASTQ. While I believe it is possible to use Quiver on this data, the results will be inferior to using the real data. There is a read simulator called "alchemy" that is tucked in with the blasr distribution on github (under the subdir 'simulator') that simulates all of these quality values, but it needs real data to train an error model on. Also, I've never tested the output of alchemy as input for Quiver, so I can't vouch that it works.

-mark

score 0 · Answer 3 · 2013-06-01

I don't use PacBio, so this isn't going to be a complete answer for you, but from the file that you link to, it seems that the cmp.h5 file is generated by the PacBio base calling software. However, the file is in HDF5 format, which is an open format. You could check out this page for a specification of the cmp.h5 file format and possibly make the file yourself.
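Since cmp.h5 is an ordinary HDF5 container, a quick sanity check is to look for the 8-byte HDF5 signature at the start of the file. The sketch below uses only the standard library; the function name looks_like_hdf5 is made up, and it checks offset 0 only (HDF5 also allows the signature at offsets 512, 1024, and so on when a user block is present).

```python
# The fixed 8-byte signature that begins an HDF5 file.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path):
    """Return True if the file starts with the HDF5 signature.

    Hypothetical helper; only offset 0 is checked, so files with a
    user block would be reported as False.
    """
    with open(path, "rb") as handle:
        return handle.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE
```

A file that passes this check can then be explored with generic HDF5 tools rather than PacBio-specific software.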