Question

Fishing out specific sequences from large PacBio bax.h5 files

7.9 years ago

roblogan6 ▴ 50

I have 125,000 individual reads from PacBio in fasta format, processed from bax.h5 files. I have clustered these reads based on unique molecular identifiers. I would now like to align these individual reads per cluster to a reference genome using the PacBio SMRT portal module blasr.

I am interested in using the bax.h5 information rather than simply the fasta files for the alignment. Is there any way that I can use the fasta headers as a whitelist to fish the associated read information out of the large bax.h5 files?

When I use ConsensusTools to generate a Long Amplicon Analysis, for example, there are command line options that take a "file of file names" and then pull the information for a whitelist of reads. There are no such options for blasr, but I wonder if there is a way to do it beforehand? How can I use only a small, defined subset of reads from the large bax.h5 files for blasr? Thanks for any help or suggestions.
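For reference, the whitelist described above could be built from the fasta headers roughly as follows. This is only a sketch: it assumes the headers still follow the usual PacBio read naming, `movieName/holeNumber/start_end`, and have not been rewritten during clustering. The function names and the example movie names are made up for illustration.

```python
from collections import defaultdict


def parse_read_name(header):
    """Return (movie, hole_number) from a header such as '>movie/123/0_500'.

    Assumes PacBio-style read names; only the first two '/'-separated
    fields are used, so the trailing 'start_end' part is ignored.
    """
    name = header.lstrip(">").split()[0]  # drop '>' and any description text
    fields = name.split("/")
    return fields[0], int(fields[1])


def whitelist_from_fasta(lines):
    """Map each movie name to the set of hole numbers seen in the headers."""
    whitelist = defaultdict(set)
    for line in lines:
        if line.startswith(">"):
            movie, hole = parse_read_name(line)
            whitelist[movie].add(hole)
    return dict(whitelist)


# Hypothetical headers; with a real file, pass an open file handle instead.
example = [
    ">movieA/10/0_500 RQ=0.85",
    "ACGT",
    ">movieA/10/550_900",
    "ACGT",
    ">movieB/7/0_100",
    "ACGT",
]
whitelist = whitelist_from_fasta(example)
```

Grouping by movie name is a deliberate choice here, since each set of bax.h5 files belongs to one movie and the hole numbers are only unique within it.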

PacBio alignment blasr next-gen • 2.1k views

score 0 · Answer 1 · 2017-02-08

I had sent an email to PacBio technical support about this and got the following response, for those who might be having the same problem:

My name is Roberto Lleras, Bioinformatics FAS Manager at PacBio. I'd be happy to answer your question. To go through the bax.h5 files manually and select specific reads for BLASR, you would need to use the pbcore.io Python library and write custom scripts that create new H5 files containing only your filtered reads. Information on the functions of pbcore.io can be found here: http://pacificbiosciences.github.io/pbcore/pbcore.io.html#bas-h5-bax-h5-formats-pacbio-basecalls-file
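As a rough illustration of the read-selection half of that suggestion (not part of the PacBio reply), the sketch below walks a basecalls file and keeps only the subreads whose names are in a whitelist. It assumes that `pbcore.io.BasH5Reader` can be iterated to yield ZMWs, that each ZMW has a `subreads` list, and that each subread has a `readName`; check this against the pbcore documentation linked above for your version. It does not write a new H5 file, which pbcore does not appear to offer a ready-made writer for. `select_subreads` and `open_bas_h5` are hypothetical helper names.

```python
def select_subreads(reader, wanted_names):
    """Yield subreads from `reader` whose readName is in `wanted_names`.

    `reader` is expected to behave like pbcore.io.BasH5Reader: iterating
    it yields ZMW objects with a `subreads` list, and each subread has a
    `readName` of the form 'movie/hole/start_end'.
    """
    wanted = set(wanted_names)
    for zmw in reader:
        for subread in zmw.subreads:
            if subread.readName in wanted:
                yield subread


def open_bas_h5(path):
    """Open a bas.h5/bax.h5 file with pbcore (imported only when needed)."""
    from pbcore.io import BasH5Reader  # requires pbcore to be installed
    return BasH5Reader(path)


# Intended use (path and read name are placeholders):
#   reader = open_bas_h5("my_movie.1.bax.h5")
#   for subread in select_subreads(reader, {"my_movie/10/0_500"}):
#       print(subread.readName)
```

This scans every ZMW once per file, which is simple but not fast; looking ZMWs up by hole number would avoid the full scan.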

Alternatively, you can align everything and then filter poor alignments with the cmph5tools.py software included with SMRTAnalysis. Information on filtering datasets with cmph5tools.py can be found here: https://github.com/PacificBiosciences/pbh5tools/blob/master/doc/cmph5tools-examples.rst