I recently started working on my first project with PacBio sequencing data, and after two days of fruitless googling, missing libraries, and failed C compilations, I decided it was time to ask for help.
The information on tools and pipelines for PacBio data is scattered, and most of it seems badly out of date. I was hoping some of you could help me get started, and that other lost souls will find this post in the future and save themselves some time and frustration.
The data:
I have a set of Primary Analysis Data available, as explained in the first paragraph here.
The tools:
After some struggle, I managed to compile the latest version of blasr and pbalign on Ubuntu 16.04 using pitchfork. I also managed to install the R packages h5r, pbh5, and seqPatch. If anyone is reading this and has trouble with R and HDF5 libraries under Ubuntu, see my question here.
What I don't have:
I don't have access to a server with the SMRT Link platform.
What I would like to do:
I would like to access the interpulse duration (IPD), as explained in this white paper, and preferably access this data in R. I am open to suggestions for other tools/programming languages as well though.
Problem:
I need cmp.h5 files to load the IPDs from the R packages. How do I generate those? When I try to run pbalign, it says "pbalign no longer supports CMP.H5 Output in 3.0". Is there any other way to get to the IPDs without going through cmp.h5 files?
Thank you very much for your time.
(Edit note: I realized my questions were too broad and specified a single problem to start with instead.)
PacBio has a wiki dedicated to training for PacBio data, in case you have not discovered it yet.
Here is technical information about the h5 format that PacBio uses and the tools they provide.
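If all you need are the raw (unaligned) IPDs, it may be possible to read them straight from the bax.h5 files without going through cmp.h5. Below is a minimal Python sketch, not a tested pipeline. It assumes the per-base IPDs are stored in frames under PulseData/BaseCalls/PreBaseFrames, with the per-ZMW base counts and hole numbers under PulseData/BaseCalls/ZMW/NumEvent and PulseData/BaseCalls/ZMW/HoleNumber. Those paths are my recollection of the bax.h5 layout, so please verify them against your own file (for example with h5ls) before relying on this. The function names split_by_zmw and read_ipds are made up for illustration, and h5py is a third-party package you would need to install.

```python
from itertools import accumulate


def split_by_zmw(flat_values, num_events):
    """Split a flat per-base array into one list per ZMW.

    flat_values: per-base values for all ZMWs, concatenated in file order.
    num_events:  number of bases called in each ZMW, in the same order.
    """
    ends = list(accumulate(num_events))
    total = ends[-1] if ends else 0
    if total != len(flat_values):
        raise ValueError("event counts do not sum to the number of values")
    starts = [0] + ends[:-1]
    return [list(flat_values[s:e]) for s, e in zip(starts, ends)]


def read_ipds(bax_path):
    """Return {hole_number: [IPD in frames, ...]} for one bax.h5 file.

    ASSUMPTION: the dataset paths below are what I remember of the
    bax.h5 layout; check them against your own file before use.
    """
    import h5py  # third-party; imported here so split_by_zmw works without it

    with h5py.File(bax_path, "r") as f:
        calls = f["PulseData/BaseCalls"]
        ipds = calls["PreBaseFrames"][:]
        num_events = calls["ZMW/NumEvent"][:]
        holes = calls["ZMW/HoleNumber"][:]

    per_zmw = split_by_zmw(ipds, num_events)
    return {int(h): [int(v) for v in vals] for h, vals in zip(holes, per_zmw)}
```

Calling read_ipds on one of your bax.h5 files (the path is yours to fill in) should give a dictionary keyed by ZMW hole number. Note that these values are per read position, not per reference position, so anything that compares IPDs at a given genomic site would still need an alignment step.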
I did see the training, but it was not very helpful, since most of it is tailored to PacBio's SMRT Portal/SMRT Link platform (what's the difference, anyway?). I will have another look though, since I also missed the Python script you mentioned. Thanks a lot for pointing that out!