Question

PacBio interpulse duration (IPD) data

5

6.7 years ago

bgbrink ▴ 60

I started working on my first project with PacBio sequencing data, and after two days of fruitless googling, missing libraries, and failed C compilations, I decided it's time to ask for help.

The information on tools and pipelines for PacBio data is scattered everywhere and most of it seems horribly out of date. I was hoping some of you could help me to get started and other lost souls in the future will hopefully find this post and save some time and frustration.

The data:

I have a set of Primary Analysis Data available, as explained in the first paragraph here.

The tools:

After some struggle, I managed to compile the latest version of blasr and pbalign on Ubuntu 16.04 using pitchfork. I also managed to install the R packages h5r, pbh5, and seqPatch. If anyone is reading this and has trouble with R and HDF5 libraries under Ubuntu, see my question here.

What I don't have:

I don't have access to a server with the SMRT Link platform.

What I would like to do:

I would like to access the interpulse duration (IPD), as explained in this white paper, and preferably access this data in R. I am open to suggestions for other tools/programming languages as well though.

Problem:

I need cmp.h5 files to load the IPDs with the R packages. How do I generate those? When I try to run pbalign, it says "pbalign no longer supports CMP.H5 Output in 3.0". Is there any other way to get to the IPDs without going through cmp.h5 files?

Thank you very much for your time.

(Edit note: I realized my questions were too broad and specified a single problem to start with instead.)

sequencing next-gen software error alignment • 5.3k views

updated 5.9 years ago by lr65358 ▴ 20 • written 6.7 years ago by bgbrink ▴ 60

2

PacBio has a wiki dedicated to training for PacBio data, in case you have not discovered it.

Here is technical info about the h5 format PacBio uses and the tools they provide.

6.7 years ago by GenoMax 147k

0

I did see the training, but it was not very helpful, since most of it is tailored to PacBio's SMRT Portal/SMRT Link platform (what's the difference anyway?). I will have another look though, since I also missed the python script you mentioned. Thanks a lot for pointing that out!

6.7 years ago by bgbrink ▴ 60

1

5.9 years ago

lr65358 ▴ 20

You need an older version of pbalign.

conda install -c bioconda blasr

git clone https://github.com/PacificBiosciences/pbalign.git

cd pbalign

git checkout 6c8618cfee963e2167100cb0b293aedf85f32dcf

sudo pip install .

5.9 years ago by lr65358 ▴ 20

score 5 · Accepted Answer · 2018-03-07

What organism is your data from?
What is the genome size of the organism? What coverage do you have?
Which PacBio instrument was your data generated on: RSII (output is a bax.h5 file) or Sequel (output is an unaligned .bam file)?

The instrument that the data was generated on is important as it will determine the BFX tools you can use for analysis.

SMRT Portal is designed to be used with RSII data and accepts raw bax.h5 files as input. It was last updated in Nov 2014, so unfortunately it is not actively maintained. You will be reduced to scrolling through outdated GitHub info and the interwebs for analysis help.
SMRT Link is designed to be used with Sequel data and accepts an unaligned .bam file as input. It is somewhat backwards compatible with RSII data: there is a bax2bam command that converts RSII data to the same unaligned BAM format as Sequel data. This is sufficient for most applications, but I don't think it works (correct me if I am wrong?) for base mod work, because the IPD info is not preserved during file conversion.
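
To make the split above concrete, here is a minimal Python sketch that guesses the instrument and the matching analysis suite from a raw-data file name. The function name, the return labels, and the example file names are all hypothetical; the mapping only encodes the rule of thumb stated above (bax.h5 from RSII goes with SMRT Portal, unaligned .bam from Sequel goes with SMRT Link).

```python
def guess_platform(filename):
    """Guess (instrument, analysis suite) from a raw-data file name.

    Hypothetical helper: it only encodes the rule of thumb from this answer
    and does not inspect the file contents.
    """
    name = filename.lower()
    if name.endswith(".bax.h5"):
        return ("RSII", "SMRT Portal")
    if name.endswith(".bam"):
        return ("Sequel", "SMRT Link")
    return ("unknown", "unknown")


if __name__ == "__main__":
    # Made-up file names, for illustration only.
    for f in ["example_movie.1.bax.h5", "example_movie.subreads.bam"]:
        print(f, "->", guess_platform(f))
```

Keep in mind that a .bam file could also be RSII data that was converted with bax2bam, so the extension alone is only a hint.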

I am not aware of a way to get around using a cmp.h5 file for IPD information. It will also likely be hard to avoid downloading and using either SMRT Portal or SMRT Link in one way or another. Luckily, both of them can be used relatively easily on a workstation (depending on the genome size of your organism). SMRT Portal can be run with its GUI on a workstation, and SMRT Link can be run in command-line-only mode (without having to set up a full SMRT Link server).

SMRT Portal

Here is the download for SMRT Portal: go to the bottom and click the link "Previous release of SMRT Analysis for RSII".
Here is a SMRT Portal Help Page: this can help you use the GUI.
Here is a Base Mod Technical Note: you should look at this walkthrough regardless of whether you use SMRT Portal or SMRT Link, as the output files are similar and so is the process.

SMRT Link

Here is the download for SMRT Link
Here are the official instructions for downloading SMRT Link: see page 8 for the command-line-only tools.
Here is a Biostars link to instructions on how to install the command-line-only tools. It is wayyyy better than the official instructions.
Here is the SMRT Tools Reference Guide: it is an in-depth list of all the commands available in SMRT Link and their options. Check out pages 41-43 for MotifMaker.
Check out pages 52-58 of the reference guide for pbsmrtpipe. This will allow you to run the whole ds_motif_modification_analysis pipeline, which generates the .csv file seen in the white paper. Specifically, page 58 shows the command that would run this pipeline (with a slight ID modification).

Once you have run a base mod pipeline in either SMRT Portal or SMRT Link (via pbsmrtpipe), you should have output .csv, .gff, and cmp.h5 files that you can do tertiary analysis on using whatever you want. There are also a few tools available from PacBio that run downstream of the initial analysis.
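
As one possible starting point for that tertiary analysis, here is a minimal Python sketch that filters a per-base modifications .csv for positions with an elevated IPD ratio. The column names (refName, tpl, strand, ipdRatio, coverage, and so on) are an assumption about what the base mod pipeline writes, so check them against the header of your own file. The sample rows and both cutoffs are made up for illustration.

```python
import csv
import io

# Made-up example rows. The column names are an ASSUMPTION about the
# per-base .csv layout; compare them with the header of your own file.
SAMPLE_CSV = """refName,tpl,strand,base,score,tMean,tErr,modelPrediction,ipdRatio,coverage
ref000001,1,0,A,2,0.962,0.160,1.007,0.955,27
ref000001,2,0,T,31,2.110,0.240,0.990,2.131,28
ref000001,3,1,G,5,1.100,0.200,1.050,1.048,12
"""


def high_ipd_sites(handle, min_ratio=2.0, min_coverage=20):
    """Yield (refName, position, strand, ipdRatio) for rows passing both cutoffs.

    The default cutoffs are hypothetical, not recommended values.
    """
    for row in csv.DictReader(handle):
        ratio = float(row["ipdRatio"])
        coverage = int(row["coverage"])
        if ratio >= min_ratio and coverage >= min_coverage:
            yield (row["refName"], int(row["tpl"]), int(row["strand"]), ratio)


if __name__ == "__main__":
    # Replace io.StringIO(SAMPLE_CSV) with open("your_file.csv", newline="")
    for site in high_ipd_sites(io.StringIO(SAMPLE_CSV)):
        print(site)
```

Only the standard library is used here, so it runs without any of the HDF5 dependencies; the same filtering is easy to reproduce in R with read.csv if you prefer to stay there.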

PacBio Base Mod tools: this is the link to their GitHub of additional tools. It looks like only kineticsTools, MotifMaker, and MotifFinder have been updated for Sequel data.

There are also a handful of methods developed by other researchers that use IPD / base mod data from PacBio sequencing. You could look at some of the published papers to get additional ideas.

Many papers that use this type of IPD work are listed on PacBio's website: look at the publication section at the bottom of the page.
Metagenomic binning and association of plasmids with bacterial host genomes using DNA methylation. It would be cool to see what BFX pipeline they used. It is from Mt Sinai and SEMA4. Unfortunately, it is behind a paywall.
AgIn: measuring the landscape of CpG methylation of individual repetitive elements. AgIn is unique because it used the IPD data to map regional methylation (CpG) islands in eukaryotic organisms (large genomes) at only 20-40 fold coverage of PacBio data.