This tutorial addresses to anyone in possession of PacBio sequencing data. After struggling myself to find and test appropriated tools, learning PacBio specific format, asking for precision to kind PacBio guys, I've managed to gather a lot of useful information that everyone may need in order to process their data.
I will mainly focuses one of the most important step while doing PacBio genome assembly : the polishing step. Indeed, higher indel rate within PacBio raw reads make it necessary to correct the sequence obtained after assembly. This step consist in two things :
- aligning your raw PacBio reads against the reference (your freshly baked PacBio assembly)
- correct the reference base by base to obtained a polished assembly, which will become your final assembly.
PacBio changed their tools lately, getting ride of some strange format (cmp.h5). But the latest tools are not directly available on pacBio's github, and the one proposed in github are obsolete or really hard to install (I'm referring to pitchfork). Using advice from Mr Stucki working at PacBio, I managed to install easily and quickly all PacBio tools required for polishing. Today, I'm going to describe all the steps I've followed so anyone could access as well at PacBio tools.
Installation of PacBio tools
1. Download SMRT. Link: wget https://downloads.pacbcloud.com/public/software/installers/smrtlink_5.0.1.9585.zip; unzip smrtlink_5.0.1.9585.zip -d destination_folder
Just to warn you, it seems that SMRT Link is not supported on Windows. The steps I'm describing were tested on Ubuntu 14.04 Trusty Tahr.
SMRT Link is a big tool that include all services PacBio can offer. If you have access to a cluster facility and if you match the requirements, you should probably follow classic installation instructions. However, if you just need one or 2 of PacBio tools, you probably don't want to install the whole suit, which is very heavy to install. Here what you should do :
2. Add a new non-root user "smrtanalysis": sudo adduser smrtanalysis
3. Define in your bash what SMRT_USER
is: SMRT_USER=smrtanalysis
4. Log on as the freshly created user : su $SMRT_USER
-> Extra tip : type bash
to grab your bash environment and have access to auto-completion while typing in terminal
5. Define SMRT_ROOT
: SMRT_ROOT=opt/pacbio/smrtlink
-> Extra tip : The $SMRT_ROOT
directory must not exist when the installer is invoked, or installation will abort
6. Launch SMRT analysis with option "tools-only": ./smrtlink_5.0.1.9585.run --rootdir $SMRT_ROOT --smrttools-only
7. After installation completed, move into directory that contains PacBio binaries : cd /opt/pacbio/smrtlink/smrtcmds/bin
-> Extra tip : add this directory in your PATH bash variable, so you will be able to launch PacBio tool from anywhere
8. Congratulations ! You can now use every PacBio tools ! Here what you are supposed to have within this directory:
arrow DBdust ice_polish pbsmrtpipe
bam2bam DBrm ice_quiver pbsv bam2fasta
DBshow ipdSummary pbsvutil bam2fastq DBsplit
ipython pbtranscript bamtools DBstats
ipython2 pbvalidate bamtools-2.4.1 dexta
juliet picking_up_ice bax2bam fasta2DB
julietflow python blasr fasta-to-gmap-reference LA4Falcon
python2 ccs fasta-to-reference LA4Ice
python2.7 cleric fuse laa quiver
daligner gmap LAmerge REPmask
daligner_p gmap_build LAsort samtools
datander gmapl motifMaker sawriter dataset
HPC.daligner ngmlr separate_flnc dazcon
HPC.REPmask pbalign summarizeModifications DB2Falcon
HPC.TANmask pbdagcon TANmask DB2fasta
ice_combine_cluster_bins pbindex undexta DBdump
ice_partial pbservice variantCaller
Assembly polishing with arrow
Now that you successfully installed PacBio tools (I hope !), you can start to do use them. As i said, I'm going to focus on the polishing step. The two main tools for polishing are arrow and quiver, two different algorithms included in the variantCaller tool (the polishing tool). If you already wonder about the differences about the two tools, here the differences between the two of them :
Quiver is supported for PacBio RS data. Arrow is supported for PacBio Sequel data and RS data with the P6-C4 chemistry.
As I understand it on PacBio github (https://github.com/PacificBiosciences/GenomicConsensus), the latest and thus more efficient tool is arrow. As I was myself very disappointed by quiver result (my case is particular, high polymorphism degree), I'm only going to describe polishing steps with the latest arrow algorithm.
Before to describe steps, I'm going to explain a few of my vocabulary about PacBio data. PacBio result, at least for my project, but I heard it was all the same everywhere, are organized by runs. Each run can contain several folder corresponding to one SMRT_cell. In an SMRT_cell directory, you normally have 3 .bax.h5 files, and 1 .bas.h5 files. What I call an SMRT_cell are the 3 bax.h5 files.
Also, something really important I've made, you may probably want to know the polishing coverage you are going to use. If you have a small data set with reasonable coverage (40-50X), then you should probably use all your raw reads for polishing. However, if you dataset is bigger, then you should probably take a subset of your raw reads. I've heard that polishing at very high coverage (>100X), take longer without improving considerably the quality of your assembly. 80X polishing coverage is what I used for correcting my assembly. You can roughly estimate your polishing coverage by converting one SMRT_cell (3 bax files) into fasta format by using the tool pls2fasta (is present in blasr package). Then you can count the total numbers of bases for that SMRT_cell and calculate your coverage with the size of your genome species.
An example of pls2fasta command I used: pls2fasta rawReadPB_s1_p0.2.bax.h5 rawReadPB_s1_p0.2.fasta -trimByRegion
Now everything is explained, let's get thing started :
1. Create a .fofn file for each SMRT_cell. A fofn file (File Of Files Names) look like this :
/media/loutre/GENOMIC/pacbio_reads/run2/A02_1/Analysis_Results/rawPb1_s1_p0.1.bax.h5
/media/loutre/GENOMIC/pacbio_reads/run2/A02_1/Analysis_Results/rawPb2_s1_p0.2.bax.h5
/media/loutre/GENOMIC/pacbio_reads/run2/A02_1/Analysis_Results/rawPb3_s1_p0.3.bax.h5
I can't save you with an extra tip this time.. I did this by hand (maybe someone will give a tip in comment ?)
2. For each .fofn file, launch bax2bam
: a tool that will simply convert your raw bax file into a more convenient format. But be careful, theses BAM are not alignment yet. Here some bash scraps that may help you starting (sorry code insertion was not working for this):
FOFN_DIR="/media/loutre/GENOMIC/pacbio_reads/fofn"
COMPT=1
for FOFN in "$FOFN_DIR"/* do
echo "Dealing with file "$FOFN
./bax2bam -f $FOFN -o $BAM_DIR"/SMRT-cell-"$COMPT --subread --pulsefeatures=DeletionQV,DeletionTag,InsertionQV,IPD,MergeQV,SubstitutionQV,PulseWidth,SubstitutionTag
COMPT=$((COMPT+1))
done
This command will create several bam files :
.scraps.bam : contains all the sequences that were filtered out: low quality regions of the polymerase reads, adapter sequences (, barcodes, control sequences). These were stored (and labelled) in the bax files, and are basically “split” to a different BAM file when converting.
.subreads : your high-quality reads that you want to use for pbalign
.pbi : pacBio index
3. For each PacBio subreads.bam , launch pbalign to align your reads against your reference. This can produce either a bam or a sam file. This time, the output is an alignment. You can launch pbalign as following :
ALIGN_DIR="/media/loutre/GENOMIC/pacbio_reads/pbalign"
for PB_BAM in "$BAM_DIR"/*"subreads.bam"
do
echo "Dealing with file "$PB_BAM
./pbalign --nproc 32 $PB_BAM your/assembly.fasta $ALIGN_DIR"/align_"$COMPT".bam"
COMPT=$((COMPT+1))
done
-> Extra tip : don't forget to adjust the number of cores to your own possibility
4. Merge all the produced bam files using samtools merge
-> Extra tip : you will probably have to sort all your bam files before merging them.
5. Use arrow using your reference file and your merged bam. My advise is to personalize this step depending on your need. Several options are proposed, like a diploid option to allow more polymorphism, also a lot of read selection filtering options are available. So no copy/paste for this time ! -> Extra tip : You should have an eye on arrow help : arrow -h
... Obvious Ostrich strikes again !
I really hope all theses informations could be useful to anyone. I really struggled to gather all this information. Took me almost 2 hours to write this tutorial. Of course, this is not the absolute way of using PacBio tools ! But if you are lost, starting with PacBio tools, you can use this as a start to your analysis...
I'm open to feed back, corrections, precision, questions... I just wanted to help!
Cheers,
Roxane
Hi Roxane, thanks for this nice howto. That's what is missing in the PacBio docs and also in Canu docs.
You could use a dataset when aligning the raw reads back to the assembly.fasta, in this case you would only need one single "pbalign" call and there would be no need to merge BAM files.
and then
Dear sklages, the right syntax for dataset create is :
Thanks for that tip ! So when you create the dataset, you can specify the desired coverage for example ?
No, I don't think so. With command "create" it simply creates a XML file with all relevant infos about the BAM files.
For pacbio sequel data, I am running below command which fails with exit status
Is that command fine? I am looking at this link
sample.fa
: is the pacbio sequel fastq file converted to fasta format using samtoolssample.scaffolds.fa
: is the scaffold fasta file. scaffolding of canu assembled contigs was done using SMISThanks so much Roxane - that's really great and will be helpful for the future! I may be using PacBio.
You're welcome ! Glad it could help :)
Hello, Thank you for the post. Do you have any idea about how to improve available assembly using PacBio reads?
I never had to do such a thing, only tried de novo assembly. I would suggest you to read these post (if it's not already done), you can probably grab some useful answers : Gap-filling and scaffolding using PacBio reads Scaffolding with pacbio reads
Thanks Roxane!
I am about to polish my contigs, and your post is really helpful! My pacbio reads have primer sequences at the both ends of reads ( ~ 60nt). So I ended up having to generate the CCS reads, trim the adapters and assemble it with Canu. But since I can't modify the sam file of subreads, I am afraid these adapters may ruin the alignment and generate errors in polishing. Do you have any idea about this?
Hi Kyungyong !
So I'm not really used to PacBio dataset with adaptaters, but I saw an option in the bam2bam tools from pacbio that allow to specify adaptaters and barcodes sequence. Maybe it allow a kind of filter to put apart adapaters form the output bam ?
Here is the SMRT Tools Guide that may be really useful : http://www.pacb.com/wp-content/uploads/SMRT-Tools-Reference-Guide-v4.0.0.pdf
I also found an other pacbio tools called lima : https://github.com/PacificBiosciences/barcoding
To me, it look like something you can do. But maybe I'm wrong... Tell me if theses were useful to you !
Roxane
Dear Roxane,
Thanks for a detailed post on Pacbio. I had been working on Illumina data but now I have been assigned to work on Pacbio data. Since I am new to this, I have few queries in mind. Firstly: I have 3 bax.h5 and 1 bas.h5 files other than .csx ans .xml file. If i am not mistaken, bas.h5 file only contains the quality information for the reads. So, for analysis we require .bax.h5 files. Also each .bax.h5 files correspond to different sample. After installation of SMRT, the first step is polishing using bax2bam that will remove all low quality and adapter sequences from the reads. After aligning the subreads.bam file to reference, why do we need to merge all bam files. And you you mean merge all bam files coming from each .bax.h5.
I need to perform quasispecie analysis for my data. once i have bam file using pbalign can i directly use the bam file to reconstruct quaasispecies or there are other necessary steps to take before i do that.
Much Thanks, zusman
Dear Zusman,
Everything you said is correct so far.
But here, I'm not sure you clearly understood the process. bax2bam is not yet the tool that will perform polishing. bax2bam allow you to convert the "weird" raw PacBio format into a more convenient format : bam (note this is not yet an alignement). So bax2bam does not remove any read, it just convert raw reads into bam reads.
Not really sure about your question here. If you have follow my logic above (but once again, this is a way of doing it, not necessary THE way), you will end with a alignement bam file for each SMRT cell (3 bax.h5 file). You are merging the bam file because arrow (this one is the polishing tool), takes only one alignement file as an input.
I have absolutely no idea (yet) of what quaasispecies is. I think you should explain this to me a little bit further.
I'm not sure I answered your question. Tell me if I missed something in your questions.
Cheers,
Roxane
Can you guide me how i can process my data. After installation, i need to perform polishing and as you mentioned you used Arrow. Can you mention the command you used. So Arrow removes the low quality reads and adapters?? Once i have performed polishing using Arrow, i can use pbalign to align my each of filtered .bax.h5 files. Or is it that we dont need to perform filtering and directly align against reference. I am confuse how pacbio works. Quasispecies is set of different variants within a viral population. My aim to to check for every patient at different timepoints how viral population behaved. For that i have to evaluate each bax.h5 file and extract information of possible quasispecies and compare across different timepoints.
arrow works on aligned data. Maybe you should read [1] to get an idea what arrow is doing.
Best, Sven
[1] = https://github.com/PacificBiosciences/GenomicConsensus/blob/master/README.md
This really help me a lot as I am new too. Do you have any idea to classify pacbio subreads into full length reads and non-full length?
Hello ! Well, I never did such a thing yet, I don't have any idea of how to do it, I'm sorry.
I'm trying to run PBalign as a step toward running Arrow. I have a large subreads.bam with ~1400x coverage. I have an alignment built with 200x coverage. I attempted to downselect my bam file with samtools to around the same coverage level but the new downselected bam file causes BLASR to freak out. I looked at the two files and they look different. any suggestions on how to make this work?
You should post this as a new stand-alone question.
These comment paths are hard to organiz, yikes.
On main Biostars page (or this page) find and click the
New Post
button at top right of the page. Paste the text of your question in there. Add relevant tags (no#
sign needed before tag names).I have the subreads.bam file and the assembly from canu. Can I just use the steps 1 (pbalign) and 2 (arrow) as mentioned above? Can you please point the arrow commands?
Hello Vijay ! As I wrote that post almost one year ago, I'm pretty sure that the PacBio command lines for arrow and quiver changed already... Also, sadly, I had to use quiver instead of arrow in the end because I had an error even using the safe install I described. That error was only when using arrow, so I never was able to try it in the end. I'm now working in a different workplace so I can't even retrieve what I tried at this time..
Hi Roxane
thanks for the response. I am happy that you looked at my comment even at this point since the original post is an year old. I understand that things can change pretty quickly specially for the evolving long read sequencing. Thanks again. I am still looking through other posts and struggling as you did before writing this post. If I find something, I will definitely post it here.
cheers
Vijay
Miss Boyer,
Thank you for your tutorial. I have a situation that I do not know how to solve. I am assembling a bacterial genome, but it seems to be the first one of its species that is sequenced. I do not have a reference genome for this species. What could I use as a reference genome? A bacteria from the same family? or E. coli? Regards Cecilio
Hi Cecilio !
You're welcome, glad this could help !
Concerning your issue, so you are doing do novo genome assembly. Do you have the estimated size of the bacteria's genome ? What sequencing technology did you used ? (pacbio ? illumina ?) In my opinion, you don't necessarily need a genome reference to perform de novo assembly. I didn't used one for my D. suzukii genome !
Miss Boyer,
Thank you for your answer. I am doing de novo assemblies of two bacteria. The sequences were obtained from pacbio. I am using Canu 1.7 & 1.8 in two different clusters. For one cluster (an Intel cluster) I am using Canu 1.8. For the other cluster (IBM) I am using Canu 1.7. The estimated size of the genomes are known.
I understand now that I do not need a reference sequence for assembling ( because I am ding a de novo assembly). However, for the next stage, "polishing", arrow needs a reference sequence (I have seen the notes on the usage of arrow). What reference sequence should I use for polishing?
Also, I have some issues with the assemblies. It seems that there may be some DNA from undesired bacteria in the pacbio output (perhaps contamination of the bacterial cultures).
I was advised to use BWA-mem (https://github.com/lh3/bwa) to map the reads to the known contaminant (I got the contaminant bacteria sequence from GenBank). After mapping, I should discard the reads that map to the contaminant and use the un-mapped reads for the assembly of my desired bacteria. This sounds good for one bacteria.
For the other one, I have got only unplaced-contings for the contaminant (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Moellerella_wisconsensis/latest_assembly_versions/GCF_001020485.1_ASM102048v1/GCF_001020485.1_ASM102048v1_assembly_report.txt) from ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Moellerella_wisconsensis/latest_assembly_versions/GCF_001020485.1_ASM102048v1
How to map my reads to such a set of contigs (108 unplaced ones)?
Sorry to overwhelm you with so many questions.
Regards,
Cecilio1
Hello again Cecilio !
To make it clear here, the reference I"m talking about here ...
...is actually your assembly. So it means you have to align your raw reads against the assembly you just made.
This post I made may not be that "old", years wise, but I don't think it is updated anymore... PacBio were changing all their tools, their formats and their technologies at that moment... I think this whole tutorial really only works with "old" data produced with the "old" PacBio machines (RSII I think ? correct me if I'm wrong), not the Sequel.
Since that day, I've switched on the Oxford Nanopore sequencing technologies and I have to admit I'm not updated anymore on the tools and methods used for PacBio :/.
So I'll say this tutorial still helps to understand what the polishing step is, but you may want indeed to make a new post and get the properly updated tools. I'll try to have a look on the nowadays PacBio methods on my side, and I'll give you my discoveries here (or on your new post), or perhaps when I'll have even more time, give a fresh update to this tutorial :o)
Have a nice day !
Cheers,
Roxane
Miss Boyer,
Thank you for the information. And thank you for your commitment to help others to understand the process of genome polishing. After I solve the contaminant issues on my reads I will move on to polishing.
Best regards,
Cecilio
cecilio11 : You may want to post part about separating the contaminants as a new question to get maximum visibility.
You can use a set of contigs to align against by creating a reference index from them.
Sir genomax, Is samtools the right tool to create a reference index from the unplaced contigs? Regards, Cecilio1
If you are using
bwa mem
for alignments you can usebwa index
for creating the index.That said since you have PacBio reads then you may want to look at
minimap2
(from author of bwa but for long reads) instead as aligner of choice (https://github.com/lh3/minimap2 ).Sir genomax, I will give minimap2 a try. Thank you for your suggestion. Cecilio1
Sir/Madam genomax,
I did run minimap2 by using the fastq "contaminated" reads against the "contaminant" bacteria sequence. I used a simple command line: minimap2 -a Ref.mmi XXXXX.subreadsFQ.fastq > subreads_alignment.sam Today, Dec 1tth, the cluster I used to run this command is under maintenance and I cannot check the results.
I assume that the next step will be to extract a list of the names of the sequences that mapped to the reference (i.e., seqeunces in the "subreads_alignment.sam" file) and remove those read from the original file (i.e., "XXXXX.subreadsFQ.fastq").
To remove the sequences in a list I will generate from "subreads_alignment.sam", I plan to use the tools suggested in a previous biostars post: One of them is: qiime filter_fasta.py -f in.fastq -o out.fastq -s list_of_sequences.txt -n We have quiime in the cluster, so I can run the above command.
One thing that worries me is that the line-command I used for mnimap2 allows for 15% divergence. I wonder if this will be a problem with closely related species, because it could potentially remove sequences that are "valid" for the target genome.
Have you used the option "asm5" (~5% divergence) or "map-pb" for pacbio reads?
Regards,
Cecilio
Hi, my subread bam file is too big, around 300G, so when I am running pbmm2, the job always got killed for the limit of memory reason. Do you have any better idea to solve this kind of problem?