Tutorial:Polish PacBio assembly with latest PacBio tools : an affordable solution for everyone
1
47
Entering edit mode
7.3 years ago
Rox ★ 1.4k

This tutorial addresses to anyone in possession of PacBio sequencing data. After struggling myself to find and test appropriated tools, learning PacBio specific format, asking for precision to kind PacBio guys, I've managed to gather a lot of useful information that everyone may need in order to process their data.

I will mainly focuses one of the most important step while doing PacBio genome assembly : the polishing step. Indeed, higher indel rate within PacBio raw reads make it necessary to correct the sequence obtained after assembly. This step consist in two things :

  • aligning your raw PacBio reads against the reference (your freshly baked PacBio assembly)
  • correct the reference base by base to obtained a polished assembly, which will become your final assembly.

PacBio changed their tools lately, getting ride of some strange format (cmp.h5). But the latest tools are not directly available on pacBio's github, and the one proposed in github are obsolete or really hard to install (I'm referring to pitchfork). Using advice from Mr Stucki working at PacBio, I managed to install easily and quickly all PacBio tools required for polishing. Today, I'm going to describe all the steps I've followed so anyone could access as well at PacBio tools.

Installation of PacBio tools

1. Download SMRT. Link: wget https://downloads.pacbcloud.com/public/software/installers/smrtlink_5.0.1.9585.zip; unzip smrtlink_5.0.1.9585.zip -d destination_folder

Just to warn you, it seems that SMRT Link is not supported on Windows. The steps I'm describing were tested on Ubuntu 14.04 Trusty Tahr.

SMRT Link is a big tool that include all services PacBio can offer. If you have access to a cluster facility and if you match the requirements, you should probably follow classic installation instructions. However, if you just need one or 2 of PacBio tools, you probably don't want to install the whole suit, which is very heavy to install. Here what you should do :

2. Add a new non-root user "smrtanalysis": sudo adduser smrtanalysis

3. Define in your bash what SMRT_USER is: SMRT_USER=smrtanalysis

4. Log on as the freshly created user : su $SMRT_USER -> Extra tip : type bash to grab your bash environment and have access to auto-completion while typing in terminal

5. Define SMRT_ROOT: SMRT_ROOT=opt/pacbio/smrtlink -> Extra tip : The $SMRT_ROOT directory must not exist when the installer is invoked, or installation will abort

6. Launch SMRT analysis with option "tools-only": ./smrtlink_5.0.1.9585.run --rootdir $SMRT_ROOT --smrttools-only

7. After installation completed, move into directory that contains PacBio binaries : cd /opt/pacbio/smrtlink/smrtcmds/bin -> Extra tip : add this directory in your PATH bash variable, so you will be able to launch PacBio tool from anywhere

8. Congratulations ! You can now use every PacBio tools ! Here what you are supposed to have within this directory:

arrow  DBdust  ice_polish  pbsmrtpipe
bam2bam  DBrm  ice_quiver  pbsv  bam2fasta  
DBshow  ipdSummary  pbsvutil bam2fastq  DBsplit
ipython  pbtranscript bamtools   DBstats                  
ipython2  pbvalidate bamtools-2.4.1  dexta                    
juliet  picking_up_ice bax2bam  fasta2DB                 
julietflow  python blasr  fasta-to-gmap-reference   LA4Falcon
python2 ccs  fasta-to-reference  LA4Ice     
python2.7 cleric  fuse  laa  quiver
daligner  gmap  LAmerge  REPmask
daligner_p  gmap_build  LAsort  samtools
datander  gmapl  motifMaker  sawriter dataset
HPC.daligner  ngmlr  separate_flnc dazcon         
HPC.REPmask  pbalign  summarizeModifications DB2Falcon
HPC.TANmask  pbdagcon  TANmask DB2fasta       
ice_combine_cluster_bins  pbindex  undexta DBdump         
ice_partial  pbservice   variantCaller

Assembly polishing with arrow

Now that you successfully installed PacBio tools (I hope !), you can start to do use them. As i said, I'm going to focus on the polishing step. The two main tools for polishing are arrow and quiver, two different algorithms included in the variantCaller tool (the polishing tool). If you already wonder about the differences about the two tools, here the differences between the two of them :

Quiver is supported for PacBio RS data. Arrow is supported for PacBio Sequel data and RS data with the P6-C4 chemistry.

As I understand it on PacBio github (https://github.com/PacificBiosciences/GenomicConsensus), the latest and thus more efficient tool is arrow. As I was myself very disappointed by quiver result (my case is particular, high polymorphism degree), I'm only going to describe polishing steps with the latest arrow algorithm.

Before to describe steps, I'm going to explain a few of my vocabulary about PacBio data. PacBio result, at least for my project, but I heard it was all the same everywhere, are organized by runs. Each run can contain several folder corresponding to one SMRT_cell. In an SMRT_cell directory, you normally have 3 .bax.h5 files, and 1 .bas.h5 files. What I call an SMRT_cell are the 3 bax.h5 files.

Also, something really important I've made, you may probably want to know the polishing coverage you are going to use. If you have a small data set with reasonable coverage (40-50X), then you should probably use all your raw reads for polishing. However, if you dataset is bigger, then you should probably take a subset of your raw reads. I've heard that polishing at very high coverage (>100X), take longer without improving considerably the quality of your assembly. 80X polishing coverage is what I used for correcting my assembly. You can roughly estimate your polishing coverage by converting one SMRT_cell (3 bax files) into fasta format by using the tool pls2fasta (is present in blasr package). Then you can count the total numbers of bases for that SMRT_cell and calculate your coverage with the size of your genome species.

An example of pls2fasta command I used: pls2fasta rawReadPB_s1_p0.2.bax.h5 rawReadPB_s1_p0.2.fasta -trimByRegion

Now everything is explained, let's get thing started :

1. Create a .fofn file for each SMRT_cell. A fofn file (File Of Files Names) look like this :

/media/loutre/GENOMIC/pacbio_reads/run2/A02_1/Analysis_Results/rawPb1_s1_p0.1.bax.h5
/media/loutre/GENOMIC/pacbio_reads/run2/A02_1/Analysis_Results/rawPb2_s1_p0.2.bax.h5
/media/loutre/GENOMIC/pacbio_reads/run2/A02_1/Analysis_Results/rawPb3_s1_p0.3.bax.h5

I can't save you with an extra tip this time.. I did this by hand (maybe someone will give a tip in comment ?)

2. For each .fofn file, launch bax2bam : a tool that will simply convert your raw bax file into a more convenient format. But be careful, theses BAM are not alignment yet. Here some bash scraps that may help you starting (sorry code insertion was not working for this):

FOFN_DIR="/media/loutre/GENOMIC/pacbio_reads/fofn"
COMPT=1
for FOFN in "$FOFN_DIR"/* do
   echo "Dealing with file "$FOFN
   ./bax2bam -f $FOFN -o $BAM_DIR"/SMRT-cell-"$COMPT --subread  --pulsefeatures=DeletionQV,DeletionTag,InsertionQV,IPD,MergeQV,SubstitutionQV,PulseWidth,SubstitutionTag
   COMPT=$((COMPT+1))
done

This command will create several bam files :

  • .scraps.bam : contains all the sequences that were filtered out: low quality regions of the polymerase reads, adapter sequences (, barcodes, control sequences). These were stored (and labelled) in the bax files, and are basically “split” to a different BAM file when converting.

  • .subreads : your high-quality reads that you want to use for pbalign

  • .pbi : pacBio index

3. For each PacBio subreads.bam , launch pbalign to align your reads against your reference. This can produce either a bam or a sam file. This time, the output is an alignment. You can launch pbalign as following :

ALIGN_DIR="/media/loutre/GENOMIC/pacbio_reads/pbalign"
for PB_BAM in "$BAM_DIR"/*"subreads.bam"
do
    echo "Dealing with file "$PB_BAM
    ./pbalign --nproc 32 $PB_BAM your/assembly.fasta $ALIGN_DIR"/align_"$COMPT".bam"
    COMPT=$((COMPT+1))
done

-> Extra tip : don't forget to adjust the number of cores to your own possibility

4. Merge all the produced bam files using samtools merge -> Extra tip : you will probably have to sort all your bam files before merging them.

5. Use arrow using your reference file and your merged bam. My advise is to personalize this step depending on your need. Several options are proposed, like a diploid option to allow more polymorphism, also a lot of read selection filtering options are available. So no copy/paste for this time ! -> Extra tip : You should have an eye on arrow help : arrow -h... Obvious Ostrich strikes again !

I really hope all theses informations could be useful to anyone. I really struggled to gather all this information. Took me almost 2 hours to write this tutorial. Of course, this is not the absolute way of using PacBio tools ! But if you are lost, starting with PacBio tools, you can use this as a start to your analysis...

I'm open to feed back, corrections, precision, questions... I just wanted to help!

Cheers,
Roxane

pbalign pacbio arrow smrt-limk • 23k views
ADD COMMENT
1
Entering edit mode

Hi Roxane, thanks for this nice howto. That's what is missing in the PacBio docs and also in Canu docs.

You could use a dataset when aligning the raw reads back to the assembly.fasta, in this case you would only need one single "pbalign" call and there would be no need to merge BAM files.

dataset create --type SubreadSet --name myGenomeSet.xml *.subreads.bam

and then

pbalign --nproc 32 myGenomeSet.xml your/assembly.fasta myGenomeSet.bam
ADD REPLY
1
Entering edit mode

Dear sklages, the right syntax for dataset create is :

dataset create --type SubreadSet --name MyGenome  MyGenome_Set.xml *.subreads.bam
ADD REPLY
0
Entering edit mode

Thanks for that tip ! So when you create the dataset, you can specify the desired coverage for example ?

ADD REPLY
0
Entering edit mode

No, I don't think so. With command "create" it simply creates a XML file with all relevant infos about the BAM files.

ADD REPLY
0
Entering edit mode

For pacbio sequel data, I am running below command which fails with exit status

date && time pbalign --tmpDir tmp --nproc 55 sample.fa sample.contigs.fasta out.sam

Is that command fine? I am looking at this link

sample.fa: is the pacbio sequel fastq file converted to fasta format using samtools

sample.scaffolds.fa: is the scaffold fasta file. scaffolding of canu assembled contigs was done using SMIS

ADD REPLY
0
Entering edit mode

Thanks so much Roxane - that's really great and will be helpful for the future! I may be using PacBio.

ADD REPLY
0
Entering edit mode

You're welcome ! Glad it could help :)

ADD REPLY
0
Entering edit mode

Hello, Thank you for the post. Do you have any idea about how to improve available assembly using PacBio reads?

ADD REPLY
0
Entering edit mode

I never had to do such a thing, only tried de novo assembly. I would suggest you to read these post (if it's not already done), you can probably grab some useful answers : Gap-filling and scaffolding using PacBio reads Scaffolding with pacbio reads

ADD REPLY
0
Entering edit mode

Thanks Roxane!

I am about to polish my contigs, and your post is really helpful! My pacbio reads have primer sequences at the both ends of reads ( ~ 60nt). So I ended up having to generate the CCS reads, trim the adapters and assemble it with Canu. But since I can't modify the sam file of subreads, I am afraid these adapters may ruin the alignment and generate errors in polishing. Do you have any idea about this?

ADD REPLY
0
Entering edit mode

Hi Kyungyong !

So I'm not really used to PacBio dataset with adaptaters, but I saw an option in the bam2bam tools from pacbio that allow to specify adaptaters and barcodes sequence. Maybe it allow a kind of filter to put apart adapaters form the output bam ?

Here is the SMRT Tools Guide that may be really useful : http://www.pacb.com/wp-content/uploads/SMRT-Tools-Reference-Guide-v4.0.0.pdf

I also found an other pacbio tools called lima : https://github.com/PacificBiosciences/barcoding

To me, it look like something you can do. But maybe I'm wrong... Tell me if theses were useful to you !

Roxane

ADD REPLY
0
Entering edit mode

Dear Roxane,

Thanks for a detailed post on Pacbio. I had been working on Illumina data but now I have been assigned to work on Pacbio data. Since I am new to this, I have few queries in mind. Firstly: I have 3 bax.h5 and 1 bas.h5 files other than .csx ans .xml file. If i am not mistaken, bas.h5 file only contains the quality information for the reads. So, for analysis we require .bax.h5 files. Also each .bax.h5 files correspond to different sample. After installation of SMRT, the first step is polishing using bax2bam that will remove all low quality and adapter sequences from the reads. After aligning the subreads.bam file to reference, why do we need to merge all bam files. And you you mean merge all bam files coming from each .bax.h5.

I need to perform quasispecie analysis for my data. once i have bam file using pbalign can i directly use the bam file to reconstruct quaasispecies or there are other necessary steps to take before i do that.

Much Thanks, zusman

ADD REPLY
1
Entering edit mode

Dear Zusman,

If i am not mistaken, bas.h5 file only contains the quality information for the reads. So, for analysis we require .bax.h5 files. Also each .bax.h5 files correspond to different sample.

Everything you said is correct so far.

Also each .bax.h5 files correspond to different sample. After installation of SMRT, the first step is polishing using bax2bam that will remove all low quality and adapter sequences from the reads.

But here, I'm not sure you clearly understood the process. bax2bam is not yet the tool that will perform polishing. bax2bam allow you to convert the "weird" raw PacBio format into a more convenient format : bam (note this is not yet an alignement). So bax2bam does not remove any read, it just convert raw reads into bam reads.

After aligning the subreads.bam file to reference, why do we need to merge all bam files. And you you mean merge all bam files coming from each .bax.h5.

Not really sure about your question here. If you have follow my logic above (but once again, this is a way of doing it, not necessary THE way), you will end with a alignement bam file for each SMRT cell (3 bax.h5 file). You are merging the bam file because arrow (this one is the polishing tool), takes only one alignement file as an input.

I need to perform quasispecie analysis for my data. once i have bam file using pbalign can i directly use the bam file to reconstruct quaasispecies or there are other necessary steps to take before i do that.

I have absolutely no idea (yet) of what quaasispecies is. I think you should explain this to me a little bit further.

I'm not sure I answered your question. Tell me if I missed something in your questions.

Cheers,

Roxane

ADD REPLY
0
Entering edit mode

Can you guide me how i can process my data. After installation, i need to perform polishing and as you mentioned you used Arrow. Can you mention the command you used. So Arrow removes the low quality reads and adapters?? Once i have performed polishing using Arrow, i can use pbalign to align my each of filtered .bax.h5 files. Or is it that we dont need to perform filtering and directly align against reference. I am confuse how pacbio works. Quasispecies is set of different variants within a viral population. My aim to to check for every patient at different timepoints how viral population behaved. For that i have to evaluate each bax.h5 file and extract information of possible quasispecies and compare across different timepoints.

ADD REPLY
0
Entering edit mode

arrow works on aligned data. Maybe you should read [1] to get an idea what arrow is doing.

Best, Sven

[1] = https://github.com/PacificBiosciences/GenomicConsensus/blob/master/README.md

ADD REPLY
0
Entering edit mode

This really help me a lot as I am new too. Do you have any idea to classify pacbio subreads into full length reads and non-full length?

ADD REPLY
0
Entering edit mode

Hello ! Well, I never did such a thing yet, I don't have any idea of how to do it, I'm sorry.

ADD REPLY
0
Entering edit mode

I'm trying to run PBalign as a step toward running Arrow. I have a large subreads.bam with ~1400x coverage. I have an alignment built with 200x coverage. I attempted to downselect my bam file with samtools to around the same coverage level but the new downselected bam file causes BLASR to freak out. I looked at the two files and they look different. any suggestions on how to make this work?

ADD REPLY
1
Entering edit mode

You should post this as a new stand-alone question.

ADD REPLY
0
Entering edit mode

These comment paths are hard to organiz, yikes.

ADD REPLY
0
Entering edit mode

On main Biostars page (or this page) find and click the New Post button at top right of the page. Paste the text of your question in there. Add relevant tags (no # sign needed before tag names).

ADD REPLY
0
Entering edit mode

I have the subreads.bam file and the assembly from canu. Can I just use the steps 1 (pbalign) and 2 (arrow) as mentioned above? Can you please point the arrow commands?

ADD REPLY
0
Entering edit mode

Hello Vijay ! As I wrote that post almost one year ago, I'm pretty sure that the PacBio command lines for arrow and quiver changed already... Also, sadly, I had to use quiver instead of arrow in the end because I had an error even using the safe install I described. That error was only when using arrow, so I never was able to try it in the end. I'm now working in a different workplace so I can't even retrieve what I tried at this time..

ADD REPLY
0
Entering edit mode

Hi Roxane

thanks for the response. I am happy that you looked at my comment even at this point since the original post is an year old. I understand that things can change pretty quickly specially for the evolving long read sequencing. Thanks again. I am still looking through other posts and struggling as you did before writing this post. If I find something, I will definitely post it here.

cheers

Vijay

ADD REPLY
0
Entering edit mode

Miss Boyer,

Thank you for your tutorial. I have a situation that I do not know how to solve. I am assembling a bacterial genome, but it seems to be the first one of its species that is sequenced. I do not have a reference genome for this species. What could I use as a reference genome? A bacteria from the same family? or E. coli? Regards Cecilio

ADD REPLY
1
Entering edit mode

Hi Cecilio !

You're welcome, glad this could help !

Concerning your issue, so you are doing do novo genome assembly. Do you have the estimated size of the bacteria's genome ? What sequencing technology did you used ? (pacbio ? illumina ?) In my opinion, you don't necessarily need a genome reference to perform de novo assembly. I didn't used one for my D. suzukii genome !

ADD REPLY
1
Entering edit mode

Miss Boyer,

Thank you for your answer. I am doing de novo assemblies of two bacteria. The sequences were obtained from pacbio. I am using Canu 1.7 & 1.8 in two different clusters. For one cluster (an Intel cluster) I am using Canu 1.8. For the other cluster (IBM) I am using Canu 1.7. The estimated size of the genomes are known.

I understand now that I do not need a reference sequence for assembling ( because I am ding a de novo assembly). However, for the next stage, "polishing", arrow needs a reference sequence (I have seen the notes on the usage of arrow). What reference sequence should I use for polishing?

Also, I have some issues with the assemblies. It seems that there may be some DNA from undesired bacteria in the pacbio output (perhaps contamination of the bacterial cultures).

I was advised to use BWA-mem (https://github.com/lh3/bwa) to map the reads to the known contaminant (I got the contaminant bacteria sequence from GenBank). After mapping, I should discard the reads that map to the contaminant and use the un-mapped reads for the assembly of my desired bacteria. This sounds good for one bacteria.

For the other one, I have got only unplaced-contings for the contaminant (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Moellerella_wisconsensis/latest_assembly_versions/GCF_001020485.1_ASM102048v1/GCF_001020485.1_ASM102048v1_assembly_report.txt) from ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Moellerella_wisconsensis/latest_assembly_versions/GCF_001020485.1_ASM102048v1

How to map my reads to such a set of contigs (108 unplaced ones)?

Sorry to overwhelm you with so many questions.

Regards,

Cecilio1

ADD REPLY
1
Entering edit mode

Hello again Cecilio !

To make it clear here, the reference I"m talking about here ...

aligning your raw PacBio reads against the reference (your freshly baked PacBio assembly)

...is actually your assembly. So it means you have to align your raw reads against the assembly you just made.

This post I made may not be that "old", years wise, but I don't think it is updated anymore... PacBio were changing all their tools, their formats and their technologies at that moment... I think this whole tutorial really only works with "old" data produced with the "old" PacBio machines (RSII I think ? correct me if I'm wrong), not the Sequel.

Since that day, I've switched on the Oxford Nanopore sequencing technologies and I have to admit I'm not updated anymore on the tools and methods used for PacBio :/.

So I'll say this tutorial still helps to understand what the polishing step is, but you may want indeed to make a new post and get the properly updated tools. I'll try to have a look on the nowadays PacBio methods on my side, and I'll give you my discoveries here (or on your new post), or perhaps when I'll have even more time, give a fresh update to this tutorial :o)

Have a nice day !

Cheers,

Roxane

ADD REPLY
1
Entering edit mode

Miss Boyer,

Thank you for the information. And thank you for your commitment to help others to understand the process of genome polishing. After I solve the contaminant issues on my reads I will move on to polishing.

Best regards,

Cecilio

ADD REPLY
0
Entering edit mode

cecilio11 : You may want to post part about separating the contaminants as a new question to get maximum visibility.

You can use a set of contigs to align against by creating a reference index from them.

ADD REPLY
0
Entering edit mode

Sir genomax, Is samtools the right tool to create a reference index from the unplaced contigs? Regards, Cecilio1

ADD REPLY
1
Entering edit mode

If you are using bwa mem for alignments you can use bwa index for creating the index.

That said since you have PacBio reads then you may want to look at minimap2 (from author of bwa but for long reads) instead as aligner of choice (https://github.com/lh3/minimap2 ).

ADD REPLY
0
Entering edit mode

Sir genomax, I will give minimap2 a try. Thank you for your suggestion. Cecilio1

ADD REPLY
0
Entering edit mode

Sir/Madam genomax,

I did run minimap2 by using the fastq "contaminated" reads against the "contaminant" bacteria sequence. I used a simple command line: minimap2 -a Ref.mmi XXXXX.subreadsFQ.fastq > subreads_alignment.sam Today, Dec 1tth, the cluster I used to run this command is under maintenance and I cannot check the results.

I assume that the next step will be to extract a list of the names of the sequences that mapped to the reference (i.e., seqeunces in the "subreads_alignment.sam" file) and remove those read from the original file (i.e., "XXXXX.subreadsFQ.fastq").

To remove the sequences in a list I will generate from "subreads_alignment.sam", I plan to use the tools suggested in a previous biostars post: One of them is: qiime filter_fasta.py -f in.fastq -o out.fastq -s list_of_sequences.txt -n We have quiime in the cluster, so I can run the above command.

One thing that worries me is that the line-command I used for mnimap2 allows for 15% divergence. I wonder if this will be a problem with closely related species, because it could potentially remove sequences that are "valid" for the target genome.

Have you used the option "asm5" (~5% divergence) or "map-pb" for pacbio reads?

Regards,

Cecilio

ADD REPLY
0
Entering edit mode

Hi, my subread bam file is too big, around 300G, so when I am running pbmm2, the job always got killed for the limit of memory reason. Do you have any better idea to solve this kind of problem?

ADD REPLY
3
Entering edit mode
5.2 years ago
harish ▴ 470

PSA for any one browsing this thread, Arrow tools have been branded as gcpp if you are using their bioconda environment.

For faster polishing you can also use the following approach:

1/ Align and sort using pbmm2:

pbmm2 align genome.fasta movie.subreads.bam aligned.movie.bam --sort -j 8 -J 8 -m 32G --preset SUBREAD

(aligning the PacBio BAMs to the genome)

2/ Slice and dice bams If you have multiple such aligned.movie.bam, you can use bamtools to slice and merge the bams contig/scaffold wise.

ls *.aligned.movie.bam > input.fofn (collect all the aligned bams into a list of files)
samtools faidx genome.fasta (to easily retrieve the sequence headers)
cut -f1 genome.fasta.fai > sequence_heads.txt (retrieve the sequence headers)
for i in `cat sequence_heads.txt`; do echo "bamtools merge -list input.fofn -out "$i".bam -region "$i""; done | parallel -j20 (slicing and merging multiple bams sequence wise)
for i in `cat sequence_heads.txt`; do echo "pbindex "$i""; done | parallel -j20 (generating Pacbio specific indices for the BAMs)

(Yes, I tend to loop and pipe into parallel for easier understanding :( )

3/ Run Arrow/gcpp.

Note: Split your genome fasta, so that each contig/scaffold has its own file, i.e contig1.fasta, contig2.fasta and faidx index them. This helps in parallelizing the polishing step. Also, by splitting the sequence into files, we are polishing that particular sequence. You can simply use the script I've put in here: https://raw.githubusercontent.com/harish0201/General_Scripts/master/Fasta_splitter.py

python Fasta_splitter.py genome.fasta 2> /dev/null
mkdir polished_seqs
for i in `cat sequence_heads.txt`; do echo "gcpp -r "$i".fasta -o polished_seqs/"$i".polished.fastq -o polished_seqs/"$i".polished.vcf -o polished_seqs/"$i".polished.gff -n 5 --annotateGFF --reportEffectiveCoverage "$i".bam"; done | parallel -j20

(at this step I'm generating the polished fastqs, annotated GFFs which will spew out coverages etc and a vcf, so that in case I lose the polished fastq I can still use bcftools consensus to create the modified sequence)

4/ Concatenate the genome:

cd polished_seqs
cat *.fastq > ../../polished.fastq

At this juncture you can repeat steps from 1-4 on the polished.fastq in order to polish iteratively.

Note: A few places where you will have to tune above are: 1. The data type used by you : --preset

  1. The number of threads controlled by -j or -J in pbmm2

  2. Number of jobs running in parallel: parallel -j. Since the contig/scaffold length determines the RAM usage, you will have to play around a bit. From my observation, if your N50 > 5 Mb prior to polishing, and your RAM is roughly 512Gb use 10 jobs in parallel. Natually, if the N50 > 1 Mb, but N50 < 5Mb, you can use 20 jobs in parallel.

ADD COMMENT
0
Entering edit mode

how many rounds of Arrow would you recommend ?

ADD REPLY
0
Entering edit mode

gcpp step crushes with the following error message

terminate called after throwing an instance of '(anonymous namespace)::CommandLineParserException' what(): [pbcopper] command line parser ERROR: unknown option 'n'

ADD REPLY
0
Entering edit mode

ERROR: unknown option 'n'

There is no option like that in the gcpp command line. You should check yours.

ADD REPLY
0
Entering edit mode

Dumb question: Does GCpp support MPI, or parallelisation on a cluster natively? In other words, if I can spread the polishing step across an HPC environment, I hoping to reduce my wall-time dramatically.

ADD REPLY
0
Entering edit mode

Nope - just heard back from PacBio on GitHub. Has to either done on a single-node or chunked and re-stitched together.

ADD REPLY

Login before adding your answer.

Traffic: 2630 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6