Hi Picasa !
Indeed, you will need to perform a polishing step after assembly, this will lower your error rate from ~1% (after assembly with canu) to ~0.001-0.0000001% (depending of the coverage used for polishing, you can find information in this link : https://github.com/PacificBiosciences/GenomicConsensus/blob/master/doc/FAQ.rst section "What is the expected quiver accuracy? ")
So, you indeed need a alignement file, but with PacBio, everything is a bit more complicated. Considering their bas/bax.h5 format, you will need a particular aligner (blasr you said it right) that will produce an alignement. However, this alignment will be not in bam format, but in a specific format called .cmp.h5 (cmp strand for comparison I guess).
Blasr align file by file. In order to align all your .bas/bax.h5, you are going to use an other tool called pbalign. This tool can take as an input a .fofn file (file of file ... forget about the n) which is a list of all the path that lead to your raw data, and will literally call blasr for all your raw data file and then compile their result.
Here a small example of how a .fofn file should look (sorry for horrible paths) :
/media/DATAPART1/pacbio_reads/run6/H01_1/Analysis_Results/m151220_023946_42237_c100942502550000001823214006101687_s1_p0.1.bax.h5
/media/DATAPART1/pacbio_reads/run6/H01_1/Analysis_Results/m151220_023946_42237_c100942502550000001823214006101687_s1_p0.2.bax.h5
As it was really hard for me to gather this informations, here how I used pbalign :
pbalign --forQuiver raw_reads.fofn assembly_to_polish.fasta your_output.cmp.h5
Then, you can use quiver using something like this :
quiver your_output.cmp.h5 -r assembly_to_polish.fasta -o polished_assembly.fasta
For installing pacbio tools, I redirect you to this github thread : https://github.com/PacificBiosciences/pbalign/issues/67
Don't try to install all PacBio tools by yourself, don't even try pitchfork, you will lose a lot of time just as I do.
Also, a related Biostar post here : A: Choosing de novo genome assembly
Good luck with your polishing ! This way worked fine for me after weeks or research on this subject.
If he needs bam file I think he can use
bwa mem -x pacbio ref.fa reads.fq | samtools view -F 4 -Sbh - | samtools sort - my_result_sorted.bam
instead of blasr
Yes but the problem after that is that quiver will not necesserly work well with bam files. Indeed, the particular format for PacBio files take account of various chemistery parameters and quality values. As I said in my previous post, I think that soon or later, PacBio is going to fully abandon theses format and fully use bam/sam. For now, I will not recommand to use anything else but PacBio format because I had a lot of struggle trying to use classic bam/sam format. You can see my struggle on github here : https://github.com/PacificBiosciences/pbalign/issues/48 . I also in the past tried to use quiver using a bam/sam file. It gaves me errors (I don't have wrote them to transcribe it there) because quiver wasn't taking care of sam/bam file (even if it is written on their wikipage...)
Hi Roxane and many thanks for your help !
However I don't have .bas/bax.h5 files but only fastq (my raw reads are fastq files) so can I still use pbalign with your command ? I mean, does it recognize fastq ?
It may be better overall if you can get the *.h5 files from your source.
Hi Picasa,
I agree with Genomax, I had a lot of trouble and could never achieve the final polishing step by using fastq files and pâcbio tools. You have to find your raw reads, they are essentials. Also, try to acknowledge what your sequencing coverage was, it is a very essential information !
I have the sequencing coverage information; but why is it important ?
Coverage is a major parameter with PacBio long read sequencing, because as their raw reads have bigger error rate (~15%), your strategy to lower down this error rate will be different depending on the number of PacBio reads you have.
If you are new to PacBio sequencing, I advise you to read this manual wrote by Mr Chakraborty : https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkw654
It show the importance of sequencing coverage and how to obtain good assembly metrics depending on your coverage.
You have different strategy to polish your genome, one is called Self corrected, because you use raw PacBio reads to correct your assembly, and other method are called hybrid, because, in certain cases, you can use shorter Illumina reads to correct your assembly.
Self correction is used when you have enough coverage, Hybrid correction is used when you have low coverage.
I hope I was clear with that, if you have more question about that, don't hesitate to ask me ! But I'm sure you will find all you need in this really great manual !
Hi Roxane,
I was reading this for quiver
https://github.com/PacificBiosciences/GenomicConsensus
and I would like to know if you have some benchmark using the different algorithm proposed by variantCaller (quiver,arrow,plurality,poa,best).
I mean, does it change a lot ? The link above said that apparently arrow is better than quiver
Hi Picasa !
Sadly, I never managed to get arrow to work on my computer (installation issue), same thing with the others algorithm... I will try to test it again, I also really want to know how different the two algorithm work, and especially want to know if one of them is parallelizable (as quiver is not)