In order to polish my Falcon assembly of my D. suzukii genome, from PacBio long reads, I need to align the raw reads to my genome and create a .bam or a .sam file.
Such a tool already exist at PacBio, called blasr, but is currently not working (the 5.3 version I've installed do not provide a .bam or a .sam output... Which is VERY annoying).
Does an alternative of such aligner exist ? It has to be a aligner aware of the higher error rate of PacBio reads (that's why I can't use a classic aligner).
Any ideas ? I'm truely desesperate, I absolutely need to polish my assembly before going further in my analysis.
is a versatile sequence alignment program that aligns DNA or mRNA
sequences against a large reference database. minimap2 is tens of times faster than mainstream long-read mappers such as BLASR, BWA-MEM, NGMLR and GMAP.
Roxane Boyer : With BBMap PacBio reads needed to be in fasta format and those above 6kb were broken into 6kb pieces during search automagically (with an _1, _2 etc). Since Brian Bushnell does not remove functionality from BBMap suite it should still work. Also see this original post. Try and let us know.
I'm coming back to you again about BBmap.
After having some issues with bwa-mem (depending on the number of threads used, I didn't get the same results), I want to try bbmap. I understood that the large reads from my pacbio sequencing will have to be sliced in smaller pieces. However, as it's my first time using this tool, I was looking for some advanced parameters for PacBio reads, here what I found ( http://seqanswers.com/forums/showthread.php?t=58221 ) :
The "maxlen" flag shreds them to a max length of 1000; you can set
that up to 6000. But I found 1000 gave a higher mapping rate.
But this section was "Oxford Nanopore" section, I assume that the author meant long reads generally, but as I'm not really sure, I wanted to have your opinion.
Should I use 6kb or 1kb for the reads slicing for PacBio data ?
What do you mean limited to 6kb ? The read length ? Because some of my reads are much longer than 6kb.
Anyway, thanks for the tools I'm going to have a look on it
./bwa index /media/loutre/SUZUKII/assembly_tries/pb_268_p_ctg.fasta
./bwa mem -x pacbio /media/loutre/SUZUKII/assembly_tries/pb_268_p_ctg.fasta /media/DATAPART1/Documents/suzukii_assembly/filtered_subreads.fastq | gzip -3 > /media/loutre/SUZUKII/assembly_tries/falcon_aligned_reads.sam.gz
Then I wanted to visualize my alignements on IGV with the decompressed sam file, but no reads where shown, IGV said that no reads was corresponding to my genome (but it was actually the right file).
I'm not very familiar with the SAM format, but it looks like something is wrong.
Did I missed something ? Do I have to reformat the output so it look like more at something like this :
Well, it says that it failed to load the index, I guess it means it's not a real sam file ?
The construction of my file look weird for me. I'm not sure if this is because I didn't saw a lot of Samf ile, or if it's because something really went wrong.
I have had much better luck for PACBIO alignment with BLAT, followed by some filtering for "true" alignments (alignment length, number of matches, etc.). It took a bit of work to set the parameters, and I haven't dealt with it in awhile, but it worked well for finding alignments (even for partial hits when I was working with draft assemblies).
It would be useful to list those parameters, if you can dig them up. Will save people having to go through the same effort again (or at least serve as a starting point).
I would have, but everything is saved over at my previous position where I can't access it anymore. I do remember looking at the identity and coverage of each hit and using dot plots to train the parameters.
Hi Brent, thanks for your indication ! So you don't have any memory of how we can parameter BLAT for such an use ? it will be really useful for me to have such an information ! If you remember of something, to hesitate to write it down here !
Sorry to bring up the thread, but I wanted to put it down in case someone else faced the issue. I have recently used minimap2 to map the PacBio reads and then use pbbamify to put the alignment in the requisite format for Arrow.
I had about 100x data for a genome I am assembling and blasr was taking ages.
Please do give this a try!
Update: You can now simply use PBMM2 from pb-bioconda to create the aligned bams and later user pbindex to generate the BAM index if needed.
Hello and thanks for all these suggestions. These further comments might help someone else.
I have used minimap2 to map PacBio raw reads (converted to fastq as required for minimap2) against my de novo assembly. I recommend this approach over blasr. Use samtools to get the aligned.bam. Minimap2 does not output bam files.
I then used smrt tools to create a mydata.xml file using the following:
create mydata.xml /bam/*subreads.bam
Then use pbbamify to convert the aligned.bam to correct pb format for submission to genomicconsensus arrow. Took a while to work through this process - thought it might help someone else.
Minimap doesn't really care for fastq files tbh. As such the Sequel runs have series of '!!!!!!' representing the qualities anyways.
You can definitely use Minimap's "-a" parameter to get a sam file and then convert it into Arrow specific format using pbbamify.
Also take a look at pbbioconda repos. If you have the Bams from Sequel, you can directly use Pbmm2, which uses minimap api to align the reads to the genome.
Plus these days you don't really need SMRTlink tools unless you are really looking at specific analysis. Most genome analysis these days are covered in pbbioconda!
Thanks for these tips @harish. I did not know that I could directly use subreads with Pbmm2. Actually this all looks more promising as a way forward.
Pbbamify worked after 130 hours of walltime. It output a very large pb.bam (351G) from aligned.bam (36G). Arrow still needs a pbi indexed file according to error report.
Roxane Boyer : With BBMap PacBio reads needed to be in
fasta format
and those above 6kb were broken into 6kb pieces during search automagically (with an _1, _2 etc). Since Brian Bushnell does not remove functionality from BBMap suite it should still work. Also see this original post. Try and let us know.Okay thanks you a lot for the advices ! I'm going to try this !
Hi Genomax,
I'm coming back to you again about BBmap. After having some issues with bwa-mem (depending on the number of threads used, I didn't get the same results), I want to try bbmap. I understood that the large reads from my pacbio sequencing will have to be sliced in smaller pieces. However, as it's my first time using this tool, I was looking for some advanced parameters for PacBio reads, here what I found ( http://seqanswers.com/forums/showthread.php?t=58221 ) :
In this post, it is written that :
But this section was "Oxford Nanopore" section, I assume that the author meant long reads generally, but as I'm not really sure, I wanted to have your opinion.
Should I use 6kb or 1kb for the reads slicing for PacBio data ?
What do you mean limited to 6kb ? The read length ? Because some of my reads are much longer than 6kb. Anyway, thanks for the tools I'm going to have a look on it
Okay, I've tried the following lines :
Then I wanted to visualize my alignements on IGV with the decompressed sam file, but no reads where shown, IGV said that no reads was corresponding to my genome (but it was actually the right file).
Here the head of my sam file :
I'm not very familiar with the SAM format, but it looks like something is wrong. Did I missed something ? Do I have to reformat the output so it look like more at something like this :
Why do you pipe bwa mem to gzip? I haven't seen something like that before and would suggest this:
After alignment the sam file is converted to bam and sorted without intermediate files.
Well, I've just copied what is written on the README file of bwa mem ! I'm not a original person, I've just followed the example they gived :
But well, I'm going to try without compressing it.
To go back on my first file output, we agree that this sam file is empty right ? There is no informations inside it ?
To get information about the reads in a sam/bam file use:
Well, it says that it failed to load the index, I guess it means it's not a real sam file ? The construction of my file look weird for me. I'm not sure if this is because I didn't saw a lot of Samf ile, or if it's because something really went wrong.
Did you create an index of the file using the following command?
I've no experience with creating sam.gz files, so it's possible that this is the source of the problem