via @BioMickWatson
Looks pretty cool. Perhaps once again the answer to all bfx questions will be BLAST
— Mick W@tson 🐄 (@BioMickWatson) September 9, 2016
RE https://t.co/4D5e9QQnrb pic.twitter.com/bwW3y0yl2n
Magic-BLAST is a tool for mapping large next-generation RNA or DNA sequencing runs against a whole genome or transcriptome. Each alignment optimizes a composite score, taking into account simultaneously the two reads of a pair, and in case of RNA-seq, locating the candidate introns and adding up the score of all exons. This is very different from other versions of BLAST, where each exon is scored as a separate hit and read-pairing is ignored.
Magic-Blast incorporates within the NCBI BLAST code framework ideas developed in the NCBI Magic pipeline, in particular hit extensions by local walk and jump, which is faster than Smith-Waterman extension (http://www.ncbi.nlm.nih.gov/pubmed/26109056), and recursive clipping of mismatches near the edges of the reads, which avoids accumulating artefactual mismatches near splice sites and is needed to distinguish short indels from substitutions near the edges.
We call the whole next generation run (from Illumina, Roche-454, ABI, or another sequencing platform excluding SOLiD), a query. The input reads may be provided as SRA accession or file in a SRA, FASTA, FASTQ, or FASTC format. Read pairs can be presented as parallel files, or as successive reads in a single file.
The reference genome or transcriptome can be given as a BLAST database or a FASTA file. It is preferable to use BLAST database for large genomes, such as human, or transcript collections, such as all of RefSeq, Ensembl, or AceView. The procedure for creating a BLAST database is described below.
And how does it perform?
I tested Magicblast with a list of unique kmer (23 nt long) and mapped them using Magicblast. It finds most back at the right position. It could even find split mapping of single 23 nt kmers with insertions of thousands of nucleotides in between. However, it also filters out some of the kmer sequences on unclear criteria. The filtered out unique kmer was found by jellyfish and could be mapped back by bwa and bowtie as unique kmers. Example of a filtered out kmer (mapped against the quinoa genome sequence in https://www.nature.com/articles/nature21370 (also searchable on NCBI):
result with bowtie:
233 0 C_Quinoa_Scaffold_2862 818606 255 23M * 0 0 CATCTGTCTTCTCTTCGCGTCGA IIIIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:23 NM:i:0 XM:i:2
result with magicblast:
233 4 * 0 0 * * 0 0 CATCTGTCTTCTCTTCGCGTCGA * YF:Z:F
That is a pity because magicblast gives the total number of hits (with up to a specified number of edits) and that is a handy function. It should of course not miss any hits that are totally obviously correct.
Adding one extra nucleotide (also a non-matching) gives the right mapping position in Magicblast. All other sequences in a set of about 80 with length 23 nt did map correctly.
Why did Magicblast filter out this sequence?
Just found out the solution: add -validate_seqs false to the command line and the problem is gone. Apparently, magicblast see a problem in the above read and finds it might not be a high quality read. Maybe because of the repeat of TCTTC in it? Criteria for validate sequences not described in the manual.
Source and binaries for those that are interested: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/magicblast/1.0.0/
That will save SO much time to take SRA files or accessions directly. Also, what is FASTC? All I can find is this project that generates a compression format of the same name. It seems unlikely that they'd support an arbitrary format like this so I'm curious what this means.