Tool:BBSketch - A Tool for Rapid Sequence Comparison
0
23
Entering edit mode
7.9 years ago

MinHash Sketch is a method of rapidly comparing large strings or sets. In genomics, you can use it like this:

1) Gather all the kmers in a genome.

2) Apply a hash function to them.

3) Keep the 10000 smallest hashcodes and call this set a "sketch".

If you do this for multiple genomes, you can calculate how similar two genomes are much faster than via alignment; and specifically, the greatest advantage is that the speed of sketch comparison is unrelated to genome size. For example, if you assemble something, you can sketch it and compare it to sketches of everything in nt or RefSeq in a couple seconds. If the top hit shows that your new sketch shared 90% of its hash codes with E.coli, that means it has roughly 90% kmer identity to E.coli, and therefore, it is E.coli.

The BBMap package has an extremely efficient MinHash Sketch implementation, now accessible through 4 programs. The latest, just released in BBMap 36.92, is SendSketch:

sendsketch.sh in=contigs.fa

This will make a sketch of your assembly, open an HTTP connection to JGI's taxonomy server, and send the sketch. The server will compare it to sketches of all of nt, and return the top hits. This is kind of convenient because the sketches sit around in memory all the time rather than needing to be loaded, so you get the answer in a couple seconds. You can also add the flag "refseq" to compare to RefSeq instead of nt. Additionally, SendSketch works on raw reads (including Illumina, PacBio, and Nanopore). With Illumina the files can be huge and you don't need all of the reads so I normally run it like this:

sendsketch.sh in=reads.fq reads=400k

...to limit it to the first 400,000 reads. Raw PacBio has a high error rate so I typically increase the default sketch size to increase sensitivity:

sendsketch.sh in=PacBio.fq size=100000

You can also run a local server with "taxserver.sh" if you want to use a different database for SendSketch.

The other tools are for processing sketches locally. To sketch and query refseq, for example:

sketch.sh in=refseq.fa.gz out=refseq#.sketch files=31 mode=sequence

That will produce one sketch per sequence, and split them among 31 output files (for rapid loading). Alternatively, and preferably, you can produce one sketch per taxID. This process is a bit more involved, but the full details are given in the BBMap package in the /bbmap/pipelines/ directory, in these example shellscripts:

fetchtaxonomy.sh
fetchnt.sh
fetchRefSeq.sh

Now that you have created some sketches, you can query them like this:

comparesketch.sh in=contigs.fa refseq*.sketch

"in=" can be a comma-delimited list of either fasta or sketch files. Also, you can turn an input fasta into one sketch per sequence rather than just a single sketch with the "mode=sequence" flag. If you want to process with CompareSketch using taxonomic information, add "tree=taxtree.gz" as well (which was created by the "fetchtaxonomy.sh" script). SendSketch automatically uses taxonomy on the server-side.

So, that's how you run Sketch. Now, what does the output mean? Here's an example, using E.coli:

http://ibb.co/jLk0Ua

The complete description of columns is in /bbmap/docs/guides/BBSketchGuide.txt, but briefly, "matches" is the number of matching kmers between the query and reference sketch. KID (or kmer identity) is simply the percent of the kmers that matched, while WKID is the weighted kmer identity, which has been adjusted to compensate for genome size. The ANI, completeness, and contamination are estimates, comparing the query to that reference.

Titus Brown suggested that I cite prior art, since this is not a novel idea, so:

The MinHash concept, which is about 20 years old: https://en.wikipedia.org/wiki/MinHash

Mash: https://github.com/marbl/Mash

SourMash: https://github.com/dib-lab/sourmash

minhash sketch bbmap blast • 15k views
ADD COMMENT
0
Entering edit mode

Nice. I think this is how miniasm does it's alignments also?

ADD REPLY
0
Entering edit mode

Oh, I'm not sure; I'll have to examine miniasm in more detail. Just skimming, it looks like, "maybe" or "something similar". But Mash and SourMash certainly do use this principle.

ADD REPLY
1
Entering edit mode

Canu runs MinHash too for the first alignment of noisy reads, see Fig. 1 in 'Assembling large genomes with single-molecule sequencing and locality-sensitive hashing'

There's also kWIP, which works differently, but arrives at similar results and can be used for the same things: http://biorxiv.org/content/early/2016/10/04/075481

ADD REPLY
0
Entering edit mode

This is great. Thanks! I see documentation for building sketches by sequence or taxon -- do you have a recipe for building a reference database where the sketches each represent an assembly/genome?

ADD REPLY
1
Entering edit mode

Currently I use taxa as a proxy for assembly or genome. There is no "make one sketch per file" mode, which would be quite useful and I keep meaning to add it. But there is "one sketch" mode, so if each assembly is in a single file, you can loop over them and run one BBSketch process per file to generate one sketch per file. Then concatenate the sketches. Sketches are just text files and a .sketch file can contain one or more sketches, so concatenating them is fine.

ADD REPLY
0
Entering edit mode

What is the default kmer size for BBSketch?

ADD REPLY
0
Entering edit mode

could you please give us the link to the BBsketchGuide.txt

ADD REPLY
0
Entering edit mode

It should be in your BBMap source bbmap/38.46/bbmap/docs/guides/BBSketchGuide.txt. Version number of software may be different in your case.

ADD REPLY
0
Entering edit mode

Thank you for this new interesting tool.

I used sendsketch to annotate metagenome assembled genomes and found huge discrepancies compared to other annotations.

E.g. here I have a MAG of akkermansia (97% complete, no contamination inferred with checkm).

May this be a problem to the phylum?

The outputs of sendsketch in nucleotide and protein mode looks as follows:

Query: MAG031.fasta DB: RefSeq  SketchLen: 13872    Seqs: 23    Bases: 1135113  gSize: 1118687  File: MAG031.fasta
WKID    KID ANI Complt  Contam  Matches Unique  TaxID   gSize   gSeqs   taxName
0.05%   0.01%   75.93%  26.21%  1.33%   3   0   1262449 4208011 42  Clostridium pasteurianum DSM 525 = ATCC 6013
0.05%   0.01%   71.14%  29.12%  1.37%   3   0   1812858 3780227 51  Robinsoniella sp. MCWD5

Total Time:     3.755 seconds.
(base) [kiesers@login1 genomes]$ sendsketch.sh in=genomes/$g.fasta protein
Adding /home/kiesers/scratch/miniconda3/opt/bbmap-38.57-0/resources/blacklist_prokprot_merged.sketch to blacklist.
0.044 seconds.
Loaded 1 sketch in 0.442 seconds.

Query: MAG031.fasta DB: ProkProt    SketchLen: 6222 Seqs: 1002  SeqLen: 330118  gSize: 310565   File: MAG031.fasta
WKID    KID AAI Complt  Contam  Matches Unique  TaxID   gSize   gSeqs   taxName
0.51%   0.19%   59.71%  32.75%  13.86%  24  0   2293161 800041  2633    Ruminococcus sp. AF21-11
0.51%   0.18%   59.79%  30.47%  13.93%  23  0   2293216 857776  2890    Ruminococcus sp. AM43-6
0.49%   0.19%   59.51%  32.87%  13.85%  23  0   1160721 797656  2762    Ruminococcus bicirculans
0.48%   0.19%   59.61%  33.09%  13.84%  23  0   2293154 792481  2624    Ruminococcus sp. AF17-6
0.48%   0.19%   59.55%  33.69%  13.74%  23  0   2293155 779611  2581    Ruminococcus sp. AF17-6LB
0.48%   0.19%   59.55%  33.67%  13.74%  23  0   2293152 779956  2574    Ruminococcus sp. AF17-1AC
0.48%   0.19%   59.51%  34.13%  13.67%  23  0   2293153 770401  2530    Ruminococcus sp. AF17-24
0.48%   0.17%   59.34%  30.87%  13.92%  22  0   2293226 848675  3153    Ruminococcus sp. AM54-1NS
0.47%   0.18%   59.13%  32.70%  13.87%  22  0   2293169 801393  2676    Ruminococcus sp. AF26-25AA
0.46%   0.18%   59.10%  32.96%  13.86%  22  0   2293233 795704  2663    Ruminococcus sp. OM07-17
0.44%   0.18%   59.00%  35.62%  13.47%  22  0   2293221 741446  2633    Ruminococcus sp. AM47-2BH
0.46%   0.17%   59.14%  30.88%  13.94%  21  0   2293243 848340  3022    Ruminococcus sp. TF12-2
0.46%   0.17%   59.12%  30.97%  13.95%  21  0   2293156 846300  3010    Ruminococcus sp. AF18-29
0.46%   0.17%   59.07%  31.41%  13.89%  21  0   2293165 835853  2770    Ruminococcus sp. AF25-19
0.45%   0.17%   59.02%  31.88%  13.81%  21  0   2293149 823414  2805    Ruminococcus sp. AF16-50
0.45%   0.17%   58.99%  32.15%  13.83%  21  0   2293177 814987  2691    Ruminococcus sp. AF34-12
0.45%   0.17%   59.00%  32.11%  13.83%  21  0   2293194 816556  2722    Ruminococcus sp. AM28-13
0.45%   0.17%   58.99%  32.17%  13.84%  21  0   2293157 813803  2719    Ruminococcus sp. AF19-15
0.45%   0.17%   58.98%  32.25%  13.86%  21  0   2293202 811644  2995    Ruminococcus sp. AM31-15AC
0.44%   0.17%   58.91%  32.75%  13.91%  21  0   2293178 799840  2619    Ruminococcus sp. AF37-20
ADD REPLY
0
Entering edit mode

If it has super-low identity to anything in RefSeq, Sketch will not be able to identify it correctly. In this case the nucleotide hits are totally random; they only shared 3 sketch kmers, with estimated ANI under 76%, and Sketch is not accurate at under 76% ANI.

The protein results are much more interesting. Whatever your sample is, it definitely shows greater similarity to Ruminococcus than anything else in RefSeq, at the protein level. It's still really low, though, so it could be a result of lateral gene transfer from a totally different organism that is present in RefSeq.

The "contam" field does not actually mean your assembly is contaminated. Rather, it means that "if this is your target organism, then your assembly is contaminated".

ADD REPLY
0
Entering edit mode

I recently got a SSL error with a command that was working perfectly yesterday.

javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

Here is the full error: https://pastebin.com/raw/1RFNZhTW

ADD REPLY
1
Entering edit mode

BBSketch servers are hosted on NERSC, which is down for ~5 days starting at 8am PST this morning for electrical work. Unfortunately, that means that they are also down. In fact, everything at JGI is down until Monday or Tuesday night.

ADD REPLY
0
Entering edit mode

As bbsketch builds the sketch of 1 fasta file. What would be the best way to generate one sketch from thousands of fasta files. Similar to the behavior of mash.

ADD REPLY
0
Entering edit mode

As bbsketch builds the sketch of 1 fasta file. What would be the best way to generate one sketch from thousands of fasta files. Similar to the behavior of mash.

ADD REPLY
0
Entering edit mode

Does cating the files together work?

ADD REPLY
0
Entering edit mode

Yes, catting would work; in addition to catting the files, you can also stream them to stdin:

cat *.fasta | sketch.sh in=stdin.fa onesketch out=xyz.sketch

That said, I'm not sure why you would want to make one sketch for thousands of fasta files. If you want one sketch per input file, with all of the sketches combined in a single output file, the command would be:

sketch.sh *.fasta out=lotsOfSketches.sketch perfile
ADD REPLY
0
Entering edit mode

Perfect

sketch.sh *.fasta out=lotsOfSketches.sketch perfile

that was what I was looking for!

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

I've been having trouble with the fetch scripts:

Version: BBMap version 38.93

RefSeq

(bbmap_env) -bash-4.2$ /usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/bbmap_env/opt/bbmap-38.93-0/pipelines/fetch/fetchRefSeq.sh
java -ea -Xmx1g -Xms1g -cp /usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/bbmap_env/opt/bbmap-38.93-0/current/ tax.RenameGiToTaxid -Xmx1g in=stdin.fa.gz out=renamed.fa.gz pigz=16 unpigz zl=9 server ow maxbadheaders=5000 badheaders=badHeaders.txt bgzip
Executing tax.RenameGiToTaxid [-Xmx1g, in=stdin.fa.gz, out=renamed.fa.gz, pigz=16, unpigz, zl=9, server, ow, maxbadheaders=5000, badheaders=badHeaders.txt, bgzip]

Time:                           488.852 seconds.
Reads Processed:       39175    0.08k reads/sec
Bases Processed:       1273m    2.61m bases/sec

Valid Sequences:    39175
Valid Bases:        1273716988
Invalid Sequences:  0
Invalid Bases:      0
Exception in thread "main" java.lang.RuntimeException: tax.RenameGiToTaxid terminated in an error state; the output may be corrupt.
    at tax.RenameGiToTaxid.process(RenameGiToTaxid.java:307)
    at tax.RenameGiToTaxid.main(RenameGiToTaxid.java:39)
ADD REPLY
0
Entering edit mode

Probably best to post this as a separate question

ADD REPLY
0
Entering edit mode

I've seen that sensketch can generate a D3 json output. Can you explain how to use the D3.json output?

ADD REPLY
0
Entering edit mode

If you add the "d3" flag, an extra JSON section named "D3" will be created in the results for a query. Adding a flag like "d3anisize" will specify what value is used for the "size" key of the node (default is number of matches). You can add the flag "d3" to see what the output looks like, and whether it will be useful to you; I added it specifically for a visualizer one of my co-workers wrote. The goal is basically to show a tree view of the hits so you can see different organisms with different size circles indicating how close a match they are, and then you can follow it up the taxonomic tree with higher level nodes being the sum or max of their children.

In other words, it's not like it generically supports D3 applications; this was just a one-off where I added a custom flag for a specific javascript tree visualization and I wasn't sure what to call the flag. As for using the output... I will get back to you. I think the visualizer is public.

Edit: And... no. The computers hosting the visualizer have been retired.

ADD REPLY

Login before adding your answer.

Traffic: 3497 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6