Question

K-mer sequencing vs sequence-alignment

1

Entering edit mode

3.9 years ago

GoldenRetriever ▴ 40

I am a completely rookie on Bioinformatics, so please bear with me and use simple language (I am a computer scientist) :-)

How can we use k-mers to find out if a gene is similar to our query string?

For example: We have a reference gene r = ACAAGTC, and a query string q = CATGT. For sequence alignment we could get the two possible solutions: (see photo on link)

enter image description here

But if we use k-mer with k=2 (reason for 2 is that I tried using k=3 and got 1 match) we could split q into the set = {CA, AT, TG, GT} we can get 2 matches.

I am confused on how to use k-mer for queries. I am guessing that it would also need to know where in the reference string it is found, since order matter. But most importantly, Why can we use K-mers? That's maybe my biggest question mark.

short-reads next-gen alignment k-mer • 5.5k views

ADD COMMENT • link updated 21 months ago by Ram 44k • written 3.9 years ago by GoldenRetriever ▴ 40

0

Entering edit mode

If you intend to identify genes then using this toy example is inappropriate. One would need to use much longer k-mers. This paper may help provide some background reference materials. One more tool.

ADD REPLY • link 3.9 years ago by GenoMax 147k

0

Entering edit mode

Separate to my answer below, I would contend that the right hand alignment is all that realistic (no matter the method). Unless you were artificially penalising gaps very low (which is the opposite of the biological reality) I'd be very suprised if you would come across 2 offset gaps like that.

ADD REPLY • link 3.9 years ago by Joe 21k

0

Entering edit mode

21 months ago

mattmoore_91 • 0

Check out KmerAperture which deals with this question directly

https://github.com/moorembioinfo/KmerAperture

ADD COMMENT • link 21 months ago by mattmoore_91 • 0

0

Entering edit mode

Please create a single new tool tagged post for your software instead of posting it in multiple old threads. Check the tool tag for post examples.

ADD REPLY • link 21 months ago by GenoMax 147k

score 6 · Accepted Answer · 2021-01-18

6

Entering edit mode

3.9 years ago

Joe 21k

You wouldn't generally use kmers this short as the odds of finding spurious matches becomes massive, but I accept that this is a toy example.

To my knowledge, we don't really approach a sequence alignment with kmers, though you might use them to seed the alignment (in a local alignment, small subregions of alignment are found and then slowly expanded).

You can however compute useful statistics using kmers about how similar 2 sequences are when they are inefficient to align (e.g. a whole genome). It's not uncommon to look at sequence 'distances' in this way. You might find this code interesting: https://github.com/jrjhealey/bioinfo-tools/blob/master/StringComparisons.py#L158-L181

Those are a few of the most common kmer string/sequence comparison methods implemented in python.

You can use kmers to get a rough approximation of 2 large sequences in a very computationally efficient manner, where size or number of sequences becomes prohibitive, with the idea being that the more similar 2 sequences are, the more likely you are to randomly select the same kmer from each of them and the more frequently they will probably occur.

Not sure if that answers your question or not?

ADD COMMENT • link 2.8 years ago by Joe 21k

1

Entering edit mode

Sorry for the toy example, but that was the best I could come up with due to my limited knowledge on the subject. I am interested in knowing how Kingsford in his paper "Fast search of thousands of short-read sequencing experiments" uses k-mers as a substitution for fx. BLAST. He makes use of Bloom Filters to query a sequence that is divided into k-mers. If the match of k-mers is above a threshold, then we would say that our query has a similarity.

I don't understand how it's possible to use k-mers instead of alignment tools?

ADD REPLY • link 3.9 years ago by GoldenRetriever ▴ 40

0

Entering edit mode

I don't know/haven't read the paper, but my immediate guess would be that they are doing something as I described above. That is, they figure out the 'kmer library' from a sequence (or a subset thereof) and then look for these kmers within the query sequence. Because the k is small, you will expect at least some exact matches, and the more matches, and the longer the k, the more justified in saying "these 2 sequences are similar" you are.

Searching for exact matches can be done with simple string manipulations and is pretty efficient compared to alignment, so this means you then don't have to do any aligning since you're only looking direct substring matches.

ADD REPLY • link 3.9 years ago by Joe 21k

0

Entering edit mode

Alright, thanks.

One last thing: Do you where I (for the layman) can read about k-mer sequencing and its advantages over sequence alignments?

ADD REPLY • link 3.9 years ago by GoldenRetriever ▴ 40

0

Entering edit mode

Guess you have not looked at links for the two papers I posted in my comment above.

ADD REPLY • link 3.9 years ago by GenoMax 147k

0

Entering edit mode

I have skimmed it. Thanks. But they are trying to improve technologies I have no clue about. I am looking for more basic stuff that can answer me why we even can use k-mer sequencing for determining whether two strings are similar or not. Fx. What would happen if we did match many k-mers but they were all scattered, i.e. one k-mer sequence here, and one k-mer sequence there, you know the song..

ADD REPLY • link 3.9 years ago by GoldenRetriever ▴ 40

2

Entering edit mode

What would happen if we did match many k-mers but they were all scattered, i.e. one k-mer sequence here, and one k-mer sequence there

In case it wasn't obvious in my answer, this is exactly what we do do already. Don't fall in to the trap of thinking kmer methods imply any kind of 'contiguity' of sequence. This is why they are good for really cumbersome sequences.

Imagine it like this:

I have a book that you need to identify, but you can only identify it by pulling random clippings out of a hat.

You'll probably pull out a lot of "the"s and "and"s, but if my mystery book is The Lord of the Rings, and you keep sampling, you may eventually pull out "Frodo" or "ring". The longer the word (kmer) the more distinctive it is going to be (e.g. "ring" vs "Middle-earth")

Now, lets say you pull enough out that you feel reasonably confident that the book is LotR, but you can't say for certain because perhaps it's The Hobbit. This is what these methods effectively are doing, you're subsampling enough bits from all over the genome (book) to be reasonably sure its closer to A (LotR) than B (The Hobbit). The trick is sampling enough unique words (kmers) to be able to distinguish down to an appropriate level of resolution. Perhaps you only need to know the genus ("Tolkien books"), but maybe you need to know the species (hobbit vs LotR).

If you pick 10 words/kmers (especially if they're short), you are going to struggle to tell LotR & the Hobbit apart from e.g. The Curious Incident of the Dog in the Nighttime. The more you sample, the more sure you'll be.

ADD REPLY • link 3.9 years ago by Joe 21k

1

Entering edit mode

This is probably the best answer I have seen so far, not to mention the funniest :P Thanks. I think that answered my question.

ADD REPLY • link 3.9 years ago by GoldenRetriever ▴ 40

1

Entering edit mode

No idea about a definitive resource, but a cursory google turns up things like:

https://datasciencehongkong.files.wordpress.com/2018/02/dna2vec.pdf

https://academic.oup.com/gigascience/article/8/2/giy165/5266304

https://angus.readthedocs.io/en/2017/kmers-and-sourmash.html

If you look up "kmer distances", "kmer cosine similarity" and so on you'll find the above.

In particular you might want to look in to mash distances as per the last link which are a similar kind of 'abstract similarity'.

Unfortunately you're not going to escape the bio bit of bioinformatics when reading these resources.

ADD REPLY • link 3.9 years ago by Joe 21k

1

Entering edit mode

Introduction sections for the papers have cross-references that you will likely find useful. hash (LINK) describes the original technique for large scale-sequence comparisons.

we even can use k-mer sequencing for determining whether two strings are similar or not.

Not sure what you exactly mean by that. DNA is always sequenced as a single k-mer of a certain length (depending on technology). Then we use various alignment techniques to identify similarities in pre-existing sequences or databases.

ADD REPLY • link 3.9 years ago by GenoMax 147k

1

Entering edit mode

The most difficult thing for me is to actually phrase my questions due to my very little knowledge. I think I got the point now from Joe. I'll have a look at the paper you linked to also. Thanks.

ADD REPLY • link 3.9 years ago by GoldenRetriever ▴ 40