Question

Understanding the Principle and Mechanism of K-mer Alignment in Bioinformatics

5 months ago

Dim • 0

Hello community.

I'm currently exploring the concept of k-mer alignment in bioinformatics and came across some information suggesting that the k-mer approach is considered a non-alignment method. Could someone clarify how k-mer alignment actually functions? Specifically, I'm trying to understand the mechanism behind it.

Additionally, I'm confused about the process of hashing in relation to k-mers. Is it the query sequences that are converted into k-mers for alignment against a database, or are the database sequences hashed into k-mers that then align with multiple query sequences? This question came up during a discussion about adjusting the -k parameter, which controls the k-mer length.

Any insights or explanations would be greatly appreciated.

k-mers alignment • 395 views

ADD COMMENT • link updated 5 months ago by GenoMax 148k • written 5 months ago by Dim • 0

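I guess you’re referring to the methods used by kallisto and salmon. They look at both the k-mers in your reference database AND the k-mers in your sequencing reads: the reference is turned into a k-mer lookup table first, and each read is then split into k-mers that are looked up in that table.

Let’s say I have an organism that can only produce the following 4 RNA transcripts: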

Transcript 1: TCGGGC

Transcript 2: AACGG

Transcript 3: CCCAA

Transcript 4: AAAAA

If we choose k=3, you can imagine the following “lookup table” being created to associate each k-mer to transcripts:

TCG: transcript 1

CGG: transcript 1+2

GGG: transcript 1

GGC: transcript 1

AAC: transcript 2

ACG: transcript 2

CCC: transcript 3

CCA: transcript 3

CAA: transcript 3

AAA: transcript 4

This provides a useful lookup dictionary (i.e. a “hash table”).
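To make that concrete, here is a minimal Python sketch of how such a table could be built from the four toy transcripts above. The function name `build_kmer_index` and the dictionary layout are my own choices for illustration; this is not how kallisto or salmon actually store their indices.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        # Slide a window of width k along the transcript.
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return dict(index)


# The four toy transcripts from the example above.
transcripts = {
    "transcript 1": "TCGGGC",
    "transcript 2": "AACGG",
    "transcript 3": "CCCAA",
    "transcript 4": "AAAAA",
}

index = build_kmer_index(transcripts, k=3)
for kmer, names in index.items():
    print(kmer, sorted(names))
```

A Python `dict` is itself a hash table, so looking up a k-mer takes roughly constant time no matter how many k-mers the reference contains.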

Now, my RNAseq reads (read length: 4 bp, for example purposes) come off the machine, and we want to make sense of those reads (where in the organism’s transcriptome did each read come from?). Let’s say one read in my data looks like:

CCAA

The read has the k-mers CCA and CAA. Let’s look them up in that lookup table above (yeah, we have to use the same k=3 for our reads in order to do the “look ups”)… wow, both belong to transcript 3. We can say that the RNAseq read came from transcript 3!

Let’s try another read:

CGGA

Ok, the first k-mer CGG maps to transcripts 1+2. The second k-mer GGA doesn’t exist in that lookup table. We’ll just say that the read could have come from either transcript 1 or transcript 2.

Let’s say we have the read:

TTTT

That read has only one distinct k-mer (TTT). Welp, it doesn’t appear anywhere in the lookup table. We’ll just say that read is unmapped/unaligned.

Et cetera.
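The three reads above can be run through a small Python sketch. One assumption to flag: I combine the k-mer hits by intersecting their transcript sets and simply skipping k-mers that are missing from the table, which reproduces the toy results above but is a simplification of what the real tools do. The names `build_index` and `assign_read` are hypothetical.

```python
def kmers(seq, k):
    """Return all k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(transcripts, k):
    """Map each k-mer to the set of transcript IDs containing it."""
    index = {}
    for tid, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(tid)
    return index


def assign_read(read, index, k):
    """Return the transcripts compatible with a read.

    Assumption: k-mers absent from the index are ignored, and the
    transcript sets of the remaining k-mers are intersected.
    An empty set means the read is unmapped.
    """
    hits = [index[kmer] for kmer in kmers(read, k) if kmer in index]
    if not hits:
        return set()
    return set.intersection(*hits)


K = 3
transcripts = {1: "TCGGGC", 2: "AACGG", 3: "CCCAA", 4: "AAAAA"}
index = build_index(transcripts, K)

for read in ["CCAA", "CGGA", "TTTT"]:
    compatible = assign_read(read, index, K)
    print(read, sorted(compatible) if compatible else "unmapped")
```

Note that the index is built once from the reference and then reused for every read, which is why the same k has to be used on both sides.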

I’m sure others can give you a more technically comprehensive+rigorous answer, so I’m leaving this post as a comment rather than as an answer, but I hope this helps in your understanding!

ADD REPLY • link updated 5 months ago by GenoMax 148k • written 5 months ago by dsull ★ 7.2k