Counting distance between kmer instances in a sequence
16 months ago
schlogl ▴ 160

I want to calculate the distance between instances of a k-mer in a long sequence. I wrote the code below, which finds the start position of every occurrence of the k-mer and then computes the distance between consecutive starts. I am not sure I am doing it right:

from itertools import pairwise

seq = "ACCGGCTTTAACGGCCTACGCGTTTTAAGCCGG"
di = "CG"

def get_inter_kmer_distance(seq, kmer):
    lk = len(kmer)
    starts = []
    for i, _ in enumerate(seq):
        di = seq[i:i+lk]
        if di == kmer:
            starts.append(i)
    #print(list(pairwise(starts)))
    dist = [j - i for i, j in pairwise(starts)]
    return dist, starts

get_inter_kmer_distance(seq, di)
([9, 7, 2, 10], [2, 11, 18, 20, 30])

get_inter_kmer_distance(seq, "AC")
([10, 7], [0, 10, 17])

I am not sure whether I should allow overlapping windows, as the code above does, or simply split the sequence into consecutive k-mers (like "AC CG GC TT TA AC GG CC TA CG CG TT TT AA GC CG G"). Any suggestions to correct or improve the code would be appreciated.

Thank you for your time.

Paulo

PS: I also asked this on Code Review, but got no help there.

kmers python inter-distance R • 1.1k views
16 months ago
DareDevil ★ 4.3k

Your code looks mostly correct for calculating the distance between k-mer instances in a sequence. However, there are a couple of minor improvements that can be made.

  1. Use of pairwise from itertools: Your code imports pairwise from itertools and already uses it to compute the distances between consecutive elements of the starts list, so nothing needs to change here:
from itertools import pairwise

dist = [j - i for i, j in pairwise(starts)]

This eliminates the need for an additional loop to calculate the distances manually.

  2. Variable name shadowing: The name di is used both as a module-level variable (the k-mer you pass to the function) and as a local variable inside the loop. The assignment inside the function only creates a local name, so the outer di is not changed and the results are unaffected, but reusing the name is confusing. It's better to rename the variable inside the loop to something different, such as subseq or current_kmer.

Taking these improvements into account, here's the updated version of your code:

from itertools import pairwise

seq = "ACCGGCTTTAACGGCCTACGCGTTTTAAGCCGG"
di = "CG"

def get_inter_kmer_distance(seq, kmer):
    lk = len(kmer)
    starts = []
    for i, _ in enumerate(seq):
        subseq = seq[i:i+lk]
        if subseq == kmer:
            starts.append(i)
    dist = [j - i for i, j in pairwise(starts)]
    return dist, starts

distances, starts = get_inter_kmer_distance(seq, di)
print("Distances between k-mer instances:", distances)
print("Start positions of k-mer instances:", starts)

This updated code should give you the correct distances between k-mer instances and the start positions of each k-mer instance in the sequence.
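On the question of overlapping windows versus splitting the sequence into consecutive chunks: the function above slides the window one base at a time, so it finds the k-mer at any offset. Splitting into fixed chunks only checks positions that are multiples of k and can miss occurrences. The sketch below compares the two readings; the helper names sliding_starts and chunked_starts are hypothetical and only for illustration.

```python
def sliding_starts(seq, kmer):
    """Start of every window equal to kmer; the window moves one base at a time."""
    k = len(kmer)
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] == kmer]


def chunked_starts(seq, kmer):
    """Start of every matching chunk when seq is cut into consecutive k-sized pieces."""
    k = len(kmer)
    return [i for i in range(0, len(seq) - k + 1, k) if seq[i:i + k] == kmer]


seq = "ACCGGCTTTAACGGCCTACGCGTTTTAAGCCGG"

print(sliding_starts(seq, "CG"))  # [2, 11, 18, 20, 30]
print(chunked_starts(seq, "CG"))  # [2, 18, 20, 30]
```

With this sequence, the chunked version misses the "CG" at position 11 because it straddles two chunks ("AC" at 10 and "GG" at 12). If the goal is the distance between all occurrences of the k-mer, the sliding window is probably what you want.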


For some reason (possibly the Python version), the code only works for me after changing

from itertools import pairwise to from more_itertools import pairwise.
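A likely explanation is the Python version: itertools.pairwise was added in Python 3.10, so the import fails on older interpreters. As a sketch that avoids the extra dependency, the import can fall back to a small local definition; the fallback below follows the tee-based recipe from the itertools documentation.

```python
try:
    from itertools import pairwise  # standard library, Python 3.10+
except ImportError:
    # Fallback for older interpreters, based on itertools.tee.
    from itertools import tee

    def pairwise(iterable):
        a, b = tee(iterable)
        next(b, None)  # advance the second iterator by one element
        return zip(a, b)

starts = [0, 10, 17]  # start positions of "AC" from the question
gaps = [j - i for i, j in pairwise(starts)]
print(gaps)  # [10, 7]
```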
