I want to calculate the distances between instances of a k-mer in a long sequence. I wrote the code below, which finds the start position of each occurrence of the k-mer and then calculates the distances between consecutive starts. I am not sure I am doing it right:
from itertools import pairwise
seq = "ACCGGCTTTAACGGCCTACGCGTTTTAAGCCGG"
di = "CG"
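def get_inter_kmer_distance(seq, kmer):
    lk = len(kmer)
    starts = []
    # Slide a window of length lk along seq, one base at a time (overlapping scan).
    for i in range(len(seq) - lk + 1):
        window = seq[i:i+lk]
        if window == kmer:
            starts.append(i)
    #print(list(pairwise(starts)))
    # Distance between each pair of consecutive start positions.
    dist = [j - i for i, j in pairwise(starts)]
    return dist, starts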
get_inter_kmer_distance(seq, di)
([9, 7, 2, 10], [2, 11, 18, 20, 30])
get_inter_kmer_distance(seq, "AC")
([10, 7], [0, 10, 17])
I really have doubts about whether I should allow overlaps (checking every position, as the code does now) or simply divide the sequence into consecutive k-mers (like "AC CG GC TT TA AC GG CC TA CG CG TT TT AA GC CG G"). Any suggestions to correct or improve the code would be appreciated.
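To make the two options concrete, here is a minimal sketch of what I mean. The helper names overlapping_starts, fixed_frame_starts and gaps are made up just for this illustration, and the fixed-frame version assumes the chunks start at position 0:

```python
def overlapping_starts(seq, kmer):
    """Start positions of kmer found by checking every offset (step 1)."""
    k = len(kmer)
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] == kmer]


def fixed_frame_starts(seq, kmer):
    """Start positions of kmer when seq is cut into consecutive,
    non-overlapping chunks of length k, beginning at position 0."""
    k = len(kmer)
    return [i for i in range(0, len(seq) - k + 1, k) if seq[i:i + k] == kmer]


def gaps(starts):
    """Differences between consecutive start positions."""
    return [b - a for a, b in zip(starts, starts[1:])]


seq = "ACCGGCTTTAACGGCCTACGCGTTTTAAGCCGG"

print(overlapping_starts(seq, "CG"))        # [2, 11, 18, 20, 30]
print(gaps(overlapping_starts(seq, "CG")))  # [9, 7, 2, 10]

print(fixed_frame_starts(seq, "CG"))        # [2, 18, 20, 30]
print(gaps(fixed_frame_starts(seq, "CG")))  # [16, 2, 10]
```

With the fixed chunks, the "CG" at position 11 is not counted because it straddles two chunks ("AC" and "GG"), so the two approaches give different distances for the same sequence.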
Thank you for your time.
Paulo
PS - I did ask this on Code Review, but got no help.
For some reason, probably the Python version (itertools.pairwise was only added in Python 3.10), the code only works for me after changing:
from itertools import pairwise to from more_itertools import pairwise.
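One way to avoid editing the import by hand might be a fallback like the sketch below. It assumes the failure is an ImportError on an older Python, and the replacement pairwise is my own copy of the usual tee/zip recipe, not the more_itertools implementation:

```python
try:
    # Available in the standard library from Python 3.10 onwards.
    from itertools import pairwise
except ImportError:
    from itertools import tee

    def pairwise(iterable):
        """Yield consecutive pairs: s -> (s0, s1), (s1, s2), ..."""
        a, b = tee(iterable)
        next(b, None)  # advance the second iterator by one element
        return zip(a, b)


starts = [2, 11, 18, 20, 30]
print([j - i for i, j in pairwise(starts)])  # [9, 7, 2, 10]
```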