If I am not mistaken, your code just tries to find a list of unique k-mers. But we need to find unique k-mers for each cluster. By unique k-mers I mean k-mers that occur in every (or almost every) sequence of a given cluster but are absent (or almost absent) from the sequences of another cluster.
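To make that definition concrete, here is a minimal sketch of one possible reading of it. The function names, the toy sequences, and the thresholds `min_in` and `max_out` are all hypothetical; "almost each" and "almost absent" are not quantified in this thread, so 0.9 and 0.1 are placeholders. It assumes every cluster is a non-empty list of plain strings and ignores reverse complements.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def presence_fraction(cluster, k):
    """Map each k-mer to the fraction of sequences in the cluster containing it."""
    counts = {}
    for seq in cluster:
        # Using a set per sequence means a k-mer counts once per sequence,
        # no matter how often it repeats inside that sequence.
        for kmer in kmers(seq, k):
            counts[kmer] = counts.get(kmer, 0) + 1
    return {kmer: n / len(cluster) for kmer, n in counts.items()}


def cluster_specific(target, other, k, min_in=0.9, max_out=0.1):
    """K-mers present in >= min_in of target and <= max_out of other (assumed thresholds)."""
    inside = presence_fraction(target, k)
    outside = presence_fraction(other, k)
    return sorted(
        kmer for kmer, frac in inside.items()
        if frac >= min_in and outside.get(kmer, 0.0) <= max_out
    )


# Toy data, invented for illustration only.
cluster_a = ["ACGTAC", "ACGTTT", "GACGTA"]
cluster_b = ["TTTTGG", "CTTTGG"]
print(cluster_specific(cluster_a, cluster_b, k=4))  # ['ACGT']
```

For real data you would use k=23 and read the sequences of each cluster from your FASTA files; with more than two clusters, `other` could be the pooled sequences of all remaining clusters.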
Then you could rewrite it so that the detected kmers are stored as keys in a dictionary, with the value incremented each time a kmer occurs. You could then take the kmers that occurred only once and compare them between your clusters.
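A rough sketch of that dictionary idea, using `collections.Counter` from the standard library. The function name `count_kmers` and the toy sequence are made up for illustration; k is set to 4 here only to keep the example readable.

```python
from collections import Counter


def count_kmers(sequence, k=23):
    """Count how many times each k-mer occurs in a sequence."""
    counts = Counter()
    # len(sequence) - k + 1 start positions, so the last k-mer is included.
    for i in range(len(sequence) - k + 1):
        counts[sequence[i:i + k]] += 1
    return counts


counts = count_kmers("ACGTACGTTT", k=4)
# K-mers seen exactly once in this sequence.
singletons = sorted(kmer for kmer, n in counts.items() if n == 1)
print(counts["ACGT"])  # 2
print(singletons)      # ['CGTA', 'CGTT', 'GTAC', 'GTTT', 'TACG']
```

Dictionary lookups are also much faster than testing `kmer in some_list`, which matters once the sequences are genome-sized.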
Maybe there is a tool for this, but I'm not aware of any.
When I tried this:

    k = 0
    unique = []
    genome_length = len(sequence) - 23
    while k < genome_length:
        kmer = sequence[k:k+23]
        if kmer in unique:
            k += 1
        else:
            unique.append(kmer)
            k += 1
    print(unique)

I got the list of kmers. Then when I applied this:

    def get_unique(in_list):
        # declare an empty list
        unq_list = []
        # iterate over the list
        for x in in_list:
            # if x is not yet in unq_list, add it
            if x not in unq_list:
                unq_list.append(x)
        # print the list
        for x in unq_list:
            print(x)

    my_list = unique
    print("The unique values in the list {0} are".format(my_list))
    get_unique(my_list)

I just got a similar list: The unique values in the list ['TAGCAACCCTAGCCTCCGGCTAA', 'AGCAACCCTAGCCTCCGGCTAAG', 'GCAACCCTAGCCTCCGGCTAAGC', 'CAACCCTAGCCTCCGGCTAAGCT', 'AACCCTAGCCTCCGGCTAAGCTT', 'ACCCTAGCCTCCGGCTAAGCTTC', 'CCCTAGCCTCCGGCTAAGCTTCC', 'CCTAGCCTCCGGCTAAGCTTCCT', 'CTAGCCTCCGGCTAAGCTTCCTC', 'TAGCCTCCGGCTAAGCTTCCTCC', 'AGCCTCCGGCTAAGCTTCCTCCT', 'GCCTCCGGCTAAGCTTCCTCCTC', 'CCTCCGGCTAAGCTTCCTCCTCG', 'CTCCGGCTAAGCTTCCTCCTCGG', 'TCCGGCTAAGCTTCCTCCTCGGC', 'CCGGCTAAGCTTCCTCCTCGGCG', 'CGGCTAAGCTTCCTCCTCGGCGT', 'GGCTAAGCTTCCTCCTCGGCGTG', 'GCTAAGCTTCCTCCTCGGCGTGT', 'CTAAGCTTCCTCCTCGGCGTGTC', 'TAAGCTTCCTCCTCGGCGTGTCT', 'AAGCTTCCTCCTCGGCGTGTCTA', 'AGCTTCCTCCTCGGCGTGTCTAA', 'GCTTCCTCCTCGGCGTGTCTAAA', 'CTTCCTCCTCGGCGTGTCTAAAC', 'TTCCTCCTCGGCGTGTCTAAACC', 'TCCTCCTCGGCGTGTCTAAACCC', 'CCTCCTCGGCGTGTCTAAACCCT', 'CTCCTCGGCGTGTCTAAACCCTA', 'TCCTCGGCGTGTCTAAACCCTAG', 'CCTCGGCGTGTCTAAACCCTAGA', 'CTCGGCGTGTCTAAACCCTAGAT', 'TCGGCGTGTCTAAACCCTAGATC', 'CGGCGTGTCTAAACCCTAGATCG', 'GGCGTGTCTAAACCCTAGATCGT', 'GCGTGTCTAAACCCTAGATCGTC', 'CGTGTCTAAACCCTAGATCGTCG', 'GTGTCTAAACCCTAGATCGTCGA', 'TGTCTAAACCCTAGATCGTCGAG', 'GTCTAAACCCTAGATCGTCGAGG', 'TCTAAACCCTAGATCGTCGAGGA', 'CTAAACCCTAGATCGTCGAGGAA', 'TAAACCCTAGATCGTCGAGGAAC', 'AAACCCTAGATCGTCGAGGAACT', 'AACCCTAGATCGTCGAGGAACTC', 'ACCCTAGATCGTCGAGGAACTCT', 'CCCTAGATCGTCGAGGAACTCTC', 'CCTAGATCGTCGAGGAACTCTCT', 'CTAGATCGTCGAGGAACTCTCTC', .....
What did you try so far?