Hi all, I want to split FASTA sequences into non-overlapping k-mers using Python. I wrote some code, but it does not produce non-overlapping k-mers. I am using k=30, i.e. 30-mers. Please help me with this.
import os
import pandas as pd
import numpy as np
from motif_utils import seq2kmer

data = pd.read_csv(r'/home/smrutip/DNABERT/examples/sample_data/pre/datasets.sequences.fasta')
for indexs in data.index:
    # print(data.loc[indexs].values[0])
    seq = data.loc[indexs].values[0]
    kerm = seq2kmer(seq, 30)
    # print(type(kerm))
    # print(kerm)
    with open('dataset.txt', 'a') as f:
        f.write(kerm + '\n')

def seq2kmer(seq, k):
    """
    Convert original sequence to kmers
    Arguments:
    seq -- str, original sequence.
    k -- int, kmer of length k specified.
    Returns:
    kmers -- str, kmers separated by space
    """
    kmer = [seq[x:x+k] for x in range(len(seq)+30-k)]
    kmers = " ".join(kmer)
    return kmers

Andrzej Zielezinski, it is still giving overlapping k-mers. I want non-overlapping 30-mers.
Can you share your output please? That solution above should absolutely not give you overlapping k-mers.
I am getting overlapping k-mers: AAGGTTTATACCTTCCCAGGTAACAAACCA AGGTTTATACCTTCCCAGGTAACAAACCAA But I don't need AAGGTTTATACCTTCCCAGGTAACAAACCA and after this more 30-mers.
As Dunois pointed out, you won't get overlapping k-mers if you use the suggestion above. Here's an example:
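For reference, a minimal sketch of the non-overlapping version, assuming the input is a plain string with no whitespace in it. The name `seq2kmer_nonoverlap` is made up for this example. The only real change from the function in the question is that `range` steps by `k`, so each window starts where the previous one ended. A trailing fragment shorter than `k` is dropped.

```python
def seq2kmer_nonoverlap(seq, k):
    """Split seq into consecutive, non-overlapping k-mers joined by spaces."""
    # Step by k so that windows do not overlap; stopping at len(seq) - k + 1
    # drops an incomplete fragment at the end of the sequence.
    kmers = [seq[x:x + k] for x in range(0, len(seq) - k + 1, k)]
    return " ".join(kmers)


if __name__ == "__main__":
    # Toy input with k=4 so the result is easy to check by eye.
    print(seq2kmer_nonoverlap("AAGGTTTATACC", 4))  # AAGG TTTA TACC
```

With k=30 the same call would return the first 30 bases, then bases 31 to 60, and so on.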
please take this sequence: AAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATC TGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACAC GAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTTTGTC CGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGT TTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACATCTTA AAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGT TCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTCGTAG TGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAGAACG GTAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTGATCCT TATGAAGATTTTCAAGAAAACTGGAACACTAAACATAGCAGTGGTGTTACCCGTGAACTCATGCGTGAGCTTAACGGAGG GGCATACACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTTCTAGCAC GTGCTGGTAAAGCTTCATGCACTTTGTCCGAACAACTGGACTTTATTGACACTAAGAGGGGTGTATACTGCTGCCGTGAA CATGAGCATGAAATTGCTTGGTACACGGAACGTTCTGAAAAGAGCTATGAATTGCAGACACCTTTTGAAATTAAATTGGC AAAGAAATTTGACATCTTCAATGGGGAATGTCCAAATTTTGTATTTCCCTTAAATTCCATAATCAAGACTATTCAACCAA GGGTTGAAAAGAAAAAGCTTGATGGCTTTATGGGTAGAATTCGATCTGTCTATCCAGTTGCGTCACCAAATGAATGCAAC CAAATGTGCCTTTCAACTCTCATGAAGTGTGATCATTGTGGTGAAACTTCATGGCAGACGGGCGATTTTGTTAAAGCCAC TTGCGAATTTTGTGGCACTGAGAATTTGACTAAAGAAGGTGCCACTACTTGTGGTTACTTACCCCAAAATGCTGTTGTTA AAATTTATTGTCCAGCATGTCACAATTCAGAAGTAGGACCTGAGCATAGTCTTGCCGAATACCATAATGAATCTGGCTTG AAAACCATTCTTCGTAAGGGTGGTCGCACTATTGCCTTTGGAGGCTGTGTGTTCTCTTATGTTGGTTGCCATAACAAGTG TGCCTATTGGGTTCCACGTGCTAGCGCTAACATAGGTTGTAACCATACAGGTGTTGTTGGAGAAGGTTCCGAAGGTCTTA ATGACAACCTTCTTGAAATACTCCAAAAAGAGAAAGTCAACATCAATATTGTTGGTGACTTTAAACTTAATGAAGAGATC GCCATTATTTTGGCATCTTTTTCTGCTTCCACAAGTGCTTTTGTGGAAACTGTGAAAGGTTTGGATTATAAAGCATTCAA ACAAATTGTTGAATCCTGTGGTAATTTTAAAGTTACAAAAGGAAAAGCTAAAAAAGGTGCCTGGAATATTGGTGAACAGA AATCAATACTGAGTCCTCTTTATGCATTTGCATCAGAGGCTGCTCGTGTTGTACGATCAATTTTCTCCCGCACTCTTGAA 
ACTGCTCAAAATTCTGTGCGTGTTTTACAGAAGGCCGCTATAACAATACTAGATGGAATTTCACAGTATTCACTGAGACT CATTGATGCTATGATGTTCACATCTGATTTGGCTACTAACAATCTAGTTGTAATGGCCTACATTACAGGTGGTGTTGTTC AGTTGACTTCGCAGTGGCTAACTAACATCTTTGGCACTGTTTATGAAAAACTCAAACCCGTCCTTGATTGGCTTGAAGAG AAGTTTAAGGAAGGTGTAGAGTTTCTTAGAGACGGTTGGGAAATTGTTAAATTTATCTCAACCTGTGCTTGTGAAATTGT CGGTGGACAAATTGTCACCTGTGCAAAGGAAATTAAGGAGAGTGTTCAGACATTCTTTAAGCTTGTAAATAAATTTTTGG CTTTGTGTGCTGACTCTATCATTATTGGTGGAGCTAAACTTAAAGCCTTGAATTTAGGTGAAACATTTGTCACGCACTCA AAGGGATTGTACAGAAAGTGTGTTAAATCCAGAGAAGAAACTGGCCTACTCATGCCTCTAAAAGCCCCAAAAGAAATTAT CTTCTTAGAGGGAGAAACACTTCCCACAGAAGTGTTAACAGAGGAAGTTGTCTTGAAAACTGGTGATTTACAACCATTAG AACAACCTACTAGTGAAGCTGTTGAAGCTCCATTGGTTGGTACACCAGTTTGTATTAACGGGCTTATGTTGCTCGAAATC AAAGACACAGAAAAGTACTGTGCCCTTGCACCTAATATGATGGTAACAAACAATACCTTCACACTCAAAGGCGGTGCACC AACAAAGGTTACTTTTGGTGATGACACTGTGATAGAAGTGCAAGGTTACAAGAGTGTGAATATCACTTTTGAACTTGATG AAAGGATTGATAAAGTACTTAATGAGAAGTGCTCTGCCTATACAGTTGAACTCGGTACAGAAGTAAATGAGTTCGCCTGT
smrutimayipanda: Please use ADD REPLY when responding to existing posts. ADD ANSWERS is meant for new answers to the original question. Do not post follow-up/additional material as new answers.
The sequence is given above as text. Please try with it.
Why does your sequence have whitespace in it? Regardless, running this through
would produce this:
Where exactly are the overlapping k-mers in there?
Dunois, you are running only one command on these sequences. Can you please store all of these sequences and then run the command? This is a big FASTA file, so I have to run it on that.
Traceback (most recent call last):
  File "test.py", line 30, in <module>
    kmer = [seq[x:x+k] for x in range(0, len(seq)-k+1, k)]
TypeError: object of type 'float' has no len()
I am getting this error when I use -k+1 with len(seq). That's why I am asking you to run the full script.
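The traceback says that `seq` was a float, not a string, when `len(seq)` was called. A small sketch of how that can happen: pandas represents missing values as NaN, which is a Python float. Whether that is what happened with this particular file is an assumption, and `kmers_or_none` is a hypothetical helper that only illustrates a guard.

```python
def kmers_or_none(seq, k):
    """Return non-overlapping k-mers, or None if seq is not a string."""
    if not isinstance(seq, str):
        # For example float('nan'), which is how pandas marks a missing value.
        return None
    return " ".join(seq[x:x + k] for x in range(0, len(seq) - k + 1, k))


if __name__ == "__main__":
    try:
        len(float("nan"))
    except TypeError as err:
        print(err)  # object of type 'float' has no len()
    print(kmers_or_none(float("nan"), 30))  # None
    print(kmers_or_none("ACGTACGT", 4))     # ACGT ACGT
```

The guard only hides the symptom. If non-string values show up, the way the file is being read is the thing to check.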
Why are you using pandas to read a simple text file? This is likely where your issues are coming from, not the k-mer calculation.
Then how can I read the FASTA file? Can you please tell me?
You should use biopython, but you can use any standard line-by-line Python file-reading method as long as you handle the headers etc. properly.
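A sketch of the plain-Python route, under a few assumptions: header lines start with `>`, a record's sequence may be wrapped over several lines, and any whitespace inside a sequence should be discarded. `read_fasta`, `non_overlapping_kmers` and `write_kmers` are names made up for this example.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                # Drop any whitespace inside the sequence line.
                chunks.append("".join(line.split()))
    if header is not None:
        yield header, "".join(chunks)


def non_overlapping_kmers(seq, k):
    """Return consecutive k-mers; a shorter trailing fragment is dropped."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def write_kmers(fasta_path, out_path, k=30):
    """Write one line of space-separated k-mers per FASTA record."""
    with open(out_path, "w") as out:
        for _header, seq in read_fasta(fasta_path):
            out.write(" ".join(non_overlapping_kmers(seq, k)) + "\n")
```

For the file in the question this would be called as `write_kmers("datasets.sequences.fasta", "dataset.txt", k=30)`. The output file is opened once in write mode instead of being reopened in append mode for every sequence, so rerunning the script does not keep adding to old output.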