Here, try this, this is a full solution. You'll need to have biopython
(Bio
) and pandas
installed via pip
or conda
for this to work. collections
should be available already.
The code:
import pandas as pd
from Bio import SeqIO
from collections import defaultdict
def get_ukmers(seq, k = 30):
kmers = [seq[x:x+k] for x in range(0, len(seq)-k+1, k)]
kout = ";".join(str(kmer) for kmer in kmers)
return(kout)
def extr_nlkmers(inpfa, outfile, k = 30):
tmpdict = defaultdict(list)
with open(inpfa) as fas:
print("Extracting non-overlapping k-mers from file:", str(inpfa))
for record in SeqIO.parse(fas, "fasta"):
print("Opening sequence:", record.name)
tmpdict['header'].append(record.name)
tmpdict['seq'].append(''.join(str(nuc) for nuc in record.seq))
tmpdict['nl_kmers'].append(get_ukmers(record.seq, k))
print("Done handling sequence:", record.name)
print("All done.")
fas.close
df = pd.DataFrame.from_dict(tmpdict)
print("Data for all sequences written to file:", outfile)
df.to_csv(outfile)
return(df)
You can execute it like so:
#Set a path for an input and output file.
#The input file is a FASTA file.
#The output file is a .CSV file
myin = "test.fasta"
myout = "unique_kmers.csv"
#Extracting non-overlapping k-mers and
#writing them to file.
#The data.frame is also retained in the
#environment
df = extr_nlkmers(myin, myout, k = 30)
df
#Output
# Extracting non-overlapping k-mers from file: test.fasta
# Opening sequence: seq0
# Done handling sequence: seq0
# Opening sequence: seq1
# Done handling sequence: seq1
# Opening sequence: seq2
# Done handling sequence: seq2
# Opening sequence: seq3
# Done handling sequence: seq3
# Opening sequence: seq4
# Done handling sequence: seq4
# Opening sequence: seq5
# Done handling sequence: seq5
# Opening sequence: seq6
# Done handling sequence: seq6
# Opening sequence: seq7
# Done handling sequence: seq7
# Opening sequence: seq8
# Done handling sequence: seq8
# Opening sequence: seq9
# Done handling sequence: seq9
# Opening sequence: seq10
# Done handling sequence: seq10
# Opening sequence: seq11
# Done handling sequence: seq11
# Opening sequence: seq12
# Done handling sequence: seq12
# Opening sequence: seq13
# Done handling sequence: seq13
# Opening sequence: seq14
# Done handling sequence: seq14
# Opening sequence: seq15
# Done handling sequence: seq15
# Opening sequence: seq16
# Done handling sequence: seq16
# Opening sequence: seq17
# Done handling sequence: seq17
# Opening sequence: seq18
# Done handling sequence: seq18
# Opening sequence: seq19
# Done handling sequence: seq19
# Opening sequence: seq20
# Done handling sequence: seq20
# Opening sequence: seq21
# Done handling sequence: seq21
# Opening sequence: seq22
# Done handling sequence: seq22
# Opening sequence: seq23
# Done handling sequence: seq23
# Opening sequence: seq24
# Done handling sequence: seq24
# Opening sequence: seq25
# Done handling sequence: seq25
# Opening sequence: seq26
# Done handling sequence: seq26
# Opening sequence: seq27
# Done handling sequence: seq27
# Opening sequence: seq28
# Done handling sequence: seq28
# Opening sequence: seq29
# Done handling sequence: seq29
# Opening sequence: seq30
# Done handling sequence: seq30
# Opening sequence: seq31
# Done handling sequence: seq31
# Opening sequence: seq32
# Done handling sequence: seq32
# Opening sequence: seq33
# Done handling sequence: seq33
# Opening sequence: seq34
# Done handling sequence: seq34
# Opening sequence: seq35
# Done handling sequence: seq35
# All done.
# Data for all sequences written to file: unique_kmers.csv
And you'll also get a CSV
table that looks like this:
# header seq nl_kmers
# 0 seq0 AAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTC... AAGGTTTATACCTTCCCAGGTAACAAACCA;ACCAACTTTCGATCT...
# 1 seq1 TGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTAT... TGTGTGGCTGTCACTCGGCTGCATGCTTAG;TGCACTCACGCAGTA...
# 2 seq2 GAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG... GAGTAACTCGTCTATCTTCTGCAGGCTGCT;TACGGTTTCGTCCGT...
# 3 seq3 CGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCA... CGGGTGTGACCGAAAGGTAAGATGGAGAGC;CTTGTCCCTGGTTTC...
# 4 seq4 TTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTG... TTTACAGGTTCGCGACGTGCTCGTACGTGG;CTTTGGAGACTCCGT...
# 5 seq5 AAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCC... AAGATGGCACTTGTGGCTTAGTAGAAGTTG;AAAAAGGCGTTTTGC...
# 6 seq6 TCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGG... TCGGATGCTCGAACTGCACCTCATGGTCAT;GTTATGGTTGAGCTG...
# 7 seq7 TGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCA... TGGTGAGACACTTGGTGTCCTTGTCCCTCA;TGTGGGCGAAATACC...
# 8 seq8 GTAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTC... GTAATAAAGGAGCTGGTGGCCATAGTTACG;GCGCCGATCTAAAGT...
# 9 seq9 TATGAAGATTTTCAAGAAAACTGGAACACTAAACATAGCAGTGGTG... TATGAAGATTTTCAAGAAAACTGGAACACT;AAACATAGCAGTGGT...
# 10 seq10 GGCATACACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGC... GGCATACACTCGCTATGTCGATAACAACTT;CTGTGGCCCTGATGG...
# 11 seq11 GTGCTGGTAAAGCTTCATGCACTTTGTCCGAACAACTGGACTTTAT... GTGCTGGTAAAGCTTCATGCACTTTGTCCG;AACAACTGGACTTTA...
# 12 seq12 CATGAGCATGAAATTGCTTGGTACACGGAACGTTCTGAAAAGAGCT... CATGAGCATGAAATTGCTTGGTACACGGAA;CGTTCTGAAAAGAGC...
# 13 seq13 AAAGAAATTTGACATCTTCAATGGGGAATGTCCAAATTTTGTATTT... AAAGAAATTTGACATCTTCAATGGGGAATG;TCCAAATTTTGTATT...
# 14 seq14 GGGTTGAAAAGAAAAAGCTTGATGGCTTTATGGGTAGAATTCGATC... GGGTTGAAAAGAAAAAGCTTGATGGCTTTA;TGGGTAGAATTCGAT...
# 15 seq15 CAAATGTGCCTTTCAACTCTCATGAAGTGTGATCATTGTGGTGAAA... CAAATGTGCCTTTCAACTCTCATGAAGTGT;GATCATTGTGGTGAA...
# 16 seq16 TTGCGAATTTTGTGGCACTGAGAATTTGACTAAAGAAGGTGCCACT... TTGCGAATTTTGTGGCACTGAGAATTTGAC;TAAAGAAGGTGCCAC...
# 17 seq17 AAATTTATTGTCCAGCATGTCACAATTCAGAAGTAGGACCTGAGCA... AAATTTATTGTCCAGCATGTCACAATTCAG;AAGTAGGACCTGAGC...
# 18 seq18 AAAACCATTCTTCGTAAGGGTGGTCGCACTATTGCCTTTGGAGGCT... AAAACCATTCTTCGTAAGGGTGGTCGCACT;ATTGCCTTTGGAGGC...
# 19 seq19 TGCCTATTGGGTTCCACGTGCTAGCGCTAACATAGGTTGTAACCAT... TGCCTATTGGGTTCCACGTGCTAGCGCTAA;CATAGGTTGTAACCA...
# 20 seq20 ATGACAACCTTCTTGAAATACTCCAAAAAGAGAAAGTCAACATCAA... ATGACAACCTTCTTGAAATACTCCAAAAAG;AGAAAGTCAACATCA...
# 21 seq21 GCCATTATTTTGGCATCTTTTTCTGCTTCCACAAGTGCTTTTGTGG... GCCATTATTTTGGCATCTTTTTCTGCTTCC;ACAAGTGCTTTTGTG...
# 22 seq22 ACAAATTGTTGAATCCTGTGGTAATTTTAAAGTTACAAAAGGAAAA... ACAAATTGTTGAATCCTGTGGTAATTTTAA;AGTTACAAAAGGAAA...
# 23 seq23 AATCAATACTGAGTCCTCTTTATGCATTTGCATCAGAGGCTGCTCG... AATCAATACTGAGTCCTCTTTATGCATTTG;CATCAGAGGCTGCTC...
# 24 seq24 ACTGCTCAAAATTCTGTGCGTGTTTTACAGAAGGCCGCTATAACAA... ACTGCTCAAAATTCTGTGCGTGTTTTACAG;AAGGCCGCTATAACA...
# 25 seq25 CATTGATGCTATGATGTTCACATCTGATTTGGCTACTAACAATCTA... CATTGATGCTATGATGTTCACATCTGATTT;GGCTACTAACAATCT...
# 26 seq26 AGTTGACTTCGCAGTGGCTAACTAACATCTTTGGCACTGTTTATGA... AGTTGACTTCGCAGTGGCTAACTAACATCT;TTGGCACTGTTTATG...
# 27 seq27 AAGTTTAAGGAAGGTGTAGAGTTTCTTAGAGACGGTTGGGAAATTG... AAGTTTAAGGAAGGTGTAGAGTTTCTTAGA;GACGGTTGGGAAATT...
# 28 seq28 CGGTGGACAAATTGTCACCTGTGCAAAGGAAATTAAGGAGAGTGTT... CGGTGGACAAATTGTCACCTGTGCAAAGGA;AATTAAGGAGAGTGT...
# 29 seq29 CTTTGTGTGCTGACTCTATCATTATTGGTGGAGCTAAACTTAAAGC... CTTTGTGTGCTGACTCTATCATTATTGGTG;GAGCTAAACTTAAAG...
# 30 seq30 AAGGGATTGTACAGAAAGTGTGTTAAATCCAGAGAAGAAACTGGCC... AAGGGATTGTACAGAAAGTGTGTTAAATCC;AGAGAAGAAACTGGC...
# 31 seq31 CTTCTTAGAGGGAGAAACACTTCCCACAGAAGTGTTAACAGAGGAA... CTTCTTAGAGGGAGAAACACTTCCCACAGA;AGTGTTAACAGAGGA...
# 32 seq32 AACAACCTACTAGTGAAGCTGTTGAAGCTCCATTGGTTGGTACACC... AACAACCTACTAGTGAAGCTGTTGAAGCTC;CATTGGTTGGTACAC...
# 33 seq33 AAAGACACAGAAAAGTACTGTGCCCTTGCACCTAATATGATGGTAA... AAAGACACAGAAAAGTACTGTGCCCTTGCA;CCTAATATGATGGTA...
# 34 seq34 AACAAAGGTTACTTTTGGTGATGACACTGTGATAGAAGTGCAAGGT... AACAAAGGTTACTTTTGGTGATGACACTGT;GATAGAAGTGCAAGG...
# 35 seq35 AAAGGATTGATAAAGTACTTAATGAGAAGTGCTCTGCCTATACAGT... AAAGGATTGATAAAGTACTTAATGAGAAGT;GCTCTGCCTATACAG...
And here, if what you're worried about are sets of k-mers like AAGGTTTATACCTTCCCAGGTAACAAACCA;ACCAACTTTCGATCTCTTGTAGATCTGTTC
(from the very first sequence), these aren't overlapping k-mers. It so happens that the parent sequence has a repeat like so:
"AAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATC"
You're confusing overlaps in the k-mer sequence with overlapping k-mers (overlapping k-mers are generated via a sliding window approach).
Andrzej Zielezinski it is still giving overlapping. I want 30-mer non-overlapping.
Can you share your output please? That solution above should absolutely not give you overlapping k-mers.
I am getting overlapping k-mers: AAGGTTTATACCTTCCCAGGTAACAAACCA AGGTTTATACCTTCCCAGGTAACAAACCAA But I dont need AAGGTTTATACCTTCCCAGGTAACAAACCA and after this more 30-mers
As Dunois pointed, you won't get overlapping kmers if you use the suggestion above. Here's an example:
please take this sequence: AAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATC TGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACAC GAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTTTGTC CGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGT TTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACATCTTA AAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGT TCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTCGTAG TGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAGAACG GTAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTGATCCT TATGAAGATTTTCAAGAAAACTGGAACACTAAACATAGCAGTGGTGTTACCCGTGAACTCATGCGTGAGCTTAACGGAGG GGCATACACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTTCTAGCAC GTGCTGGTAAAGCTTCATGCACTTTGTCCGAACAACTGGACTTTATTGACACTAAGAGGGGTGTATACTGCTGCCGTGAA CATGAGCATGAAATTGCTTGGTACACGGAACGTTCTGAAAAGAGCTATGAATTGCAGACACCTTTTGAAATTAAATTGGC AAAGAAATTTGACATCTTCAATGGGGAATGTCCAAATTTTGTATTTCCCTTAAATTCCATAATCAAGACTATTCAACCAA GGGTTGAAAAGAAAAAGCTTGATGGCTTTATGGGTAGAATTCGATCTGTCTATCCAGTTGCGTCACCAAATGAATGCAAC CAAATGTGCCTTTCAACTCTCATGAAGTGTGATCATTGTGGTGAAACTTCATGGCAGACGGGCGATTTTGTTAAAGCCAC TTGCGAATTTTGTGGCACTGAGAATTTGACTAAAGAAGGTGCCACTACTTGTGGTTACTTACCCCAAAATGCTGTTGTTA AAATTTATTGTCCAGCATGTCACAATTCAGAAGTAGGACCTGAGCATAGTCTTGCCGAATACCATAATGAATCTGGCTTG AAAACCATTCTTCGTAAGGGTGGTCGCACTATTGCCTTTGGAGGCTGTGTGTTCTCTTATGTTGGTTGCCATAACAAGTG TGCCTATTGGGTTCCACGTGCTAGCGCTAACATAGGTTGTAACCATACAGGTGTTGTTGGAGAAGGTTCCGAAGGTCTTA ATGACAACCTTCTTGAAATACTCCAAAAAGAGAAAGTCAACATCAATATTGTTGGTGACTTTAAACTTAATGAAGAGATC GCCATTATTTTGGCATCTTTTTCTGCTTCCACAAGTGCTTTTGTGGAAACTGTGAAAGGTTTGGATTATAAAGCATTCAA ACAAATTGTTGAATCCTGTGGTAATTTTAAAGTTACAAAAGGAAAAGCTAAAAAAGGTGCCTGGAATATTGGTGAACAGA AATCAATACTGAGTCCTCTTTATGCATTTGCATCAGAGGCTGCTCGTGTTGTACGATCAATTTTCTCCCGCACTCTTGAA ACTGCTCAAAATTCTGTGCGTGTTTTACAGAAGGCCGCTATAACAATACTAGATGGAATTTCACAGTATTCACTGAGACT CATTGATGCTATGATGTTCACATCTGATTTGGCTACTAACAATCTAGTTGTAATGGCCTACATTACAGGTGGTGTTGTTC AGTTGACTTCGCAGTGGCTAACTAACATCTTTGGCACTGTTTATGAAAAACTCAAACCCGTCCTTGATTGGCTTGAAGAG AAGTTTAAGGAAGGTGTAGAGTTTCTTAGAGACGGTTGGGAAATTGTTAAATTTATCTCAACCTGTGCTTGTGAAATTGT CGGTGGACAAATTGTCACCTGTGCAAAGGAAATTAAGGAGAGTGTTCAGACATTCTTTAAGCTTGTAAATAAATTTTTGG CTTTGTGTGCTGACTCTATCATTATTGGTGGAGCTAAACTTAAAGCCTTGAATTTAGGTGAAACATTTGTCACGCACTCA AAGGGATTGTACAGAAAGTGTGTTAAATCCAGAGAAGAAACTGGCCTACTCATGCCTCTAAAAGCCCCAAAAGAAATTAT CTTCTTAGAGGGAGAAACACTTCCCACAGAAGTGTTAACAGAGGAAGTTGTCTTGAAAACTGGTGATTTACAACCATTAG AACAACCTACTAGTGAAGCTGTTGAAGCTCCATTGGTTGGTACACCAGTTTGTATTAACGGGCTTATGTTGCTCGAAATC AAAGACACAGAAAAGTACTGTGCCCTTGCACCTAATATGATGGTAACAAACAATACCTTCACACTCAAAGGCGGTGCACC AACAAAGGTTACTTTTGGTGATGACACTGTGATAGAAGTGCAAGGTTACAAGAGTGTGAATATCACTTTTGAACTTGATG AAAGGATTGATAAAGTACTTAATGAGAAGTGCTCTGCCTATACAGTTGAACTCGGTACAGAAGTAAATGAGTTCGCCTGT
smrutimayipanda :
Please use ADD REPLY when responding to existing posts. ADD ANSWERS is meant for new answers to the original question.
Do not post follow-up/additional material as new answers.
above is the sequence in text. Please try with this
Why does your sequence have whitespaces in it? Regardless, running this through
Would produce this:
Where exactly are the overlapping
k-mers
in there?Dunois you are running only one command on these sequences. Can you please store all these sequences and then run this command? these is a big fasta file, so I have to run on that
Traceback (most recent call last): File "test.py", line 30, in <module> kmer = [seq[x:x+k] for x in range(0, len(seq)-k+1, k)] TypeError: object of type 'float' has no len()
I am getting this error when I used -k+1 in len(seq). Thats why I am saying run the full script
Why are you using
pandas
to read a simple text file? This is likely where your issues are coming in from - not the kmer calculation.then how can i read the fasta file? can you please tell me?
You should use
biopython
, but you can use any standard line-by-line python file readon method as long as you handle the headers etc properly.