Splitting FASTA sequences into non-overlapping k-mers using Python
3
0
2.1 years ago

Hi all, I want to split FASTA sequences into non-overlapping k-mers using Python. I wrote some code, but the k-mers I get are overlapping. I am using k=30, i.e. 30-mers. Please help me with this.

import os
import pandas as pd
import numpy as np
from motif_utils import seq2kmer


data=pd.read_csv(r'/home/smrutip/DNABERT/examples/sample_data/pre/datasets.sequences.fasta')
for indexs in data.index:
    # print(data.loc[indexs].values[0])
    seq = data.loc[indexs].values[0]
    kerm = seq2kmer(seq, 30)
    # print(type(kerm))
    # print(kerm)
    with open('dataset.txt', 'a') as f:
        f.write(kerm + '\n')

def seq2kmer(seq, k):
    """
    Convert original sequence to kmers

    Arguments:
    seq -- str, original sequence.
    k -- int, kmer of length k specified.

    Returns:
    kmers -- str, kmers separated by space

    """
    kmer = [seq[x:x+k] for x in range(len(seq)+30-k)]
    kmers = " ".join(kmer)
    return kmers
machine learning deep • 3.7k views
ADD COMMENT
3
2.1 years ago

To get non-overlapping kmers, you should set the step (in the range function) to k.

kmer = [seq[x:x+k] for x in range(0, len(seq)-k+1, k)]
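
A minimal sketch of the difference the step makes, assuming the sequence is a plain Python string. The helper names `sliding_kmers` and `chunk_kmers` are made up for illustration. Note that with the `len(seq)-k+1` bound, a trailing fragment shorter than k is dropped.

```python
def sliding_kmers(seq, k):
    # Step of 1: consecutive k-mers share k-1 bases (overlapping).
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1)]

def chunk_kmers(seq, k):
    # Step of k: each k-mer starts where the previous one ended.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

seq = "AAACCCGGGTT"  # toy sequence, 11 bases
print(sliding_kmers(seq, 3))  # 9 overlapping 3-mers
print(chunk_kmers(seq, 3))    # ['AAA', 'CCC', 'GGG']; the trailing 'TT' is dropped
```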
ADD COMMENT
0

Andrzej Zielezinski, it is still giving overlapping k-mers. I want non-overlapping 30-mers.

ADD REPLY
1

Can you share your output please? That solution above should absolutely not give you overlapping k-mers.

ADD REPLY
0

I am getting overlapping k-mers: AAGGTTTATACCTTCCCAGGTAACAAACCA AGGTTTATACCTTCCCAGGTAACAAACCAA. But I don't need AAGGTTTATACCTTCCCAGGTAACAAACCA and after this more 30-mers.

ADD REPLY
0

As Dunois pointed out, you won't get overlapping k-mers if you use the suggestion above. Here's an example:

seq = 'AAATTTGGGCCC'
k = 3
kmers = [seq[x:x+k] for x in range(0, len(seq)-k+1, k)]
print(kmers)       # ['AAA', 'TTT', 'GGG', 'CCC']
ADD REPLY
0

please take this sequence: AAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATC TGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACAC GAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTTTGTC CGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGT TTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACATCTTA AAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGT TCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTCGTAG TGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAGAACG GTAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTGATCCT TATGAAGATTTTCAAGAAAACTGGAACACTAAACATAGCAGTGGTGTTACCCGTGAACTCATGCGTGAGCTTAACGGAGG GGCATACACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTTCTAGCAC GTGCTGGTAAAGCTTCATGCACTTTGTCCGAACAACTGGACTTTATTGACACTAAGAGGGGTGTATACTGCTGCCGTGAA CATGAGCATGAAATTGCTTGGTACACGGAACGTTCTGAAAAGAGCTATGAATTGCAGACACCTTTTGAAATTAAATTGGC AAAGAAATTTGACATCTTCAATGGGGAATGTCCAAATTTTGTATTTCCCTTAAATTCCATAATCAAGACTATTCAACCAA GGGTTGAAAAGAAAAAGCTTGATGGCTTTATGGGTAGAATTCGATCTGTCTATCCAGTTGCGTCACCAAATGAATGCAAC CAAATGTGCCTTTCAACTCTCATGAAGTGTGATCATTGTGGTGAAACTTCATGGCAGACGGGCGATTTTGTTAAAGCCAC TTGCGAATTTTGTGGCACTGAGAATTTGACTAAAGAAGGTGCCACTACTTGTGGTTACTTACCCCAAAATGCTGTTGTTA AAATTTATTGTCCAGCATGTCACAATTCAGAAGTAGGACCTGAGCATAGTCTTGCCGAATACCATAATGAATCTGGCTTG AAAACCATTCTTCGTAAGGGTGGTCGCACTATTGCCTTTGGAGGCTGTGTGTTCTCTTATGTTGGTTGCCATAACAAGTG TGCCTATTGGGTTCCACGTGCTAGCGCTAACATAGGTTGTAACCATACAGGTGTTGTTGGAGAAGGTTCCGAAGGTCTTA ATGACAACCTTCTTGAAATACTCCAAAAAGAGAAAGTCAACATCAATATTGTTGGTGACTTTAAACTTAATGAAGAGATC GCCATTATTTTGGCATCTTTTTCTGCTTCCACAAGTGCTTTTGTGGAAACTGTGAAAGGTTTGGATTATAAAGCATTCAA ACAAATTGTTGAATCCTGTGGTAATTTTAAAGTTACAAAAGGAAAAGCTAAAAAAGGTGCCTGGAATATTGGTGAACAGA AATCAATACTGAGTCCTCTTTATGCATTTGCATCAGAGGCTGCTCGTGTTGTACGATCAATTTTCTCCCGCACTCTTGAA 
ACTGCTCAAAATTCTGTGCGTGTTTTACAGAAGGCCGCTATAACAATACTAGATGGAATTTCACAGTATTCACTGAGACT CATTGATGCTATGATGTTCACATCTGATTTGGCTACTAACAATCTAGTTGTAATGGCCTACATTACAGGTGGTGTTGTTC AGTTGACTTCGCAGTGGCTAACTAACATCTTTGGCACTGTTTATGAAAAACTCAAACCCGTCCTTGATTGGCTTGAAGAG AAGTTTAAGGAAGGTGTAGAGTTTCTTAGAGACGGTTGGGAAATTGTTAAATTTATCTCAACCTGTGCTTGTGAAATTGT CGGTGGACAAATTGTCACCTGTGCAAAGGAAATTAAGGAGAGTGTTCAGACATTCTTTAAGCTTGTAAATAAATTTTTGG CTTTGTGTGCTGACTCTATCATTATTGGTGGAGCTAAACTTAAAGCCTTGAATTTAGGTGAAACATTTGTCACGCACTCA AAGGGATTGTACAGAAAGTGTGTTAAATCCAGAGAAGAAACTGGCCTACTCATGCCTCTAAAAGCCCCAAAAGAAATTAT CTTCTTAGAGGGAGAAACACTTCCCACAGAAGTGTTAACAGAGGAAGTTGTCTTGAAAACTGGTGATTTACAACCATTAG AACAACCTACTAGTGAAGCTGTTGAAGCTCCATTGGTTGGTACACCAGTTTGTATTAACGGGCTTATGTTGCTCGAAATC AAAGACACAGAAAAGTACTGTGCCCTTGCACCTAATATGATGGTAACAAACAATACCTTCACACTCAAAGGCGGTGCACC AACAAAGGTTACTTTTGGTGATGACACTGTGATAGAAGTGCAAGGTTACAAGAGTGTGAATATCACTTTTGAACTTGATG AAAGGATTGATAAAGTACTTAATGAGAAGTGCTCTGCCTATACAGTTGAACTCGGTACAGAAGTAAATGAGTTCGCCTGT

ADD REPLY
0

smrutimayipanda :

Please use ADD REPLY when responding to existing posts. ADD ANSWERS is meant for new answers to the original question.

Do not post follow-up/additional material as new answers.

ADD REPLY
0
data=pd.read_csv(r'/content/text.txt')
for indexs in data.index:
    #print(data.loc[indexs].values[0])
    seq = data.loc[indexs].values[0]
k=30
kmer = [seq[x:x+k] for x in range(0, len(seq), k)]
kmers = ' '.join(kmer)
# Open a file with access mode 'a'
with open('sample.txt', 'a') as f:
    # Append the k-mers at the end of the file
    f.write(kmers)

Above is the sequence as text. Please try with this.

ADD REPLY
0

Why does your sequence have whitespace in it? Regardless, running this through

kmers = [seq[x:x+k] for x in range(0, len(seq)-k+1, k)]

Would produce this:

['AAGGTTTATACCTTCCCAGGTAACAAACCA',
 'ACCAACTTTCGATCTCTTGTAGATCTGTTC',
 'TCTAAACGAACTTTAAAATC TGTGTGGCT',
 'GTCACTCGGCTGCATGCTTAGTGCACTCAC',
 'GCAGTATAATTAATAACTAATTACTGTCGT',
 'TGACAGGACAC GAGTAACTCGTCTATCTT',
 'CTGCAGGCTGCTTACGGTTTCGTCCGTGTT',
 'GCAGCCGATCATCAGCACATCTAGGTTTTG',
 'TC CGGGTGTGACCGAAAGGTAAGATGGAG',
 'AGCCTTGTCCCTGGTTTCAACGAGAAAACA',
 'CACGTCCAACTCAGTTTGCCTGT TTTACA',
 'GGTTCGCGACGTGCTCGTACGTGGCTTTGG',
 'AGACTCCGTGGAGGAGGTCTTATCAGAGGC',
 'ACGTCAACATCTTA AAGATGGCACTTGTG',
 'GCTTAGTAGAAGTTGAAAAAGGCGTTTTGC',
 'CTCAACTTGAACAGCCCTATGTGTTCATCA',
 'AACGT TCGGATGCTCGAACTGCACCTCAT',
 'GGTCATGTTATGGTTGAGCTGGTAGCAGAA',
 'CTCGAAGGCATTCAGTACGGTCGTAG TGG',
 'TGAGACACTTGGTGTCCTTGTCCCTCATGT',
 'GGGCGAAATACCAGTGGCTTACCGCAAGGT',
 'TCTTCTTCGTAAGAACG GTAATAAAGGAG',
 'CTGGTGGCCATAGTTACGGCGCCGATCTAA',
 'AGTCATTTGACTTAGGCGACGAGCTTGGCA',
 'CTGATCCT TATGAAGATTTTCAAGAAAAC',
 'TGGAACACTAAACATAGCAGTGGTGTTACC',
 'CGTGAACTCATGCGTGAGCTTAACGGAGG ',
 'GGCATACACTCGCTATGTCGATAACAACTT',
 'CTGTGGCCCTGATGGCTACCCTCTTGAGTG',
 'CATTAAAGACCTTCTAGCAC GTGCTGGTA',
 'AAGCTTCATGCACTTTGTCCGAACAACTGG',
 'ACTTTATTGACACTAAGAGGGGTGTATACT',
 'GCTGCCGTGAA CATGAGCATGAAATTGCT',
 'TGGTACACGGAACGTTCTGAAAAGAGCTAT',
 'GAATTGCAGACACCTTTTGAAATTAAATTG',
 'GC AAAGAAATTTGACATCTTCAATGGGGA',
 'ATGTCCAAATTTTGTATTTCCCTTAAATTC',
 'CATAATCAAGACTATTCAACCAA GGGTTG',
 'AAAAGAAAAAGCTTGATGGCTTTATGGGTA',
 'GAATTCGATCTGTCTATCCAGTTGCGTCAC',
 'CAAATGAATGCAAC CAAATGTGCCTTTCA',
 'ACTCTCATGAAGTGTGATCATTGTGGTGAA',
 'ACTTCATGGCAGACGGGCGATTTTGTTAAA',
 'GCCAC TTGCGAATTTTGTGGCACTGAGAA',
 'TTTGACTAAAGAAGGTGCCACTACTTGTGG',
 'TTACTTACCCCAAAATGCTGTTGTTA AAA',
 'TTTATTGTCCAGCATGTCACAATTCAGAAG',
 'TAGGACCTGAGCATAGTCTTGCCGAATACC',
 'ATAATGAATCTGGCTTG AAAACCATTCTT',
 'CGTAAGGGTGGTCGCACTATTGCCTTTGGA',
 'GGCTGTGTGTTCTCTTATGTTGGTTGCCAT',
 'AACAAGTG TGCCTATTGGGTTCCACGTGC',
 'TAGCGCTAACATAGGTTGTAACCATACAGG',
 'TGTTGTTGGAGAAGGTTCCGAAGGTCTTA ',
 'ATGACAACCTTCTTGAAATACTCCAAAAAG',
 'AGAAAGTCAACATCAATATTGTTGGTGACT',
 'TTAAACTTAATGAAGAGATC GCCATTATT',
 'TTGGCATCTTTTTCTGCTTCCACAAGTGCT',
 'TTTGTGGAAACTGTGAAAGGTTTGGATTAT',
 'AAAGCATTCAA ACAAATTGTTGAATCCTG',
 'TGGTAATTTTAAAGTTACAAAAGGAAAAGC',
 'TAAAAAAGGTGCCTGGAATATTGGTGAACA',
 'GA AATCAATACTGAGTCCTCTTTATGCAT',
 'TTGCATCAGAGGCTGCTCGTGTTGTACGAT',
 'CAATTTTCTCCCGCACTCTTGAA ACTGCT',
 'CAAAATTCTGTGCGTGTTTTACAGAAGGCC',
 'GCTATAACAATACTAGATGGAATTTCACAG',
 'TATTCACTGAGACT CATTGATGCTATGAT',
 'GTTCACATCTGATTTGGCTACTAACAATCT',
 'AGTTGTAATGGCCTACATTACAGGTGGTGT',
 'TGTTC AGTTGACTTCGCAGTGGCTAACTA',
 'ACATCTTTGGCACTGTTTATGAAAAACTCA',
 'AACCCGTCCTTGATTGGCTTGAAGAG AAG',
 'TTTAAGGAAGGTGTAGAGTTTCTTAGAGAC',
 'GGTTGGGAAATTGTTAAATTTATCTCAACC',
 'TGTGCTTGTGAAATTGT CGGTGGACAAAT',
 'TGTCACCTGTGCAAAGGAAATTAAGGAGAG',
 'TGTTCAGACATTCTTTAAGCTTGTAAATAA',
 'ATTTTTGG CTTTGTGTGCTGACTCTATCA',
 'TTATTGGTGGAGCTAAACTTAAAGCCTTGA',
 'ATTTAGGTGAAACATTTGTCACGCACTCA ',
 'AAGGGATTGTACAGAAAGTGTGTTAAATCC',
 'AGAGAAGAAACTGGCCTACTCATGCCTCTA',
 'AAAGCCCCAAAAGAAATTAT CTTCTTAGA',
 'GGGAGAAACACTTCCCACAGAAGTGTTAAC',
 'AGAGGAAGTTGTCTTGAAAACTGGTGATTT',
 'ACAACCATTAG AACAACCTACTAGTGAAG',
 'CTGTTGAAGCTCCATTGGTTGGTACACCAG',
 'TTTGTATTAACGGGCTTATGTTGCTCGAAA',
 'TC AAAGACACAGAAAAGTACTGTGCCCTT',
 'GCACCTAATATGATGGTAACAAACAATACC',
 'TTCACACTCAAAGGCGGTGCACC AACAAA',
 'GGTTACTTTTGGTGATGACACTGTGATAGA',
 'AGTGCAAGGTTACAAGAGTGTGAATATCAC',
 'TTTTGAACTTGATG AAAGGATTGATAAAG',
 'TACTTAATGAGAAGTGCTCTGCCTATACAG',
 'TTGAACTCGGTACAGAAGTAAATGAGTTCG']

Where exactly are the overlapping k-mers in there?
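
The spaces visible inside some of the 30-mers above come from the whitespace in the pasted sequence. A small sketch, assuming the sequence is held in an ordinary Python string, that removes all whitespace before chunking (the toy input and variable names are illustrative only):

```python
# Toy input containing spaces and a newline, as in a copy-pasted sequence.
raw = "AAGGTTTATA CCTTCCCAGG\nTAACAAACCA ACCAACTTTC"

# str.split() with no arguments splits on any run of whitespace.
seq = "".join(raw.split())

k = 10
kmers = [seq[x:x + k] for x in range(0, len(seq) - k + 1, k)]
print(kmers)
```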

ADD REPLY
0

Dunois, you are running only one command on these sequences. Can you please store all these sequences and then run this command? This is a big FASTA file, so I have to run it on that.

ADD REPLY
0

Traceback (most recent call last):
  File "test.py", line 30, in <module>
    kmer = [seq[x:x+k] for x in range(0, len(seq)-k+1, k)]
TypeError: object of type 'float' has no len()

I am getting this error when I used -k+1 in len(seq). That's why I am saying run the full script.

ADD REPLY
0

Why are you using pandas to read a simple text file? This is likely where your issues are coming in from - not the kmer calculation.

ADD REPLY
0

Then how can I read the FASTA file? Can you please tell me?

ADD REPLY
0

You should use Biopython, but you can use any standard line-by-line Python file-reading method as long as you handle the headers etc. properly.
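
As a rough illustration of the line-by-line option, here is a minimal sketch of a plain-Python FASTA reader. The function name `read_fasta`, the file name in the comment, and the toy records are made up for this example; for real data Biopython's SeqIO is likely the more robust choice.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequence lines of one record may be wrapped; collect them all.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Toy in-memory file; for a real file use e.g. `with open("my.fasta") as fh:`
# (the file name is hypothetical).
demo = io.StringIO(">seq0\nAAATTT\nGGG\n\n>seq1\nCCCCAAAA\n")
records = list(read_fasta(demo))
print(records)
```

Each sequence comes back as a single string with the wrapped lines joined, so it can be passed straight to the k-mer list comprehension.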

ADD REPLY
2
2.1 years ago
Dunois ★ 2.8k

Here, try this, this is a full solution. You'll need to have biopython (Bio) and pandas installed via pip or conda for this to work. collections should be available already.

The code:

import pandas as pd
from Bio import SeqIO
from collections import defaultdict

def get_ukmers(seq, k = 30):
    # Step through the sequence k bases at a time so the k-mers do not overlap.
    # A trailing fragment shorter than k is dropped.
    kmers = [seq[x:x+k] for x in range(0, len(seq)-k+1, k)]
    kout = ";".join(str(kmer) for kmer in kmers)
    return kout

def extr_nlkmers(inpfa, outfile, k = 30):
    tmpdict = defaultdict(list)

    # The with-block closes the file automatically.
    with open(inpfa) as fas:
        print("Extracting non-overlapping k-mers from file:", str(inpfa))
        for record in SeqIO.parse(fas, "fasta"):
            print("Opening sequence:", record.name)
            tmpdict['header'].append(record.name)
            tmpdict['seq'].append(str(record.seq))
            tmpdict['nl_kmers'].append(get_ukmers(record.seq, k))
            print("Done handling sequence:", record.name)
        print("All done.")

    # One row per FASTA record: header, full sequence, ';'-joined k-mers.
    df = pd.DataFrame.from_dict(tmpdict)
    df.to_csv(outfile)
    print("Data for all sequences written to file:", outfile)

    return df

You can execute it like so:

#Set a path for an input and output file.
#The input file is a FASTA file.
#The output file is a .CSV file
myin = "test.fasta"
myout = "unique_kmers.csv"

#Extract the non-overlapping k-mers and
#write them to file.
#The DataFrame is also returned, so it
#stays available in the session.
df = extr_nlkmers(myin, myout, k = 30)

df
#Output
# Extracting non-overlapping k-mers from file: test.fasta
# Opening sequence: seq0
# Done handling sequence: seq0
# Opening sequence: seq1
# Done handling sequence: seq1
# Opening sequence: seq2
# Done handling sequence: seq2
# Opening sequence: seq3
# Done handling sequence: seq3
# Opening sequence: seq4
# Done handling sequence: seq4
# Opening sequence: seq5
# Done handling sequence: seq5
# Opening sequence: seq6
# Done handling sequence: seq6
# Opening sequence: seq7
# Done handling sequence: seq7
# Opening sequence: seq8
# Done handling sequence: seq8
# Opening sequence: seq9
# Done handling sequence: seq9
# Opening sequence: seq10
# Done handling sequence: seq10
# Opening sequence: seq11
# Done handling sequence: seq11
# Opening sequence: seq12
# Done handling sequence: seq12
# Opening sequence: seq13
# Done handling sequence: seq13
# Opening sequence: seq14
# Done handling sequence: seq14
# Opening sequence: seq15
# Done handling sequence: seq15
# Opening sequence: seq16
# Done handling sequence: seq16
# Opening sequence: seq17
# Done handling sequence: seq17
# Opening sequence: seq18
# Done handling sequence: seq18
# Opening sequence: seq19
# Done handling sequence: seq19
# Opening sequence: seq20
# Done handling sequence: seq20
# Opening sequence: seq21
# Done handling sequence: seq21
# Opening sequence: seq22
# Done handling sequence: seq22
# Opening sequence: seq23
# Done handling sequence: seq23
# Opening sequence: seq24
# Done handling sequence: seq24
# Opening sequence: seq25
# Done handling sequence: seq25
# Opening sequence: seq26
# Done handling sequence: seq26
# Opening sequence: seq27
# Done handling sequence: seq27
# Opening sequence: seq28
# Done handling sequence: seq28
# Opening sequence: seq29
# Done handling sequence: seq29
# Opening sequence: seq30
# Done handling sequence: seq30
# Opening sequence: seq31
# Done handling sequence: seq31
# Opening sequence: seq32
# Done handling sequence: seq32
# Opening sequence: seq33
# Done handling sequence: seq33
# Opening sequence: seq34
# Done handling sequence: seq34
# Opening sequence: seq35
# Done handling sequence: seq35
# All done.
# Data for all sequences written to file: unique_kmers.csv

And you'll also get a CSV table that looks like this:

# header    seq     nl_kmers
# 0     seq0    AAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTC...   AAGGTTTATACCTTCCCAGGTAACAAACCA;ACCAACTTTCGATCT...
# 1     seq1    TGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTAT...   TGTGTGGCTGTCACTCGGCTGCATGCTTAG;TGCACTCACGCAGTA...
# 2     seq2    GAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG...   GAGTAACTCGTCTATCTTCTGCAGGCTGCT;TACGGTTTCGTCCGT...
# 3     seq3    CGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCA...   CGGGTGTGACCGAAAGGTAAGATGGAGAGC;CTTGTCCCTGGTTTC...
# 4     seq4    TTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTG...   TTTACAGGTTCGCGACGTGCTCGTACGTGG;CTTTGGAGACTCCGT...
# 5     seq5    AAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCC...   AAGATGGCACTTGTGGCTTAGTAGAAGTTG;AAAAAGGCGTTTTGC...
# 6     seq6    TCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGG...   TCGGATGCTCGAACTGCACCTCATGGTCAT;GTTATGGTTGAGCTG...
# 7     seq7    TGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCA...   TGGTGAGACACTTGGTGTCCTTGTCCCTCA;TGTGGGCGAAATACC...
# 8     seq8    GTAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTC...   GTAATAAAGGAGCTGGTGGCCATAGTTACG;GCGCCGATCTAAAGT...
# 9     seq9    TATGAAGATTTTCAAGAAAACTGGAACACTAAACATAGCAGTGGTG...   TATGAAGATTTTCAAGAAAACTGGAACACT;AAACATAGCAGTGGT...
# 10    seq10   GGCATACACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGC...   GGCATACACTCGCTATGTCGATAACAACTT;CTGTGGCCCTGATGG...
# 11    seq11   GTGCTGGTAAAGCTTCATGCACTTTGTCCGAACAACTGGACTTTAT...   GTGCTGGTAAAGCTTCATGCACTTTGTCCG;AACAACTGGACTTTA...
# 12    seq12   CATGAGCATGAAATTGCTTGGTACACGGAACGTTCTGAAAAGAGCT...   CATGAGCATGAAATTGCTTGGTACACGGAA;CGTTCTGAAAAGAGC...
# 13    seq13   AAAGAAATTTGACATCTTCAATGGGGAATGTCCAAATTTTGTATTT...   AAAGAAATTTGACATCTTCAATGGGGAATG;TCCAAATTTTGTATT...
# 14    seq14   GGGTTGAAAAGAAAAAGCTTGATGGCTTTATGGGTAGAATTCGATC...   GGGTTGAAAAGAAAAAGCTTGATGGCTTTA;TGGGTAGAATTCGAT...
# 15    seq15   CAAATGTGCCTTTCAACTCTCATGAAGTGTGATCATTGTGGTGAAA...   CAAATGTGCCTTTCAACTCTCATGAAGTGT;GATCATTGTGGTGAA...
# 16    seq16   TTGCGAATTTTGTGGCACTGAGAATTTGACTAAAGAAGGTGCCACT...   TTGCGAATTTTGTGGCACTGAGAATTTGAC;TAAAGAAGGTGCCAC...
# 17    seq17   AAATTTATTGTCCAGCATGTCACAATTCAGAAGTAGGACCTGAGCA...   AAATTTATTGTCCAGCATGTCACAATTCAG;AAGTAGGACCTGAGC...
# 18    seq18   AAAACCATTCTTCGTAAGGGTGGTCGCACTATTGCCTTTGGAGGCT...   AAAACCATTCTTCGTAAGGGTGGTCGCACT;ATTGCCTTTGGAGGC...
# 19    seq19   TGCCTATTGGGTTCCACGTGCTAGCGCTAACATAGGTTGTAACCAT...   TGCCTATTGGGTTCCACGTGCTAGCGCTAA;CATAGGTTGTAACCA...
# 20    seq20   ATGACAACCTTCTTGAAATACTCCAAAAAGAGAAAGTCAACATCAA...   ATGACAACCTTCTTGAAATACTCCAAAAAG;AGAAAGTCAACATCA...
# 21    seq21   GCCATTATTTTGGCATCTTTTTCTGCTTCCACAAGTGCTTTTGTGG...   GCCATTATTTTGGCATCTTTTTCTGCTTCC;ACAAGTGCTTTTGTG...
# 22    seq22   ACAAATTGTTGAATCCTGTGGTAATTTTAAAGTTACAAAAGGAAAA...   ACAAATTGTTGAATCCTGTGGTAATTTTAA;AGTTACAAAAGGAAA...
# 23    seq23   AATCAATACTGAGTCCTCTTTATGCATTTGCATCAGAGGCTGCTCG...   AATCAATACTGAGTCCTCTTTATGCATTTG;CATCAGAGGCTGCTC...
# 24    seq24   ACTGCTCAAAATTCTGTGCGTGTTTTACAGAAGGCCGCTATAACAA...   ACTGCTCAAAATTCTGTGCGTGTTTTACAG;AAGGCCGCTATAACA...
# 25    seq25   CATTGATGCTATGATGTTCACATCTGATTTGGCTACTAACAATCTA...   CATTGATGCTATGATGTTCACATCTGATTT;GGCTACTAACAATCT...
# 26    seq26   AGTTGACTTCGCAGTGGCTAACTAACATCTTTGGCACTGTTTATGA...   AGTTGACTTCGCAGTGGCTAACTAACATCT;TTGGCACTGTTTATG...
# 27    seq27   AAGTTTAAGGAAGGTGTAGAGTTTCTTAGAGACGGTTGGGAAATTG...   AAGTTTAAGGAAGGTGTAGAGTTTCTTAGA;GACGGTTGGGAAATT...
# 28    seq28   CGGTGGACAAATTGTCACCTGTGCAAAGGAAATTAAGGAGAGTGTT...   CGGTGGACAAATTGTCACCTGTGCAAAGGA;AATTAAGGAGAGTGT...
# 29    seq29   CTTTGTGTGCTGACTCTATCATTATTGGTGGAGCTAAACTTAAAGC...   CTTTGTGTGCTGACTCTATCATTATTGGTG;GAGCTAAACTTAAAG...
# 30    seq30   AAGGGATTGTACAGAAAGTGTGTTAAATCCAGAGAAGAAACTGGCC...   AAGGGATTGTACAGAAAGTGTGTTAAATCC;AGAGAAGAAACTGGC...
# 31    seq31   CTTCTTAGAGGGAGAAACACTTCCCACAGAAGTGTTAACAGAGGAA...   CTTCTTAGAGGGAGAAACACTTCCCACAGA;AGTGTTAACAGAGGA...
# 32    seq32   AACAACCTACTAGTGAAGCTGTTGAAGCTCCATTGGTTGGTACACC...   AACAACCTACTAGTGAAGCTGTTGAAGCTC;CATTGGTTGGTACAC...
# 33    seq33   AAAGACACAGAAAAGTACTGTGCCCTTGCACCTAATATGATGGTAA...   AAAGACACAGAAAAGTACTGTGCCCTTGCA;CCTAATATGATGGTA...
# 34    seq34   AACAAAGGTTACTTTTGGTGATGACACTGTGATAGAAGTGCAAGGT...   AACAAAGGTTACTTTTGGTGATGACACTGT;GATAGAAGTGCAAGG...
# 35    seq35   AAAGGATTGATAAAGTACTTAATGAGAAGTGCTCTGCCTATACAGT...   AAAGGATTGATAAAGTACTTAATGAGAAGT;GCTCTGCCTATACAG...

And if what you're worried about are pairs of k-mers like AAGGTTTATACCTTCCCAGGTAACAAACCA;ACCAACTTTCGATCTCTTGTAGATCTGTTC (from the very first sequence), these aren't overlapping k-mers. The parent sequence just happens to contain a short repeat (ACCAACCA) at that position, so the first k-mer ends with the same bases (ACCA) that the next one starts with:

"AAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATC"

You're confusing overlaps in the k-mer sequence with overlapping k-mers (overlapping k-mers are generated via a sliding window approach).
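
One way to check this, sketched here with the first 60 bases of the sequence from the thread, is to look at the start and end positions of each k-mer rather than at its letters. The variable names are illustrative.

```python
# First 60 bases of the example sequence, written as the two 30-mers in question.
seq = ("AAGGTTTATACCTTCCCAGGTAACAAACCA"
       "ACCAACTTTCGATCTCTTGTAGATCTGTTC")
k = 30

starts = list(range(0, len(seq) - k + 1, k))
kmers = [seq[s:s + k] for s in starts]

# Half-open intervals [start, start + k): each begins where the previous one ends,
# so no position in the sequence is used twice.
intervals = [(s, s + k) for s in starts]
print(intervals)  # [(0, 30), (30, 60)]

# The k-mers share letters at the boundary (both contain 'ACCA' there) ...
print(kmers[0][-4:], kmers[1][:4])

# ... yet concatenating them reproduces the covered part of the sequence exactly.
print("".join(kmers) == seq[:len(kmers) * k])
```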

ADD COMMENT
0
2.1 years ago
Mensur Dlakic ★ 28k

This is not productive, nor is it easy to follow with the ever-shrinking responses pushed into ever-narrower columns. So please don't turn this post into a reply, as it may help others to read it as full-width text.

The solution to this problem was posted days ago. Arguing that someone's solution doesn't work without actually trying it would, for me, be an instant reason to stop this discussion. That's why I am grateful for the many generous people on this platform, as the original poster simply refuses to read what is there.

As for reading FASTA files without pandas, there are dedicated libraries such as Biopython. Even a casual search of this website, not to mention the whole internet, will turn up many ways of reading FASTA without pandas.

ADD COMMENT
