Finding Kmers From Fasta File
3
3
Entering edit mode
7.4 years ago

Can anyone help me .. how to find k-mer of 4 from a fasta file in Python3 .. Thanks in advance i'm a beginner ...

sequencing RNA-Seq gene • 8.5k views
ADD COMMENT
3
Entering edit mode
7.4 years ago

I wrote a command-line k-mer counter called kmer-counter that will output results in a form that a Python script can consume: https://github.com/alexpreynolds/kmer-counter

You can grab, build and install it like so:

$ git clone https://github.com/alexpreynolds/kmer-counter.git
$ cd kmer-counter
$ make
$ cp kmer-counter /usr/local/bin

Once the binary is in your path, you might use it in Python like so:

k = 6
fastaFile = '/path/to/some/seqs.fa'
kmerCmd = 'kmer-counter --fasta --k=%d %s' % (k, fastaFile)
try:
    output = subprocess.check_output(kmerCmd, shell=True)
    result = {}
    for line in output.splitlines():
        (header, counts) = line.strip().split('\t')
        header = header[1:]
        kmers = dict((k,int(v)) for (k,v) in [d.split(':') for d in counts.split(' ')])
        result[header] = kmers
    sys.stdout.write("%s" % (str(result)))
except subprocess.CalledProcessError as error:
    sys.stderr.write("%s" % (str(error)))

Replace k with your desired value (4, whatever).

Given example FASTA like this:

>foo
TTAACG
>bar
GTGGAAGTTCTTAGGGCATGGCAAAGAGTCAGAATTTGAC

For k=6, you would get an iterable Python dictionary like this:

{'foo': {'TTAACG': 1, 'CGTTAA': 1}, 'bar': {'GTTCTT': 1, 'AGAACT': 1, 'GAGTCA': 1, 'ATGGCA': 1, 'GAACTT': 1, 'ATTCTG': 1, 'CTAAGA': 1, 'CTTCCA': 1, 'ATTTGA': 1, 'GGAAGT': 1, 'AGGGCA': 1, 'CCTAAG': 1, 'CTCTTT': 1, 'AATTTG': 1, 'TCTGAC': 1, 'TTTGCC': 1, 'CTTAGG': 1, 'TTTGAC': 1, 'GAAGTT': 1, 'CCCTAA': 1, 'AGAATT': 1, 'AGTCAG': 1, 'CTGACT': 1, 'TCTTAG': 1, 'CGTTAA': 1, 'GTGGAA': 1, 'TGCCAT': 1, 'ACTCTT': 1, 'GGGCAT': 1, 'TTAGGG': 1, 'CTTTGC': 1, 'TGGAAG': 1, 'GACTCT': 1, 'CATGCC': 1, 'GCAAAG': 1, 'AAATTC': 1, 'GTCAAA': 1, 'TGACTC': 1, 'TAGGGC': 1, 'AAGTTC': 1, 'ATGCCC': 1, 'TCAAAT': 1, 'CAAAGA': 1, 'AACTTC': 1, 'GTCAGA': 1, 'CAAATT': 1, 'TAAGAA': 1, 'CATGGC': 1, 'AAGAAC': 1, 'AAGAGT': 1, 'TCTTTG': 1, 'TTCCAC': 1, 'TGGCAA': 1, 'GGCAAA': 1, 'AGTTCT': 1, 'AGAGTC': 1, 'TCAGAA': 1, 'GAATTT': 1, 'AAAGAG': 1, 'TGCCCT': 1, 'CCATGC': 1, 'GGCATG': 1, 'TTGCCA': 1, 'CAGAAT': 1, 'AATTCT': 1, 'GCATGG': 1, 'ACTTCC': 1, 'TTCTTA': 1, 'GCCATG': 1, 'GCCCTA': 1, 'TTCTGA': 1}}

For a fully-fleshed out demonstration, see: https://github.com/alexpreynolds/kmer-counter/blob/master/test/kmer-test.py

ADD COMMENT
2
Entering edit mode
6.5 years ago
sacha ★ 2.4k

Use jellyfish

This is your test.fa

>test
TAATGCCATGGGATGTT

You want to count all 3-mers

    jellyfish count -m 3 -s 100000 test.fa  -o test.jf 
    jellyfish dump -c test.jf

The last command returns :

TGT 1
GAT 1
GGG 1
GGA 1
CAT 1
TGC 1
TAA 1
GCC 1
CCA 1
GTT 1
TGG 1
ATG 3
AAT 1
ADD COMMENT
0
Entering edit mode

Jellyfish would only work DNA, not proteins.

ADD REPLY
0
Entering edit mode
6.5 years ago

Try modifying this code

string=raw_input('Enter string:') 
kmer_size=input('Kmer_size: ')

print 'Total kmers: ', len(string)-kmer_size+1

for i in range(0,len(string)-kmer_size+1):
    string=list(string)
    kmer=''.join(string[i:i+kmer_size])
    print kmer
ADD COMMENT

Login before adding your answer.

Traffic: 1867 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6