Can anyone help me .. how to find k-mer of 4 from a fasta file in Python3 .. Thanks in advance i'm a beginner ...
Can anyone help me .. how to find k-mer of 4 from a fasta file in Python3 .. Thanks in advance i'm a beginner ...
I wrote a command-line k-mer counter called kmer-counter
that will output results in a form that a Python script can consume: https://github.com/alexpreynolds/kmer-counter
You can grab, build and install it like so:
$ git clone https://github.com/alexpreynolds/kmer-counter.git
$ cd kmer-counter
$ make
$ cp kmer-counter /usr/local/bin
Once the binary is in your path, you might use it in Python like so:
k = 6
fastaFile = '/path/to/some/seqs.fa'
kmerCmd = 'kmer-counter --fasta --k=%d %s' % (k, fastaFile)
try:
output = subprocess.check_output(kmerCmd, shell=True)
result = {}
for line in output.splitlines():
(header, counts) = line.strip().split('\t')
header = header[1:]
kmers = dict((k,int(v)) for (k,v) in [d.split(':') for d in counts.split(' ')])
result[header] = kmers
sys.stdout.write("%s" % (str(result)))
except subprocess.CalledProcessError as error:
sys.stderr.write("%s" % (str(error)))
Replace k
with your desired value (4, whatever).
Given example FASTA like this:
>foo
TTAACG
>bar
GTGGAAGTTCTTAGGGCATGGCAAAGAGTCAGAATTTGAC
For k=6, you would get an iterable Python dictionary like this:
{'foo': {'TTAACG': 1, 'CGTTAA': 1}, 'bar': {'GTTCTT': 1, 'AGAACT': 1, 'GAGTCA': 1, 'ATGGCA': 1, 'GAACTT': 1, 'ATTCTG': 1, 'CTAAGA': 1, 'CTTCCA': 1, 'ATTTGA': 1, 'GGAAGT': 1, 'AGGGCA': 1, 'CCTAAG': 1, 'CTCTTT': 1, 'AATTTG': 1, 'TCTGAC': 1, 'TTTGCC': 1, 'CTTAGG': 1, 'TTTGAC': 1, 'GAAGTT': 1, 'CCCTAA': 1, 'AGAATT': 1, 'AGTCAG': 1, 'CTGACT': 1, 'TCTTAG': 1, 'CGTTAA': 1, 'GTGGAA': 1, 'TGCCAT': 1, 'ACTCTT': 1, 'GGGCAT': 1, 'TTAGGG': 1, 'CTTTGC': 1, 'TGGAAG': 1, 'GACTCT': 1, 'CATGCC': 1, 'GCAAAG': 1, 'AAATTC': 1, 'GTCAAA': 1, 'TGACTC': 1, 'TAGGGC': 1, 'AAGTTC': 1, 'ATGCCC': 1, 'TCAAAT': 1, 'CAAAGA': 1, 'AACTTC': 1, 'GTCAGA': 1, 'CAAATT': 1, 'TAAGAA': 1, 'CATGGC': 1, 'AAGAAC': 1, 'AAGAGT': 1, 'TCTTTG': 1, 'TTCCAC': 1, 'TGGCAA': 1, 'GGCAAA': 1, 'AGTTCT': 1, 'AGAGTC': 1, 'TCAGAA': 1, 'GAATTT': 1, 'AAAGAG': 1, 'TGCCCT': 1, 'CCATGC': 1, 'GGCATG': 1, 'TTGCCA': 1, 'CAGAAT': 1, 'AATTCT': 1, 'GCATGG': 1, 'ACTTCC': 1, 'TTCTTA': 1, 'GCCATG': 1, 'GCCCTA': 1, 'TTCTGA': 1}}
For a fully-fleshed out demonstration, see: https://github.com/alexpreynolds/kmer-counter/blob/master/test/kmer-test.py
Use jellyfish
This is your test.fa
>test
TAATGCCATGGGATGTT
You want to count all 3-mers
jellyfish count -m 3 -s 100000 test.fa -o test.jf
jellyfish dump -c test.jf
The last command returns :
TGT 1
GAT 1
GGG 1
GGA 1
CAT 1
TGC 1
TAA 1
GCC 1
CCA 1
GTT 1
TGG 1
ATG 3
AAT 1
Try modifying this code
string=raw_input('Enter string:')
kmer_size=input('Kmer_size: ')
print 'Total kmers: ', len(string)-kmer_size+1
for i in range(0,len(string)-kmer_size+1):
string=list(string)
kmer=''.join(string[i:i+kmer_size])
print kmer
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Jellyfish would only work DNA, not proteins.