if you have a set of k-mers of a given length, how you can compare each k-mer with each k-length sub string of a DNA sequence?
For example,
k-mers, (for k=4)
AAAA AAAT AAAG AAAC . . . . TTTT
sequence =ATGCCCATCAAAGGCTCATTGCGACC
if you have a set of k-mers of a given length, how you can compare each k-mer with each k-length sub string of a DNA sequence?
For example,
k-mers, (for k=4)
AAAA AAAT AAAG AAAC . . . . TTTT
sequence =ATGCCCATCAAAGGCTCATTGCGACC
In R, after installing the Biostrings Bioconductor package:
library(Biostrings)
s = DNAString('ATGCCCATCAAAGGCTCATTGCGACC')
kmercounts = oligonucleotideFrequency(s,4)
head(kmercounts)
kmercounts[kmercounts>0]
The last line above returns:
AAAG AAGG AGGC ATCA ATGC ATTG CAAA CATC CATT CCAT CCCA CGAC CTCA GACC GCCC GCGA
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
GCTC GGCT TCAA TCAT TGCC TGCG TTGC
1 1 1 1 1 1 1
If you want to know the kmer count for a specific kmer, you can do this:
x['AAAG']
which returns:
AAAG
1
Jellyfish is a great program for this purpose.
If it's a short sequence you can do it in python like this:
seq = 'AGATAGATAGACACAGAAATGGGACCACAC'
kmers = {}
k = 4
for i in range(len(seq) - k + 1):
kmer = seq[i:i+k]
if kmers.has_key(kmer):
kmers[kmer] += 1
else:
kmers[kmer] = 1
for kmer, count in kmers.items():
print kmer + "\t" + str(count)
If it's longer, like a whole genome, I would use jellyfish like Alastair suggested.
if you want a sorted list of kmers you can append this to the above dode:
import operator
sortedKmer = kmers.items()
sortedKmer.sort(key = operator.itemgetter(1), reverse = True)
for item in sortedKmer:
print item[0] + "\t" + str(item[1])
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
This is great. Thanks
Can we use this in python?