Question

How To Compare All The K-Mers Of A Given Length With All The K-Length Sub Strings Of A Dna Sequence?

5

12.7 years ago

Ceilia ▴ 50

If you have a set of k-mers of a given length, how can you compare each k-mer with each k-length substring of a DNA sequence?

For example,

k-mers, (for k=4)

AAAA AAAT AAAG AAAC . . . . TTTT

sequence =ATGCCCATCAAAGGCTCATTGCGACC

• 20k views

ADD COMMENT • link updated 12.7 years ago by Damian Kao 16k • written 12.7 years ago by Ceilia ▴ 50

score 8 · Answer 1 · 2012-03-06

12.7 years ago

Sean Davis 27k

In R, after installing the Biostrings Bioconductor package:

library(Biostrings)
s = DNAString('ATGCCCATCAAAGGCTCATTGCGACC')
kmercounts = oligonucleotideFrequency(s,4)
head(kmercounts)
kmercounts[kmercounts>0]

The last line above returns:

AAAG AAGG AGGC ATCA ATGC ATTG CAAA CATC CATT CCAT CCCA CGAC CTCA GACC GCCC GCGA 
   1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
GCTC GGCT TCAA TCAT TGCC TGCG TTGC 
   1    1    1    1    1    1    1

If you want to know the kmer count for a specific kmer, you can do this:

kmercounts['AAAG']

which returns:

AAAG 
   1

ADD COMMENT • link 12.7 years ago by Sean Davis 27k

0

This is great. Thanks

ADD REPLY • link 12.7 years ago by Gjain 5.8k

0

Can we use this in Python?

ADD REPLY • link 7.9 years ago by syrttgumpwork • 0

score 4 · Answer 2 · 2012-03-06

12.7 years ago

Alastair Kerr 5.3k

Jellyfish is a great program for this purpose.

ADD COMMENT • link 12.7 years ago by Alastair Kerr 5.3k

1

Jellyfish has pretty much streamlined it, as far as I've heard

ADD REPLY • link 12.7 years ago by Lee Katz ★ 3.2k

score 3 · Answer 3 · 2012-03-06

12.7 years ago

Damian Kao 16k

If it's a short sequence, you can do it in Python like this:

seq = 'AGATAGATAGACACAGAAATGGGACCACAC'
kmers = {}  # maps each k-mer to its number of occurrences
k = 4
# Slide a window of width k along the sequence
for i in range(len(seq) - k + 1):
    kmer = seq[i:i+k]
    kmers[kmer] = kmers.get(kmer, 0) + 1

# Print each k-mer and its count, tab-separated
for kmer, count in kmers.items():
    print(kmer + "\t" + str(count))

If it's longer, like a whole genome, I would use Jellyfish, as Alastair suggested.

If you want a list of k-mers sorted by count, you can append this to the above code:

import operator
# Sort (kmer, count) pairs by count, most frequent first
sortedKmer = sorted(kmers.items(), key=operator.itemgetter(1), reverse=True)
for item in sortedKmer:
    print(item[0] + "\t" + str(item[1]))
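
The loop above only records k-mers that actually occur in the sequence. If you also want the k-mers from the question's list that are absent (AAAA, AAAT, ... TTTT), one option is to start every possible k-mer at zero. This is only a rough sketch: the function name `all_kmer_counts` and the `alphabet` parameter are made up for illustration, and it assumes an uppercase A/C/G/T sequence.

```python
from itertools import product

def all_kmer_counts(seq, k, alphabet="ACGT"):
    """Count every possible k-mer over `alphabet` in `seq`, including zeros."""
    # Start each of the len(alphabet)**k possible k-mers at zero.
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    # Slide a window of width k along the sequence.
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # ignore windows with other symbols, e.g. N
            counts[kmer] += 1
    return counts

# Sequence taken from the question
counts = all_kmer_counts("ATGCCCATCAAAGGCTCATTGCGACC", 4)
print(len(counts))      # number of possible 4-mers
print(counts["AAAG"])   # occurs in the sequence
print(counts["AAAA"])   # does not occur, so it stays at zero
```

The table grows as 4^k, so this is only practical for small k; for large k or whole genomes a dedicated counter is the better choice.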

ADD COMMENT • link 12.7 years ago by Damian Kao 16k

1

Thanks. Can you please mention how to do it in C?

ADD REPLY • link 12.7 years ago by User 8217 ▴ 10

0

jellyfish is GNU C

ADD REPLY • link 10.3 years ago by hobsonlane • 0

0

You can use a collections.Counter to more efficiently count the kmers after they are generated by your substring iterator.
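
A minimal sketch of that suggestion, reusing the sequence from the answer above. The helper name `count_kmers` is just for illustration; `Counter` and `most_common` come from the standard library.

```python
from collections import Counter

def count_kmers(seq, k):
    """Return a Counter mapping each k-length substring of seq to its count."""
    # The generator yields every window of width k; Counter tallies them.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

kmers = count_kmers('AGATAGATAGACACAGAAATGGGACCACAC', 4)

# most_common() returns (kmer, count) pairs sorted by descending count
for kmer, count in kmers.most_common(5):
    print(f"{kmer}\t{count}")
```

A `Counter` returns 0 for k-mers it has never seen, so lookups such as `kmers['TTTT']` do not raise a `KeyError`.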

ADD REPLY • link 10.3 years ago by hobsonlane • 0