How To Compare All The K-Mers Of A Given Length With All The K-Length Sub Strings Of A Dna Sequence?
3
5
Entering edit mode
12.7 years ago
Ceilia ▴ 50

if you have a set of k-mers of a given length, how you can compare each k-mer with each k-length sub string of a DNA sequence?

For example,

k-mers, (for k=4)

AAAA AAAT AAAG AAAC . . . . TTTT

sequence =ATGCCCATCAAAGGCTCATTGCGACC

• 20k views
ADD COMMENT
8
Entering edit mode
12.7 years ago

In R, after installing the Biostrings Bioconductor package:

library(Biostrings)
s = DNAString('ATGCCCATCAAAGGCTCATTGCGACC')
kmercounts = oligonucleotideFrequency(s,4)
head(kmercounts)
kmercounts[kmercounts>0]

The last line above returns:

AAAG AAGG AGGC ATCA ATGC ATTG CAAA CATC CATT CCAT CCCA CGAC CTCA GACC GCCC GCGA 
   1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
GCTC GGCT TCAA TCAT TGCC TGCG TTGC 
   1    1    1    1    1    1    1

If you want to know the kmer count for a specific kmer, you can do this:

x['AAAG']

which returns:

AAAG 
   1
ADD COMMENT
0
Entering edit mode

This is great. Thanks

ADD REPLY
0
Entering edit mode

Can we use this in python?

ADD REPLY
4
Entering edit mode
12.7 years ago

Jellyfish is a great program for this purpose.

ADD COMMENT
1
Entering edit mode

Jellyfish has pretty much streamlined it, as far as I've heard

ADD REPLY
3
Entering edit mode
12.7 years ago

If it's a short sequence you can do it in python like this:

seq = 'AGATAGATAGACACAGAAATGGGACCACAC'
kmers = {}
k = 4
for i in range(len(seq) - k + 1):
   kmer = seq[i:i+k]
   if kmers.has_key(kmer):
      kmers[kmer] += 1
   else:
      kmers[kmer] = 1

for kmer, count in kmers.items():
   print kmer + "\t" + str(count)

If it's longer, like a whole genome, I would use jellyfish like Alastair suggested.

if you want a sorted list of kmers you can append this to the above dode:

import operator
sortedKmer = kmers.items()
sortedKmer.sort(key = operator.itemgetter(1), reverse = True)
for item in sortedKmer:
   print item[0] + "\t" + str(item[1])
ADD COMMENT
1
Entering edit mode

thanks.Can you please mention how to do it in C?

ADD REPLY
0
Entering edit mode

jellyfish is GNU C

ADD REPLY
0
Entering edit mode

You can use a collections.Counter to more efficiently count the kmers after they are generated by your substring iterator.

ADD REPLY

Login before adding your answer.

Traffic: 1420 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6