Question

Wrong Alphabet In Biopython.Motifs When Trying To Pssm.Calculate

1

Entering edit mode

11.8 years ago

Jelena_bioinf ▴ 40

I'm trying to find motif occurences in the upstream:

upstream = Seq(mysequence, IUPAC.unambiguous_dna)
pssm.calculate(upstream)

get error:

ValueError: Wrong alphabet! Use only with DNA motifs

if I check alphabets:

>>>pssm.alphabet
IUPACUnambiguousDNA()
>>>upstream.alphabet
IUPACUnambiguousDNA()

The source code where is fails:

  if self.alphabet!=IUPAC.unambiguous_dna:
        raise ValueError("Wrong alphabet! Use only with DNA motifs")

What could I do wrong? What's really the difference between IUPACUnambiguousDNA() and IUPAC.unambiguous_dna?

biopython motif • 4.2k views

ADD COMMENT • link updated 11.8 years ago by Istvan Albert 102k • written 11.8 years ago by Jelena_bioinf ▴ 40

score 4 · Answer 1 · 2013-06-04

4

Entering edit mode

11.8 years ago

Istvan Albert 102k

The problem is with biopython - it tests instances for equality instead of testing them for having the same class.

IUPACUnambiguousDNA() and IUPAC.unambiguous_dna may actually represent different instances of the same class you can see that by doing something like:

print id(pssm.alphabet)
print id(upstream.alphabet)

These will not test as being equal when checking with the = operator.

When you create the pssm object make sure to use the the IUPAC.unambiguous_dna in the constructor.

ADD COMMENT • link 11.8 years ago by Istvan Albert 102k

2

Entering edit mode

I agree with your diagnosis and the suggested workaround Istvan. We'll try to fix this... http://lists.open-bio.org/pipermail/biopython-dev/2013-June/010639.html - Thanks!

ADD REPLY • link 11.8 years ago by Peter 6.0k

0

Entering edit mode

Indeed

   >>id(upstream.alphabet)
   4414891728
   >>id(pssm.alphabet)
   4428032464

I'm not sure I use alphabet in the constructor properly:

   instances = []
           for line in alignment:
               instances.append(Seq(line, IUPAC.unambiguous_dna))
           m = motifs.create(instances)
           pwm = m.counts.normalize(pseudocounts=PSEUDOCOUNT)
           background = {'A':0.18,'C':0.32,'G':0.32,'T':0.18}
           pssm = pwm.log_odds(background)

What I also tried to do and it didn't work:

   upstreamList = [Seq(u, pssm.alphabet) for u in upstreamList]

previously it was

   upstreamList = [Seq(u, IUPAC.unambiguous_dna) for u in upstreamList]

The likely reason is that I create pssm in one run and cPickle it and then unpickle when I need to apply. It works fine if I create pssm in the same run as I apply, but constructing matrix takes time, so I'd like to figure out how to pickle it if it's possible at all.

ADD REPLY • link 11.8 years ago by Jelena_bioinf ▴ 40

0

Entering edit mode

Pickling and unpickling will result in a new instance of the alphabet class, and you'll hit the bug again. You'll have to wait for the next Biopython release where this is fixed, or apply the fix locally. See the mailing list discussion linked to above.

ADD REPLY • link 11.8 years ago by Peter 6.0k