I have multiplexed data from a RADseq experiment (digestion with ApeKI). Each sample has a tag. For each read, I want to check if the tag occurs next to what remains of the ApeKI motif after the cut. For instance, individual 1 has tag ATAT and ApeKI cuts at G/CWGC (where W is A or T in the IUPAC code). I therefore should see something like "ATATCAGC..." at the beginning of my read. What would be the best way to do this with Biopython?
My attempt 1: From the cmd-line, I ask the user to provide the following information "ApeKI=G/CWGC" and then I parse it internally. However, as testing for sequence equality is a difficult matter (it's contexte dependent), Biopython doesn't provide much support for it:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> x = Seq("CWGC", IUPAC.ambiguous_dna)
>>> y = Seq("CAGC", IUPAC.ambiguous_dna)
>>> x == y
...<warning>...
False
>>> str(x) == str(y)
False
My attempt 2: another way is to create a motif:
>>> from Bio import motifs
>>> m = motifs.create([Seq("CAGC"), Seq("CTGC")])
>>> m.degenerate_consensus
Seq('CWGC', IUPACAmbiguousDNA())
>>> "CAGC" in [str(i) for i in m.instances]
True
However, if the user provides me with "ApeKI=G/CWGC", I need to know that W is A or T in the IUPAC code. But it seems that this information is not in Biopython. I can only get a list of letters:
>>> IUPAC.ambiguous_dna.letters
'GATCRYWSMKHBVDN'
Would it be possible to add a dictionary as attribute, so that IUPAC.ambiguous_dna.dict["W"] would return ["A", "T"] for instance?
(I edited the first code block to reflect Peter's comment.)
Could you paste the actual code and output? Your imports are inconsistent with the code. Also didn't you get a warning about comparison? Perhaps your Biopython is very old?
@Peter sorry, you're right, I updated my code block