So this interested me and I spent some time hacking at it today (emphasis on hack
).
NB - As Biopython doesn't implement this behaviour, there is probably a very good reason for it. That said, I think hashing the sequences is legit, and the odds of a hash collision are very low (as I've used the whole data structure representation).
Basically, I elected to subclass the SeqRecord
, and extend it with the hash as it's built, and make that hash the basis of the equality check. In essence this means 2 things: firstly, you need to define the hash, and the more unique info it can be built from, the less likely it is you'll get a collision; secondly, you need to (re)implement what it means for two objects to be the same/equal (in python this is the __eq__
method).
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
import pprint
import sys
import hashlib
import random
class MD5Record(SeqRecord):
def __init__(self, seqrecord):
self._seq = seqrecord._seq
self.id = seqrecord.id
self.name = seqrecord.name
self.description = seqrecord.description
self.dbxrefs = seqrecord.dbxrefs
self.md5 = hashlib.md5(pprint.pformat(seqrecord).encode('utf-8')).hexdigest()
def __repr__(self):
return "{0}(seq={1!r}, id={2!r}, name={3!r}, description={4!r}, dbxrefs={5!r}, md5={6!r})".format(
self.__class__.__name__,
self.seq, self.id, self.name,
self.description, self.dbxrefs,
self.md5)
def __eq__(self, other):
return self.md5 == other.md5
def __ne__(self, other):
return self.md5 != other.md5
recs = list(SeqIO.parse(sys.argv[1], 'fasta'))
hashed_recs = []
for rec in recs:
pprint.pprint(rec)
hashed = MD5Record(rec)
pprint.pprint(hashed)
hashed_recs.append(hashed)
try:
recs[0] == recs[0] # True in theory
recs[0] == recs[1] # False, in theory
except NotImplementedError:
print("BioPython won't let you compare these")
print(hashed_recs[0] == hashed_recs[0]) # True
print(hashed_recs[0] == hashed_recs[1]) # False
print(hashed_recs[0] != hashed_recs[1]) # True
# Pick an entry to index at random
random = random.choice(hashed_recs)
index = hashed_recs.index(random)
print(index)
Notice I've also redefined the __repr__
to include the md5 for a given record but this is optional. In order for inequality checks to work, you also have to define __ne__
as above.
The key to being able to index
the objects in a list, is the __eq__
method, which now works, as equality for the purposes of indexing is defined as the hashes being the same.
Given this input,
$ cat Genes1.fasta
>RandomSequence_ruL79349EujgaR6sv3fehXYxjIUtHQ5C
TGGGAATTCTGCGTAGTCTCGTGACCGAACGCCGCCTTGAAGTGCGCGCCGCTGTACGATGACCGTCAATATTGATCCGT
>RandomSequence_iEjYsbeuYSY84hj8hSt7C0ndLJfPzwBh
AAGTGCGGTGATACCGACACCCAGCGCGCTTTGTACTCATGTTCATGACAAGGTTGTTTAGATGAAATCCCATGGACTAC
>RandomSequence_ewuch2NIsLWfmo5ir2Tz5if26iKSLHU2
CAAACTGTGTCACTTGGTTTTGGGGAAAGCCGTCAAATCCCTGTATTCGCTATCCATCAAATGAGACTGGACTACCCCAC
This output is obtained (truncated the Record displays for character count):
SeqRecord(seq=Seq('TGGGAATTCTGCGTAG...CGT', SingleLetterAlphabet()), id='RandomSequence_ruL79349EujgaR6sv3fehXYxjIUtHQ5C', ... , dbxrefs=[])
MD5Record(seq=Seq('TGGGAATTCTGCGTAG...CGT', SingleLetterAlphabet()), id='RandomSequence_ruL79349EujgaR6sv3fehXYxjIUtHQ5C', ... , dbxrefs=[], md5='e6542eb6cd69eae9993e68c4bd5696ae')
<<<truncated>>>
BioPython won't let you compare these
True
False
True
1
(Note the 1
will change from run to run, as its a random choice.)
As the error suggests, Biopython doesn't allow you to directly compare two SeqRecords as there's not a meaningful way to compare them. You need to choose an attribute (e.g. its sequence or its ID) to compare between entries.
You could get a bit hacky with it and subclass the
SeqRecord
class, and then define.__eq__
directly yourself, but otherwise unfortunately I think you're going to need to approach this in a different way.