I have a FASTQ file with a lot of reads. I expect sets of identical sequences; in fact, I will be counting the occurrences of each unique sequence.
I am using Python and Biopython, and am trying to optimize this problem for a large file. I was wondering if there are any suggestions on how to do this?
What I have so far uses a fast Biopython iterator and MD5 hashes:
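import hashlib
from Bio.SeqIO.QualityIO import FastqGeneralIterator

list_digest = []  # digests of the sequences seen so far
for title, seq, quals in FastqGeneralIterator(file_read_handle):
    # MD5 digest of the read sequence (digest is a method and must be called)
    seq_digest = hashlib.md5(seq.encode()).digest()
    if seq_digest in list_digest:
        ...
    else:
        list_digest.append(seq_digest)
        ...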
Is there any other technique for searching for exact sequence matches which might be more efficient?
Thanks very much.
You may want to check for the reverse complement as well.
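One possible way to fold the reverse complement into the counting is to reduce every read to a "canonical" form (here, the lexicographically smaller of the read and its reverse complement) and count those in a dictionary-like structure. This is only a rough sketch: the helper names `reverse_complement`, `canonical` and `count_canonical` are made up for illustration, the complement table assumes plain A/C/G/T/N bases, and the input is taken to be any iterable of sequence strings (for example the `seq` values from `FastqGeneralIterator`).

```python
from collections import Counter

# Complement table for A/C/G/T/N only (an assumption; extend it for IUPAC codes).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(seq):
    """Return the smaller of seq and its reverse complement,
    so both strands of the same sequence map to one key."""
    return min(seq, reverse_complement(seq))


def count_canonical(seqs):
    """Count occurrences of each sequence, treating a read and its
    reverse complement as the same sequence."""
    return Counter(canonical(s) for s in seqs)
```

For example, `count_canonical(["ACGT", "AAAC", "GTTT", "AAAC"])` counts `"GTTT"` together with `"AAAC"`, since one is the reverse complement of the other. Because `Counter` is hash-based, each lookup takes roughly constant time, whereas a membership test on a list scans the whole list.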