I have 20 bp sequences stored in a list called known. I have a fastq file called A.fastq, from which I cut out the relevant 20 bp region and compare it against the known list. I would like to use Hamming distance as the metric for choosing the correct match, i.e. for every fastq read, I want to find the string in the known list that has the smallest Hamming distance to it, and update a counter good_count only if that smallest distance is <= 3. This is my code, but it does not account for the Hamming distance criterion. Can you suggest what I should do?
import Bio
from Bio import SeqIO
known = set()
for s in Bio.SeqIO.parse("lib.fa","fasta"):
    known.add(str(s.seq[10:20]))
good_count = 0
for r in Bio.SeqIO.parse("read1.fastq","fastq"):
    if str(r.seq[10:20]) in known:
        good_count += 1
print(good_count)
Specifically, I want help with the "if" statement, so that each read is compared string by string against the list "known".
Look into the Python distance package.
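If you would rather not add a dependency, here is a minimal sketch in plain Python of the comparison described in the question. The helper names hamming, closest and count_good are my own, not part of Biopython or any package, and the toy sequences stand in for the slices you take from lib.fa and read1.fastq. It assumes every slice has the same length and that known is not empty; a truncated read would raise a ValueError.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def closest(query, known):
    """Return (distance, sequence) for the known sequence nearest to query."""
    # Ties on distance are broken by string order, which is arbitrary here.
    return min((hamming(query, k), k) for k in known)


def count_good(reads, known, max_dist=3):
    """Count reads whose nearest known sequence is within max_dist mismatches."""
    good_count = 0
    for read in reads:
        dist, _ = closest(read, known)
        if dist <= max_dist:
            good_count += 1
    return good_count


# Toy data (hypothetical), in place of the real library and read slices.
known = {"ACGTACGTAC", "TTTTTTTTTT"}
reads = ["ACGTACGTAC", "ACGTACGAAA", "GGGGGGGGGG"]
print(count_good(reads, known))  # prints 2
```

In your loop, the "if" line would then become something like: if closest(str(r.seq[10:20]), known)[0] <= 3. Note that this compares every read against every known sequence, so it can be slow for a large library.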