Hi there, I currently have the code stated below. The output I currently have it set to write out is not what I'm looking for. However, this has been the only way I have successfully run the script. I would like to remove the sequences from a fasta file that have a hamming distance of 50 or more. I tried to use an additional for loop and write the output file with SeqIO but the way 'seqs' is currently written, it was not defined. I created an empty list and tried to append the results.
from Bio import SeqIO
REFERENCE_SEQ = 'ATGC.....'
def hamming_distance(seq):
result = sum(b1 != b2 for b1, b2 in zip(REFERENCE_SEQ, seq))
return result
if __name__ == '__main__':
filename = open('test.fasta', 'r')
seqs = [record for record in SeqIO.parse(filename, 'fasta')]
counted = list(map(lambda x: (x, hamming_distance(x)), seqs))
results = list(filter(lambda x: x[1] < 50, counted))
print(*results, sep='\n', file=open("output.fasta", "a"))
The current output is as follows:
(SeqRecord(seq=Seq('TCGCCCTTGCTCACCATGCTAGGTTCTCCTTCTTAATCTAGAATGCAATATACT...CGG',SingleLetterAlphabet()), id='rev_comp_consensus', name='rev_comp_consensus', description='rev_comp_consensus', dbxrefs=[]), 7)
(SeqRecord(seq=Seq('TCGCCCTTGCTCACCATGCTAGGTTCTCCTTCTTAATCTAGAATGTAATATACT...CCG', SingleLetterAlphabet()), id='1', name='1', description='1', dbxrefs=[]), 7)
(SeqRecord(seq=Seq('TCGCCCTTGCTCACCATGCTAGGTTCTCCTTCTTAATCTAGAATGCAATATACT...GGT', SingleLetterAlphabet()), id='2_also7', name='2_also7', description='2_also7', dbxrefs=[]), 7)
Etc.
Please let me know if you have any ideas to
1. remove the sequences with a specified hamming distance
2. write the remaining sequences to a fasta file
Thank you so much.
Are the three final lines of code indented properly?
oops, must've missed that when I wrote the question. I just fixed their indentation! Thanks.
Does this work?
The logic definitely makes sense to me, I adjusted it a little for errors but the print line was extremely helpful! I'm getting a blank output right now though, when I did a matrix for the sequences, there were plenty (500,000 sequences per file) with less than 50. Additionally, dh is being defined.
Thanks again!
Could you post the corrections you've made, please?