Changing output format for hamming distance of FASTA sequences (using Biopython)
1
0
Entering edit mode
4.3 years ago
nameuser ▴ 30

Hi there, I currently have the code stated below. The output I currently have it set to write out is not what I'm looking for. However, this has been the only way I have successfully run the script. I would like to remove the sequences from a fasta file that have a hamming distance of 50 or more. I tried to use an additional for loop and write the output file with SeqIO but the way 'seqs' is currently written, it was not defined. I created an empty list and tried to append the results.

from Bio import SeqIO
REFERENCE_SEQ = 'ATGC.....'

def hamming_distance(seq):
    result = sum(b1 != b2 for b1, b2 in zip(REFERENCE_SEQ, seq))
    return result


 if __name__ == '__main__':
     filename = open('test.fasta', 'r')
    seqs = [record for record in SeqIO.parse(filename, 'fasta')]

     counted = list(map(lambda x: (x, hamming_distance(x)), seqs))
     results = list(filter(lambda x: x[1] < 50, counted))
     print(*results, sep='\n', file=open("output.fasta", "a"))

The current output is as follows:

(SeqRecord(seq=Seq('TCGCCCTTGCTCACCATGCTAGGTTCTCCTTCTTAATCTAGAATGCAATATACT...CGG',SingleLetterAlphabet()), id='rev_comp_consensus', name='rev_comp_consensus', description='rev_comp_consensus', dbxrefs=[]), 7)
(SeqRecord(seq=Seq('TCGCCCTTGCTCACCATGCTAGGTTCTCCTTCTTAATCTAGAATGTAATATACT...CCG',   SingleLetterAlphabet()), id='1', name='1', description='1', dbxrefs=[]), 7)
(SeqRecord(seq=Seq('TCGCCCTTGCTCACCATGCTAGGTTCTCCTTCTTAATCTAGAATGCAATATACT...GGT',   SingleLetterAlphabet()), id='2_also7', name='2_also7', description='2_also7', dbxrefs=[]), 7)

Etc.

Please let me know if you have any ideas to

1. remove the sequences with a specified hamming distance

2. write the remaining sequences to a fasta file

Thank you so much.

fasta biopython python hamming_distance sequence • 2.2k views
ADD COMMENT
0
Entering edit mode

Are the three final lines of code indented properly?

ADD REPLY
0
Entering edit mode

oops, must've missed that when I wrote the question. I just fixed their indentation! Thanks.

ADD REPLY
0
Entering edit mode

Does this work?

with open("out.fasta", "w") as f:
    for record in SeqIO.parse(open("test.fasta"):
        dh = hamming(str(record.seq))
        if dh < 50:
            print(">{id}\n{seq}".format(id=record.id, seq=record.seq), file=f)
ADD REPLY
0
Entering edit mode

The logic definitely makes sense to me, I adjusted it a little for errors but the print line was extremely helpful! I'm getting a blank output right now though, when I did a matrix for the sequences, there were plenty (500,000 sequences per file) with less than 50. Additionally, dh is being defined.

Thanks again!

ADD REPLY
0
Entering edit mode

Could you post the corrections you've made, please?

ADD REPLY
0
Entering edit mode
 if __name__ == '__main__':
    with open(filename + "out.fasta", "w") as f:
        for record in (SeqIO.parse(filename, 'fasta')):
          if hamming_distance(str(record.seq)) < 50 :
                print(">{id}\n{seq}".format(id=record.id, seq=record.seq), file=f)
ADD REPLY
3
Entering edit mode
4.3 years ago
Joe 21k

I assume this is a follow-on from this thread? C: Counting mismatched characters from sequences from a fasta file in comparison to

If so, I gave you workable code, you just need to modify the output format and/or use SeqIO.write() as I suggested at the time.

from Bio import SeqIO
import sys

def hamming_distance(s1, s2):
    """Return the Hamming distance between equal-length sequences"""
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

similar_seqs = []
for i, record in enumerate(SeqIO.parse(sys.argv[1], 'fasta')):
    if i == 0:
        reference = record
    if hamming_distance(reference.seq, record.seq) < 50:
        similar_seqs.append(record)

with open("similar_seqs.fasta", "w") as output_handle:
    SeqIO.write(similar_seqs, output_handle, "fasta")

Input:

Note, I've added a modified entry at the end with 3 residues altered

>Zebrafish
ESLLRFGLRSDLDFRLSLNGKEDLLDTGQSLSSCGVVSGDLISVILPASSQTSSAAHQTH
TDQQSSQECVDLQQDCMDQQQQQEQECVCAAAPPLLCCEAEDGLLPLALERLLDSSTCRS
PSDCLMLALHLLLLETGFIPQGGAVSSGEMPIGWQAAGVFRLQYVHPLLENSLVSVVAVP
MGQTLVINAVLKMETSLENSRKLLLKPDEYVTAWTGGSSGVVYRDLRRLSRLVRDQLVYP
LMATARQALGLPLLFGLPVLPPELLLRLLRLLDVRSLVSLSAVCRHLNTATHDASLWRHL
LHRDFRVSFPAHRDTDWRELYKQKYRQRAARRGRHWFYPPPISPLIPFPSSPALYPPGII
GDYDQMPILPRPRFHPIGPLPGMSAPV
..... Tuncated for character count....
>Anole
QALLSWGYSSETKFEITLNNKDSLVGDQDTLASFGIVSGDLICLILEDDASSPSSSLPSS
QSNHHSGPSQEFTSEGGPDDLDLQEATGSFPSEPMLCCEATDGQVPHSLQTLYHSAECTN
ANDALIVSIHLIMMETGYVPQGTEAKASSMPENWRNKGVYKLLYTHPLCENGFAVLTCVP
LGNLIVVNAMLKITSDIKSVKRLQLLPTSFICFQDSANVVGVYKDLQKLSRLFKDRLVYP
LLAAARQALNLPDVFGLVVLPLELKLRIFRLLDFRSLLSLSAVCHDLYAASNDQLLWRFI
YLRDFRDPVARSRDTDWKELYKKKMKQKDALRWRHMMFLPPLHPNPLYPNPFPLYPPMII
GEMDERPSLFPSHLDPFGSFQNPNPTL
>Human
QSLLTWGYSSNTRFTITLNYKDPLTGDEETLASYGIVSGDLICLILQDDIIPSSTSEHSS
LQNNSNGPSQNFEAESIQDNAHMAEGTGFYPSEPMLCSESVEGQVPHSLETLYQSADCSD
ANDALIVLIHLLMLESGYIPQGTEAKALSMPEKWKLSGVYKLQYMHPLCEGSSATLTCVP
LGNLIVVNATLKINNEIRSVKRLQLLPESFICKKLGENVANIYKDLQKLSRLFKDQLVYP
LLAFTRQALNLPDVFGLVVLPLELKLRIFRLLDVRSVLSLSAVCRDLFTASNDPLLWRFL
YLRDFRDNTVRVQDTDWKELYRKRHIQRESPKGRFVMLLPPFYPNPLHPRPFPRLPPGII
GEYDQRPSLIPPRFDPVGPLPGPNPIL
>Zebrafish_modified
ESLLRFGLRSDLDFRLSLSGKEDLLDTGQSLSSCGVVSGDLISVILPASSQTSSAAHQTH
TDQQSSQECVDLQQDCMDQQQQQEQECVCAAAPPLLCCEAEDGLLPLALERLLDSSTCRS
PSDCLMLALHLLLLETGFIPQGGAVSSGEMPIGWQAAGVFRLQYVHPLLENSLVSVVAVP
MGQTLVINAVLKMETSLANSRKLLLKPDEYVTAWTGGSSGVVYRDLRRLSRLVRDQLVYP
LMATARQALGLPLLFGLPVLPPELLLRLLRLLDVRSLVSLSAVCRHLNTATHDASLWRHL
LHRDFRVSFPAHRDTDWRELYKQKYRQRAARRGRHWFYPPPISPLIPFPSSPALYPPGII
GDYDQMPILPRPRFHPIGPLPGMSAPL

Output:

$ cat similar_seqs.fa

>Zebrafish
ESLLRFGLRSDLDFRLSLNGKEDLLDTGQSLSSCGVVSGDLISVILPASSQTSSAAHQTH
TDQQSSQECVDLQQDCMDQQQQQEQECVCAAAPPLLCCEAEDGLLPLALERLLDSSTCRS
PSDCLMLALHLLLLETGFIPQGGAVSSGEMPIGWQAAGVFRLQYVHPLLENSLVSVVAVP
MGQTLVINAVLKMETSLENSRKLLLKPDEYVTAWTGGSSGVVYRDLRRLSRLVRDQLVYP
LMATARQALGLPLLFGLPVLPPELLLRLLRLLDVRSLVSLSAVCRHLNTATHDASLWRHL
LHRDFRVSFPAHRDTDWRELYKQKYRQRAARRGRHWFYPPPISPLIPFPSSPALYPPGII
GDYDQMPILPRPRFHPIGPLPGMSAPV
>Zebrafish_modified
ESLLRFGLRSDLDFRLSLSGKEDLLDTGQSLSSCGVVSGDLISVILPASSQTSSAAHQTH
TDQQSSQECVDLQQDCMDQQQQQEQECVCAAAPPLLCCEAEDGLLPLALERLLDSSTCRS
PSDCLMLALHLLLLETGFIPQGGAVSSGEMPIGWQAAGVFRLQYVHPLLENSLVSVVAVP
MGQTLVINAVLKMETSLANSRKLLLKPDEYVTAWTGGSSGVVYRDLRRLSRLVRDQLVYP
LMATARQALGLPLLFGLPVLPPELLLRLLRLLDVRSLVSLSAVCRHLNTATHDASLWRHL
LHRDFRVSFPAHRDTDWRELYKQKYRQRAARRGRHWFYPPPISPLIPFPSSPALYPPGII
GDYDQMPILPRPRFHPIGPLPGMSAPL

The reference sequence will always be included at present, but its easy to modify this to remove this if necessary.

ADD COMMENT

Login before adding your answer.

Traffic: 1810 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6