Question

How to check if my sequence is DNA or Protein in BioPython?

0

Entering edit mode

13 months ago

O.rka ▴ 740

I'm trying to develop a script that checks whether or not the input fasta is the correct molecule type but I want it to account for ambiguous characters.

Here's what I have for reading in fasta:

from Bio.SeqIO.FastaIO import SimpleFastaParser

with open(filepath, "r") as f:
    for header, seq in SimpleFastaParser(f):
        #func(seq) --> {dna or protein}

I know that Alphabet was removed so I'm a little confused on how to check this with BioPython. I could always look up the IUPAC alphabet for DNA and proteins then check if the sequence set is a subset but I feel like there has to be a better way.

genomics biopython fasta • 1.0k views

ADD COMMENT • link updated 13 months ago by Joe 21k • written 13 months ago by O.rka ▴ 740

1

Entering edit mode

No need to look it up, it is already part of BioPython, see line 7 of BioPython's IUPACData.py file:

"""Information about the IUPAC alphabets."""

ADD REPLY • link 13 months ago by Wayne ★ 2.1k

1

Entering edit mode

I don't think any major sophistication is needed here. Amino acid frequencies for ACGT are well below 10% for proteins:

A 0.074
C 0.025
G 0.074
T 0.051

If for any two of them one counts > 10% frequency, it has to be a nucleic acid.

Similarly, C frequency in proteins is 2.5% which pretty much guarantees that dipeptides AC, GC and TC are going to be < 0.5%. On the other hand, dinucleotides AC, GC and TC should all be > 3-5% in nucleic acids. If any two of them are > 5% it has to be a nucleic acid.

These common-sense rules may fail on short individual sequences of unusual composition, but I think they should work on longer sequences or a group of 5+ sequences.

ADD REPLY • link 13 months ago by Mensur Dlakic ★ 28k

0

Entering edit mode

I think possibly even more simply (though no idea which is more computationally efficient), one could simply check that the set() of characters = [A, T, C, G, X] (or whatever ambiguous character is being used). If the sets don't match it has to be protein (or vice versa).

ADD REPLY • link 13 months ago by Joe 21k