How do I verify that a sequence only contains letters from a given alphabet: DNA, RNA, protein?
How do I verify that a sequence only contains letters from a given alphabet: DNA, RNA, protein?
Here is an efficient function written in Python:
dna = set("ATGC")
def validate(seq, alphabet=dna):
"Checks that a sequence only contains values from an alphabet"
leftover = set(seq.upper()) - alphabet
return not leftover
# typical usage below
# this will print True
print validate('AAAATGCCG')
# this will print False
print validate('AAANTGCCG')
# using it with other alphabets
prot = set('ACDEFGHIKLMNPQRSTVWY')
print validate("mglsdgewql", alphabet=prot)
Use regular expressions. In python, you can do:
import re
def validate(seq, alphabet='dna'):
"""
Check that a sequence only contains values from an alphabet
>>> seq1 = 'acgatGAGGCATTtagcgatgcgatc' # True for dna and protein
>>> seq2 = 'FGFAGGAGFGAFFF' # False for dna, True for protein
>>> seq3 = 'acacac ' # should return False (a space is not a nucleotide)
>>> validate(seq1, 'dna')
True
>>> validate(seq2, 'dna')
False
>>> validate(seq2, 'protein')
True
>>> validate(seq3, 'dna')
False
"""
alphabets = {'dna': re.compile('^[acgtn]*$', re.I),
'protein': re.compile('^[acdefghiklmnpqrstvwy]*$', re.I)}
if alphabets[alphabet].search(seq) is not None:
return True
else:
return False
if __name__ == "__main__":
import doctest
doctest.testmod(verbose=True)
In BioPerl SeqIO will guess the alphabet of a sequence. There are often warnings that these guesses can be imperfect! This is an issue with many of the home-brew answers here, they don't use the full IUPAC code. BioPerl is aware of the full IUPAC alphabet, and I would guess BioPython etc are too.
If you have a sequence containing only ACGT this can be correctly identified as either amino acids or nucleotides. Both are correct. A solution will therefore have to estimate a likelihood, perhaps given an observed frequency of amino acid residues in proteins of your data set?. This can be simple sometimes- if it contains an E residue it can't be a valid nucleotide sequence. Other times you will really need to specify biological estimates of likelihood of encountering certain residues in a sequence of a given length. I don't know of any software that does this, hence all are a bit error prone, at least with short or biased-composition sequences.
Take home message; its not quite so simple as sometimes assumed. If you want a good basic guess, one of the Bio* projects wil probably be as good as anything.
From a theoretical point of view, to make this quickly, you can base your code on the specificity of each alphabet. (Here I suppose you just want to know if a sequence is protein, DNA or RNA, based on Giovanni's comment)
RNA contains 'U', and not the others.
Proteins have much more letters than DNA and RNA.
So here is a piece of code in C++
#include <iostream>
#include <string>
/* suppose you give the raw sequence in capital *
* letters in argument of your program */
using namespace std;
int main(int argc, char **argv){
string dna_alphabet = "ATCG";
string rna_alphabet = "AUCG";
string prot_alphabet = "ATCGRNDEQHILKMFPSWYV";
char * sequence = argv[1];
int i = 0;
while (sequence[i] != '\0'){
if (sequence[i] == 'U') {
cout << "This is a RNA sequence\n";
return 0;
}
if (prot_alphabet.find(sequence[i]) < prot_alphabet.size() && dna_alphabet.find(sequence[i]) > dna_alphabet.size()) {
cout << "This is a Protein sequence\n";
return 0;
}
++i;
}
cout << "This is a DNA sequence\n";
return 0;
}
@Farhat: even a human cannot answer this question... so a computer ? who knows ;)
@Pierre: For large alphabet, using lower_bound suppose the alphabet is sorted. And if it is sorted, I think string::find have the same complexity as lower_bound (not really sure but it is logical).
In C, see the standard function strspn: http://www.cplusplus.com/reference/clibrary/cstring/strspn/
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
the idea of this discussion was to collect all the possible ways and tools to determine if a sequence is DNA or protein
If you give the programming language you want, we can answer more precisely ;)
@bilouweb: just write the solution in the programming language you prefer, other people will implement it in the other languages.