Let's say I have a large database of cDNA sequences in FASTA format, and I would like to identify a motif in the corresponding amino acid sequences. Let's say I need to find something like:
CxxCxxxxxxxxxxxxHxxx$
where $ will be H or C.
I imagine one would start by parsing the FASTA files, find the sites between which these sub-sequences have to lie, then translate the corresponding coding DNA sequence, ending up with an amino acid sequence that contains a sequence of this form. If I had a specific amino acid sequence in mind, I could easily find it using the .find() method in Biopython. However, I'm not sure how to search for a pattern like the one above, which stands for a whole set of motifs rather than one fixed sequence.
Thanks!
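One way to search for a pattern of this kind, rather than a fixed string, is a regular expression. The sketch below is only an illustration under stated assumptions: it assumes x means "any single residue" and $ means "H or C", and the names MOTIF, motif_to_regex and find_motif are hypothetical helpers, not part of Biopython. It uses only Python's standard re module and works on a plain string, so a translated sequence would first be converted with str().

```python
import re

# Shorthand copied from the question. Assumption: 'x' is any single residue
# and '$' is either H or C; every other letter must match literally.
MOTIF = "CxxCxxxxxxxxxxxxHxxx$"


def motif_to_regex(motif):
    """Convert the shorthand motif into a compiled regular expression."""
    parts = []
    for symbol in motif:
        if symbol == "x":
            parts.append("[A-Z]")      # any residue letter, but not a '*' stop
        elif symbol == "$":
            parts.append("[HC]")       # H or C
        else:
            parts.append(re.escape(symbol))
    # Wrapping the pattern in a lookahead lets overlapping hits be reported.
    return re.compile("(?=(" + "".join(parts) + "))")


def find_motif(protein, motif=MOTIF):
    """Return a list of (start, matched_text) pairs, overlaps included."""
    regex = motif_to_regex(motif)
    return [(m.start(), m.group(1)) for m in regex.finditer(protein.upper())]


if __name__ == "__main__":
    # Made-up protein string, for illustration only.
    example = "MK" + "CAAC" + "A" * 12 + "HAAA" + "C" + "GG"
    for start, hit in find_motif(example):
        print(start, hit)
```

Building the expression from the shorthand string avoids counting the x positions by hand, and the start offsets are 0-based positions in the protein string.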
The question needs clarification in quite a few places. To start with, what type of sequences does your database contain? "FASTA" is only the format; it tells us nothing about the type of the underlying sequences.
Sorry, these are cDNA sequences that are parsed from a set of FASTA files. I then translate the sections between the KpnI and BamHI sites. In the resulting amino acid sequence, I need to find a sub-sequence that matches the pattern given in the question, where $ will be H or C.
I hope that is clearer.
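The steps described here (read the FASTA records, take the section between the KpnI and BamHI sites, translate it) could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation. It assumes that the first KpnI site (GGTACC) and the first BamHI site (GGATCC) after it delimit the insert, that the coding sequence starts in frame immediately after the KpnI recognition sequence, and that the standard genetic code applies. The helper names are hypothetical, and the small hand-written parser and codon table could be replaced by Biopython's SeqIO.parse and Seq.translate.

```python
from itertools import product

KPNI = "GGTACC"    # KpnI recognition sequence
BAMHI = "GGATCC"   # BamHI recognition sequence

# Standard genetic code, listed in the usual TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def insert_between_sites(dna):
    """Return the DNA between the first KpnI site and the next BamHI site.

    Returns None if either site is missing. Assumption: the recognition
    sequences themselves are not part of the coding region.
    """
    dna = dna.upper()
    start = dna.find(KPNI)
    if start == -1:
        return None
    start += len(KPNI)
    end = dna.find(BAMHI, start)
    if end == -1:
        return None
    return dna[start:end]


def translate(dna):
    """Translate in frame 0; unknown codons become 'X', a trailing partial codon is dropped."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


if __name__ == "__main__":
    # Made-up record, for illustration only.
    records = [">clone1", "aaGGTACCATGTGC", "CATGGATCCtt"]
    for name, seq in read_fasta(records):
        insert = insert_between_sites(seq)
        if insert is not None:
            print(name, translate(insert))
```

The resulting protein string is what the motif would then be searched in. If the reading frame is not fixed relative to the KpnI site, all three frames of the insert would need to be translated and searched.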