Hello!
I have a quick question regarding bacterial start codons. I have downloaded the full set of plasmid sequences from NCBI here: ftp://ftp.ncbi.nlm.nih.gov/genomes/Plasmids
I've made BLAST databases of the full nucleotide sequences of the plasmids (ffn) as well as of the resulting proteins (faa) of each plasmid. I run BLASTp with various input sequences against the protein database, then use the resulting proteins as tBLASTn queries against the full plasmid nucleotide sequences to extract all nucleotide variants that encode these proteins with 100% identity. This works quite well, but in some cases the returned sequence does not begin with a typical start codon. Normally you'd expect to see ATG, GTG, or TTG as the start codon. But here I have a sequence whose first codon is AAA, coding for lysine, which brings my identity below 100%.
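To make the check concrete, here is a minimal sketch of how the extracted sequences could be screened for an unexpected first codon. The function names and the dict-of-sequences input are hypothetical, not part of my actual pipeline, and the set of "typical" starts is just the three codons mentioned above.

```python
# Start codons treated as typical in this post (an assumption for this sketch,
# not an exhaustive list of every start codon used by bacteria).
TYPICAL_STARTS = {"ATG", "GTG", "TTG"}


def first_codon(cds):
    """Return the first three bases of a nucleotide sequence, upper-cased."""
    return cds[:3].upper()


def has_typical_start(cds):
    """True if the sequence begins with ATG, GTG, or TTG."""
    return first_codon(cds) in TYPICAL_STARTS


def flag_atypical(records):
    """Map each sequence name to its first codon when that codon is atypical.

    `records` is a hypothetical dict of {name: nucleotide sequence}.
    """
    return {
        name: first_codon(seq)
        for name, seq in records.items()
        if not has_typical_start(seq)
    }
```

Running `flag_atypical` over the extracted hits would list only the sequences worth a closer look, such as the one starting with AAA.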
My guess is that automatic annotation sets the first residue to methionine whenever the rest of the sequence matches the reference at 100%, but I have no idea whether that is actually the case.
Could someone with a bit of experience chime in and help me out?
Thank you in advance!
Edit:
It turns out there was a minor bug in my code that disabled the reverse-complement function, so hits on the reverse strand came back with their ends flipped. My apologies.
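For anyone who runs into the same thing, this is roughly what the strand handling should look like. It is only a sketch: `extract_hit` and its arguments are hypothetical names, and it assumes 1-based inclusive subject coordinates where a start larger than the end marks a reverse-strand hit, which is how I read the tBLASTn tabular output.

```python
# Translation table for complementing DNA bases (case is preserved).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_hit(plasmid, start, end):
    """Extract a hit from a plasmid sequence in coding orientation.

    Assumes 1-based inclusive coordinates; start > end is taken to mean
    the hit lies on the reverse strand and must be reverse-complemented.
    """
    lo, hi = min(start, end), max(start, end)
    fragment = plasmid[lo - 1:hi]
    return reverse_complement(fragment) if start > end else fragment
```

Without the `reverse_complement` call, a reverse-strand hit is read from the wrong end, so the first three bases are not the start codon at all.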