If you ask a biologist, there should be only one GFP sequence (at least in the research lab setting). However, when I try to find the GFP sequence online, there seem to be multiple completely different options. Some examples from different sources:
- http://www.ncbi.nlm.nih.gov/nuccore/155660?report=fasta
- http://www.snapgene.com/resources/plasmid_files/fluorescent_protein_genes_and_plasmids/GFP/
- https://www.addgene.org/11150/sequences/
- http://www.algosome.com/resources/common-sequences.html
Sometimes they include additional non-coding regions, but I am taking that into account. Why are the sequences so different right from the start codon? It's more than just a few mutations.
Yes, but it looks like they only mutated a few amino acids, whereas I see that most of the sequence is different.
The mutations were made in the codon-optimized background, so if you compare the original GFP sequence to EGFP, you should find quite a few differences. There's also a mutation that renders the protein monomeric (the original is believed to form dimers), giving the mEGFP protein, and there are mutations that make the protein less stable, faster folding, and so on. And we should not forget that there are several sources (i.e. species) of GFP, e.g. Aequorea victoria and Renilla reniformis.
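To illustrate the codon-optimization point, here is a minimal Python sketch showing how two DNA sequences can differ at many nucleotide positions and still encode the same peptide. The two short sequences are toy examples made up for illustration, not the deposited GFP or EGFP records, and the helper names `translate` and `identity` are hypothetical. It uses only the standard genetic code and assumes the sequences are already in frame and of equal length.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate an in-frame DNA string, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length (align them first)")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Toy sequences (assumption: illustrative only, same peptide, different codons).
seq_a = "ATGAGTAAAGGAGAAGAACTT"
seq_b = "ATGAGCAAGGGCGAGGAGCTG"

dna_identity = identity(seq_a, seq_b)
protein_identity = identity(translate(seq_a), translate(seq_b))

print(f"DNA identity:     {dna_identity:.2f}")
print(f"Protein identity: {protein_identity:.2f}")
```

For real records, the sequences usually differ in length and may carry flanking non-coding regions, so they would need to be trimmed to the coding region and aligned before computing identity. Comparing the translated proteins rather than the raw nucleotides is usually the quicker way to see whether two "GFP" entries are variants of the same protein or come from different species.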
That's helpful. I am not sure why this information is not more readily available considering how many people use GFP.
In my opinion, this is all due to the sloppiness with which information is recorded in biology.
Ouch! That's not entirely fair. In my experience, proper genetic engineers keep meticulous records with strict naming conventions.
It's when they pass these tools on to the "molecular" people that everything falls apart.
Let's say some biologists are sloppy. :)