Hello everyone. For my summer project (I am a high schooler), I investigated AlphaFold and wrote a mostly-review paper on it, titled "AlphaFold: A Beginner’s Guide And In-depth Exploration Of The Revolutionary AI Tool And Its Inner Workings". I wanted to ask this community whether there are any glaring errors in my paper or in how I structured it. I also wanted to ask whether there is any scope for this paper beyond publishing it, for example any interest in using it as a guide or applying any of the ideas within. Below is the abstract of my paper.
AlphaFold is a machine learning program that uses a protein’s sequence and its evolutionary history to create a model of the 3D structure of said protein. This paper aims to analyze the purpose, methodology, and structure of AlphaFold’s neural networks and code to provide a clearer understanding of AlphaFold as well as to show the limitations and areas of improvement it has. AlphaFold analyzes the evolutionary history of a protein as well as similarities to known proteins and pairwise evolutionary correlations to predict protein structures to high accuracy using the Evoformer neural network to create a refined pair representation which is converted into a 3D structure by an additional structure stage. Additionally, by analyzing the code from both AlphaFold (found on GitHub) and ColabFold, this paper shows specific customizable features which can change the type of model, MSA, and pair mode depending on whether the protein sequence is a monomer or multimer, or by user preference to optimize for speed, accuracy, etc, and it also demonstrates the methodology of AlphaFold by visualizing the structure of the code. Finally, limitations of AlphaFoldv2 include struggling with novel, variable, and non-static proteins, as well as not being able to model vital functions and molecular structures closely related to static proteins; several ideas to improve AlphaFold are proposed and/or analyzed for feasibility that vary in scope, and an analysis of how newer versions of AlphaFold (AlphaFoldv3) remedy these limitations is also included. These questions and analyses provide a deeper understanding of AlphaFold.
Here is the link to my paper if anyone is interested in reading it. I formatted it specifically to submit to the AJSR journal for high schoolers.
https://docs.google.com/document/d/1WID-0boVKapNz7ANE3M3aIorr0GPmDRCT_lnjmZS8xE/edit?tab=t.0
Thank you.
I would also add a caveat to this latter part: whether or not a modelling approach "works" on a novel or recalcitrant protein target is somewhat in the eye of the beholder.
Of course, objectively AF isn't getting the structure of the protein right (but as you rightly point out, neither was anyone else). One can go so far as to argue that no crystal structures are "right", because the true conformation of the protein is rarely if ever what appears on your computer screen, so the notion of a 'ground truth' is a bit misleading.
We use AF to predict proteins which don't currently have any structures, and we know many of the models it produces range from wrong to outright garbage. Even so, AF does a better job than pretty much anything else with these particular proteins, and it has given us helpful hypothesis generation and guidance for recombinant studies we simply wouldn't have had without it.
This is all a long way of saying that, for our purposes, AF "works", even if it isn't 100% correct in its predictions. I am then of course reminded of this:
Thank you so much for your feedback. I broke up the abstract into more digestible chunks and de-emphasized the limitations of AlphaFold in the latter portions of the abstract. In the actual paper, the limitations only make up a small section, so I see why they shouldn't be one of the major points in the abstract. If you decide to read the paper, please don't hesitate to be harsh with the feedback; I am completely new to the field and just trying to learn.
Here is the revised abstract:
AlphaFold is a machine learning program that uses a protein’s sequence to create a 3D model of the protein’s structure. This paper aims to analyze the purpose, methodology, and structure of AlphaFold’s neural networks and code to provide a clearer understanding of AlphaFold as well as to show the limitations and areas of improvement it has. AlphaFold analyzes the evolutionary history of a protein and similarities to known proteins and pairwise evolutionary correlations to predict protein structures to high accuracy. It uses the Evoformer neural network to create a refined pair representation which is converted into a 3D structure by an additional structure stage. Additionally, by analyzing the code from both AlphaFold (found on GitHub) and ColabFold, this paper shows specific customizable features which can change the type of model, MSA, and pair mode during the prediction. These features change depending on whether the protein sequence is a monomer or multimer, or by user preference to optimize for speed, accuracy, etc. They demonstrate the methodology of AlphaFold by visualizing the structure of the code. While the tool struggles to predict novel, variable, and non-static proteins, it is still a greatly significant tool and greatly useful in most cases. Newer versions of AlphaFold (AlphaFoldv3) continue to iterate and expand the functionality of the tool by expanding predictions to ligands, free nucleic acids, ions, and complexes.
Again, thank you for your feedback; you have helped me so much.
I think your abstract reads better, but it isn't something to be fixed in a couple of hours. Again, many people may read the abstract and nothing else, so it is important to be precise and economical. I will illustrate that on your first sentence:
Would you rather say that "bakers use eggs and sugar to make a cake" or that "bakers make cakes from eggs and sugar"? Even though the overall meaning of both statements is the same, I think it is more important to state what bakers do, followed by how they do it, than the other way around. Using a similar logic, an alternative to your first sentence could be something like "AlphaFold is a machine learning program that predicts a protein's structure from its sequence."
There is also an important distinction between AF "predicting" versus "creating" models.
I have no intention of breaking down each sentence in your abstract, but hopefully this illustrates the deliberate approach that I advocate for when writing an abstract.