Forum: A beginner's guide and in-depth exploration of AlphaFold
1 day ago

Hello Everyone. For my summer project (I am a high schooler), I investigated AlphaFold and wrote a mostly-review paper on it, titled AlphaFold: A Beginner’s Guide And In-depth Exploration Of The Revolutionary AI Tool And Its Inner Workings. I wanted to ask this community if there are any glaring errors or mistakes in my paper or how I decided to structure it. I also wanted to ask if there is any scope for this paper beyond publishing it, such as if there is any interest in using it as a guide or applying any of the ideas within. Below is the abstract of my paper.

AlphaFold is a machine learning program that uses a protein’s sequence and its evolutionary history to create a model of the 3D structure of said protein. This paper aims to analyze the purpose, methodology, and structure of AlphaFold’s neural networks and code to provide a clearer understanding of AlphaFold as well as to show the limitations and areas of improvement it has. AlphaFold analyzes the evolutionary history of a protein as well as similarities to known proteins and pairwise evolutionary correlations to predict protein structures to high accuracy using the Evoformer neural network to create a refined pair representation which is converted into a 3D structure by an additional structure stage. Additionally, by analyzing the code from both AlphaFold (found on GitHub) and ColabFold, this paper shows specific customizable features which can change the type of model, MSA, and pair mode depending on whether the protein sequence is a monomer or multimer, or by user preference to optimize for speed, accuracy, etc, and it also demonstrates the methodology of AlphaFold by visualizing the structure of the code. Finally, limitations of AlphaFoldv2 include struggling with novel, variable, and non-static proteins, as well as not being able to model vital functions and molecular structures closely related to static proteins; several ideas to improve AlphaFold are proposed and/or analyzed for feasibility that vary in scope, and an analysis of how newer versions of AlphaFold (AlphaFoldv3) remedy these limitations is also included. These questions and analyses provide a deeper understanding of AlphaFold.

Here is the link to my paper if anyone is interested in reading it. I formatted it specifically to submit to the AJSR journal for high schoolers.

https://docs.google.com/document/d/1WID-0boVKapNz7ANE3M3aIorr0GPmDRCT_lnjmZS8xE/edit?tab=t.0

Thank you.

AlphaFold ColabFold • 280 views
1 day ago
Mensur Dlakic ★ 28k

I want to commend you for tackling an extremely difficult subject matter in your writing. This is a serious undertaking regardless of one's experience or educational level.

I might yet read your paper and offer additional insights, but here are some quick thoughts on your abstract.

AlphaFold analyzes the evolutionary history of a protein as well as similarities to known proteins and pairwise evolutionary correlations to predict protein structures to high accuracy using the Evoformer neural network to create a refined pair representation which is converted into a 3D structure by an additional structure stage. Additionally, by analyzing the code from both AlphaFold (found on GitHub) and ColabFold, this paper shows specific customizable features which can change the type of model, MSA, and pair mode depending on whether the protein sequence is a monomer or multimer, or by user preference to optimize for speed, accuracy, etc, and it also demonstrates the methodology of AlphaFold by visualizing the structure of the code. Finally, limitations of AlphaFoldv2 include struggling with novel, variable, and non-static proteins, as well as not being able to model vital functions and molecular structures closely related to static proteins; several ideas to improve AlphaFold are proposed and/or analyzed for feasibility that vary in scope, and an analysis of how newer versions of AlphaFold (AlphaFoldv3) remedy these limitations is also included.

The quoted section has 3 sentences, while it should be at least 5. I think too many thoughts are going into a single sentence, and the whole thing must be more digestible.

You have said the same thing twice in three sentences: that AF "uses a protein’s sequence and its evolutionary history" and "analyzes the evolutionary history of a protein". You have the phrase "as well as" 3 times in 5 sentences, and several sentences where you have 2+ "and" words plus at least one "as well as" phrase. That goes back to writing complex sentences I referenced earlier - you have to know when to put a full stop.
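
A rough way to run this kind of check on a draft is a few lines of Python. This is only a sketch under stated assumptions: the function name `sentence_stats` is made up for illustration, the sentence splitter is naive (abbreviations such as "e.g." will confuse it), and the sample text is just the first two sentences of the original abstract.

```python
import re

def sentence_stats(text, phrases=("as well as", "and")):
    """Split text into sentences and count whole-word phrase hits in each one."""
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    stats = []
    for s in sentences:
        counts = {
            # \b keeps "and" from matching inside words such as "understanding".
            p: len(re.findall(r"\b" + re.escape(p) + r"\b", s, flags=re.IGNORECASE))
            for p in phrases
        }
        stats.append({"words": len(s.split()), **counts})
    return stats

# Sample input: the first two sentences of the original abstract.
sample = (
    "AlphaFold is a machine learning program that uses a protein's sequence "
    "and its evolutionary history to create a model of the 3D structure of said protein. "
    "This paper aims to analyze the purpose, methodology, and structure of AlphaFold's "
    "neural networks and code to provide a clearer understanding of AlphaFold as well as "
    "to show the limitations and areas of improvement it has."
)

for i, row in enumerate(sentence_stats(sample), start=1):
    print(i, row)
```

Long sentences with several "and" hits plus an "as well as" are the ones worth splitting first.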

I suggest you think about it this way: those who are going to read your paper beyond the title are almost guaranteed to read the abstract, but not the rest. I am the prime example of that approach. If the abstract is difficult to read (e.g., repetitive, unfocused) or inaccurate (see below), it is less likely that your potential readers will take the time to read the paper.

What I am about to say next is a relative criticism of your abstract, and not everyone would agree with it. You might have picked up the supposed deficiencies of AF by reading other papers, in which case you are only mimicking what others have done before you. It always feels disingenuous to me when people criticize AF for not working with what you call "novel, variable, and non-static proteins". To me, that's like criticizing Ben and Jerry's for not having enough of a vodka note in their Cherry Garcia ice cream. Did they set out to make a sweet cream with cherry chunks that would be better than any other ice cream of that type, or did they mean to do all that AND pair it with vodka flavor? If the former, then they did exactly as planned, and the rest is just our unrealistic expectations.

The AF creators, just like all protein modelers before them, set out to create a global protein modeling program. They succeeded spectacularly in that endeavor, and anyone who objects to that statement can log a belated complaint with the Royal Swedish Academy of Sciences. No protein modeling approach in existence before 2018 worked on novel, variable, and non-static proteins, because none of them worked even close to universally on predicting static protein structures. This is to say that one had to solve the general protein modeling problem first. Even now, when we start from a higher baseline provided by AF, there is no program that universally works on novel, variable, and non-static proteins. That is despite the fact that many groups have actually set out to model novel, variable, and non-static proteins, unlike AF. Not that many people choose to focus on dozens of other protein-related problems that were solved by generating AF weights and additional training. Rather, people focus on AF not solving all the other problems it never set out to solve, including novel, variable, and non-static proteins.

It is obviously up to you what goes into the abstract, but I would never put these limitations in the abstract. I think it is fair game to mention that AF cannot model every single protein and all aspects of their dynamics, but to me that's not an abstract item. If I were writing this piece, the highlight would be on all the problems AF unintentionally helped to solve, rather than those it didn't.

A couple of potentially useful resources are listed below for AF v2 and v3.


I would add a small caveat to this latter part: whether or not a modelling approach "works" on a novel or recalcitrant protein target is somewhat in the eye of the beholder.

Of course, objectively AF isn't getting the structure of the protein right (but as you rightly point out, neither was anyone else). One can go so far as to argue that no crystal structures are "right", because the true conformation of the protein is rarely if ever what appears on your computer screen, so the notion of a 'ground truth' is a bit misleading.

We use AF to predict structures for proteins that don't currently have any, and we know that many of the models it produces range from wrong to outright garbage. Even so, AF does a better job than pretty much anything else with these particular proteins, and it has given us helpful hypothesis generation and guidance for recombinant studies that we simply wouldn't have had without it.

This is all a long way of saying that, for our purposes, AF "works", even if it isn't 100% correct in its predictions. I am then of course reminded of this:

[image]


Thank you so much for your feedback. I broke up the abstract into more digestible chunks and de-emphasized the limitations of AlphaFold in the latter portion of the abstract. In the actual paper, the limitations make up only a small section, so I see why they shouldn't be one of the major points in the abstract. If you decide to read the paper, please don't hesitate to be harsh with the feedback; I am completely new to the field and just trying to learn.

Here is the revised abstract:

AlphaFold is a machine learning program that uses a protein’s sequence to create a 3D model of the protein’s structure. This paper aims to analyze the purpose, methodology, and structure of AlphaFold’s neural networks and code to provide a clearer understanding of AlphaFold as well as to show the limitations and areas of improvement it has. AlphaFold analyzes the evolutionary history of a protein and similarities to known proteins and pairwise evolutionary correlations to predict protein structures to high accuracy. It uses the Evoformer neural network to create a refined pair representation which is converted into a 3D structure by an additional structure stage. Additionally, by analyzing the code from both AlphaFold (found on GitHub) and ColabFold, this paper shows specific customizable features which can change the type of model, MSA, and pair mode during the prediction. These features change depending on whether the protein sequence is a monomer or multimer, or by user preference to optimize for speed, accuracy, etc. They demonstrate the methodology of AlphaFold by visualizing the structure of the code. While the tool struggles to predict novel, variable, and non-static proteins, it is still a greatly significant tool and greatly useful in most cases. Newer versions of AlphaFold (AlphaFoldv3) continue to iterate and expand the functionality of the tool by expanding predictions to ligands, free nucleic acids, ions, and complexes.

Again, thank you for your feedback, you have helped me so much.


I think your abstract reads better, but it isn't something to be fixed in a couple of hours. Again, many people may read the abstract and nothing else, so it is important to be precise and economical. I will illustrate that on your first sentence:

AlphaFold is a machine learning program that uses a protein’s sequence to create a 3D model of the protein’s structure.

Would you rather say that "bakers use eggs and sugar to make a cake" or that "bakers make cakes from eggs and sugar"? Even though the overall meaning of both statements is the same, I think it is more important to state what bakers do, followed by how they do it, than the other way around. Using a similar logic, an alternative to your first sentence could be something like "AlphaFold is a machine learning program that predicts a protein's structure from its sequence." There is also an important distinction between AF predicting versus creating models.

I have no intention of breaking down each sentence in your abstract, but hopefully this illustrates the deliberate approach that I advocate for when writing an abstract.
