Question

Forum: A beginner's guide and in-depth exploration of AlphaFold

4 months ago

shravansaranyan ▴ 20

Hello everyone. For my summer project (I am a high schooler), I investigated AlphaFold and wrote a mostly-review paper on it, titled "AlphaFold: A Beginner’s Guide And In-depth Exploration Of The Revolutionary AI Tool And Its Inner Workings". I wanted to ask this community whether there are any glaring errors or mistakes in my paper or in how I decided to structure it. I also wanted to ask whether there is any scope for this paper beyond publishing it, for example any interest in using it as a guide or applying any of the ideas within. Below is the abstract of my paper.

AlphaFold is a machine learning program that uses a protein’s sequence and its evolutionary history to create a model of the 3D structure of said protein. This paper aims to analyze the purpose, methodology, and structure of AlphaFold’s neural networks and code to provide a clearer understanding of AlphaFold as well as to show the limitations and areas of improvement it has. AlphaFold analyzes the evolutionary history of a protein as well as similarities to known proteins and pairwise evolutionary correlations to predict protein structures to high accuracy using the Evoformer neural network to create a refined pair representation which is converted into a 3D structure by an additional structure stage. Additionally, by analyzing the code from both AlphaFold (found on GitHub) and ColabFold, this paper shows specific customizable features which can change the type of model, MSA, and pair mode depending on whether the protein sequence is a monomer or multimer, or by user preference to optimize for speed, accuracy, etc, and it also demonstrates the methodology of AlphaFold by visualizing the structure of the code. Finally, limitations of AlphaFoldv2 include struggling with novel, variable, and non-static proteins, as well as not being able to model vital functions and molecular structures closely related to static proteins; several ideas to improve AlphaFold are proposed and/or analyzed for feasibility that vary in scope, and an analysis of how newer versions of AlphaFold (AlphaFoldv3) remedy these limitations is also included. These questions and analyses provide a deeper understanding of AlphaFold.

Here is the link to my paper if anyone is interested in reading it. I formatted it specifically to submit to the AJSR journal for high schoolers.

https://docs.google.com/document/d/1WID-0boVKapNz7ANE3M3aIorr0GPmDRCT_lnjmZS8xE/edit?tab=t.0

Thank you.

AlphaFold ColabFold • 2.0k views

updated 4 months ago by Mensur Dlakic ★ 29k • written 4 months ago by shravansaranyan ▴ 20

score 3 · Accepted Answer · 2025-01-07

I want to commend you for tackling an extremely difficult subject matter in your writing. This is a serious undertaking regardless of one's experience or educational level.

I might yet read your paper and offer additional insights, but here are some quick thoughts on your abstract.

AlphaFold analyzes the evolutionary history of a protein as well as similarities to known proteins and pairwise evolutionary correlations to predict protein structures to high accuracy using the Evoformer neural network to create a refined pair representation which is converted into a 3D structure by an additional structure stage. Additionally, by analyzing the code from both AlphaFold (found on GitHub) and ColabFold, this paper shows specific customizable features which can change the type of model, MSA, and pair mode depending on whether the protein sequence is a monomer or multimer, or by user preference to optimize for speed, accuracy, etc, and it also demonstrates the methodology of AlphaFold by visualizing the structure of the code. Finally, limitations of AlphaFoldv2 include struggling with novel, variable, and non-static proteins, as well as not being able to model vital functions and molecular structures closely related to static proteins; several ideas to improve AlphaFold are proposed and/or analyzed for feasibility that vary in scope, and an analysis of how newer versions of AlphaFold (AlphaFoldv3) remedy these limitations is also included.

The quoted section has 3 sentences, while it should be at least 5. I think too many thoughts are going into a single sentence, and the whole thing must be more digestible.

You have said the same thing twice in three sentences: that AF "uses a protein’s sequence and its evolutionary history" and "analyzes the evolutionary history of a protein". You use the phrase "as well as" 3 times in 5 sentences, and several sentences have two or more "and"s plus at least one "as well as". That goes back to the overly complex sentences I mentioned earlier: you have to know when to put a full stop.
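One rough way to spot this pattern in a draft is to count words and connectives per sentence. The snippet below is a minimal sketch under stated assumptions: `sentence_stats` is a hypothetical helper, the sentence splitter is naive (it breaks on `.`, `!` or `?` followed by whitespace, so abbreviations will confuse it), and the sample text is made up for illustration.

```python
import re


def sentence_stats(text, phrase="as well as"):
    """Return per-sentence word counts and connective counts.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [p.strip() for p in parts if p.strip()]
    stats = []
    for sentence in sentences:
        lowered = sentence.lower()
        stats.append({
            "words": len(sentence.split()),
            # How often the given phrase appears in this sentence.
            "phrase_count": lowered.count(phrase),
            # How often the standalone word "and" appears.
            "and_count": len(re.findall(r"\band\b", lowered)),
        })
    return stats


# Made-up sample text, not taken from the paper.
sample = "Cats sleep and dogs bark as well as howl. Birds sing."
for number, row in enumerate(sentence_stats(sample), start=1):
    print(number, row)
```

Sentences with a high word count and several connectives are candidates for splitting; any threshold is a matter of taste rather than a rule.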

I suggest you think about it this way: those who are going to read your paper beyond the title are almost guaranteed to read the abstract, but not the rest. I am the prime example of that approach. If the abstract is difficult to read (e.g., repetitive, unfocused) or inaccurate (see below), it is less likely that your potential readers will take the time to read the paper.

What I am about to say next is a relative criticism of your abstract, and not everyone would agree with it. You might have picked up the supposed deficiencies of AF by reading other papers, in which case you are only mimicking what others have done before you. It always feels disingenuous to me when people criticize AF for not working with what you call "novel, variable, and non-static proteins". To me, that's like criticizing Ben and Jerry's for not having enough of a vodka note in their Cherry Garcia ice cream. Did they set out to make a sweet cream with cherry chunks that would be better than any other ice cream of that type, or did they mean to do all that AND pair it with vodka flavor? If the former, then they did exactly as planned, and the rest is just our unrealistic expectations.

The AF creators, just like all protein modelers before them, set out to create a global protein modeling program. They succeeded spectacularly in that endeavor, and anyone who objects to that statement can lodge a belated complaint with the Royal Swedish Academy of Sciences. No protein modeling approach in existence before 2018 worked on novel, variable, and non-static proteins, because none of them came even close to working universally on static protein structures. This is to say that one had to solve the general protein modeling problem first. Even now, when we start from a higher baseline provided by AF, there is no program that universally works on novel, variable, and non-static proteins. That is despite the fact that many groups, unlike the AF team, have actually set out to model such proteins. Not many people choose to focus on the dozens of other protein-related problems that were solved by generating AF weights and additional training. Rather, people focus on AF not solving all the other problems it never set out to solve, including novel, variable, and non-static proteins.

It is obviously up to you what goes into the abstract, but I would never put these limitations there. I think it is fair game to mention that AF cannot model every single protein and all aspects of their dynamics, but to me that's not an abstract item. If I were writing this piece, the highlight would be on all the problems AF unintentionally helped to solve, rather than those it didn't.

A couple of potentially useful resources are listed below for AF v2 and v3.