I'm new to both protein structure prediction and AI-based tools like AlphaFold2 or RoseTTAFold, and I have a few questions:
1. Is it possible to use AlphaFold2 structure predictions to validate HMMER-based domain sequence predictions? If yes, what would the steps be? I have some idea, but I'm not sure it will work, so I'd welcome your input and advice.
2. For predicting structure of a protein domain sequence, should I feed the software
the entire protein sequences, or
just the domain sequence, or
domain sequence + some bordering additional aa residues?
3. For AlphaFold2, are there currently any ready-to-use virtual machines that I can launch as an end user, rather than fumbling with the installation myself?
4. Is there any cloud-based server that provides not just approximate predictions, but the results I would obtain from the full install? For example, AFAIK some of the free-to-use Google Colab Python notebooks are not full implementations and use a shortcut for the computationally intensive MSA step. Wouldn't those predictions be less accurate?
5. Would the prediction pipeline you suggest also be practical for high-throughput structure prediction to validate my domain predictions, where I can automate submission of ~1500 sequences (either full-length protein or domain-only sequences)? Is it even possible to automate submission to some of the free-to-use Google Colab Python notebooks (from Martin Steinegger's group)?
Thanks in advance.
Not particularly my area of expertise, but:
I don't think you can use a structure prediction tool to really 'validate' HMMER predictions. I'm pretty sure most structure predictors rely on HMMER or similar HMM-based approaches (Martin told me AlphaFold leans on HHblits API calls, for example). I would argue that the HMMs are probably 'closer to the raw data', so really you'd be validating a prediction with HMMs rather than the other way around - though this is a fairly nitpicky point (and all data in concert together is never a bad thing).
Generally the whole sequence, but there's no one-size-fits-all answer. If it's a very large protein, or has domains that are not well characterised, then you may get better results feeding in separate domains. The predictor I-TASSER, for instance, actually has a hardcoded limit of 1500 AAs, so if your protein is too large, you'd have no choice but to break it up. In general, the larger the sequence you submit, the more likely the prediction will be of lower accuracy.
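If you want to try the 'domain only' and 'domain plus flanking residues' options side by side, a minimal sketch of the slicing step might look like the following. The function name `slice_domain`, the `flank` parameter, and the example sequence are all made up for illustration; coordinates are assumed to be 1-based and inclusive, as reported by tools like hmmsearch.

```python
def slice_domain(sequence, start, end, flank=0):
    """Return the domain at 1-based inclusive [start, end], padded by `flank` residues.

    Padding is clipped at the protein termini, so a large flank simply
    returns everything up to the end of the sequence.
    """
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError("domain coordinates fall outside the sequence")
    left = max(start - 1 - flank, 0)           # convert to 0-based, clip at N-terminus
    right = min(end + flank, len(sequence))    # clip at C-terminus
    return sequence[left:right]


# Hypothetical protein and domain coordinates, for illustration only.
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
domain_only = slice_domain(protein, 5, 10)
domain_padded = slice_domain(protein, 5, 10, flank=3)
print(domain_only, domain_padded)
```

Running the same domain through the predictor with two or three flank sizes is a cheap way to see whether the extra context changes the predicted fold.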
Not really sure on this, so others may weigh in. I believe the GitHub page offers a Docker image with AlphaFold ready to run. I have a vague recollection of there being a webserver available but don't know for certain offhand.
I'm not aware of 'short-cuts' per se (but again, not my area of expertise). As mentioned, I know part of the AlphaFold pipeline calls out to the HH-suite set of tools. I don't know that this would be any less accurate though, as HH-suite is a very good set of HMM and alignment tools. I can't imagine you could hope for much better alignment/templating with other approaches/tools.
Protein structure prediction remains a very computationally costly task, so the only real way you'd ever obtain structures for 1500 sequences in a short space of time would be to run the software locally on infrastructure you control. I'm aware of Martin's Colab set-up, but I'm afraid I don't know whether you could easily mass-auto-submit jobs.
Joe Thanks a TON for your point by point replies. Please find below my comments / thoughts.
HMMER3-based searches (using hmmsearch or pfamscan) yield sequence matches that vary in E-value, bit score and length. Some of these sequence predictions may not be 'true positives'. I therefore started thinking of using structure predictions to classify my HMMER3-based sequence predictions into 'false positive', 'true positive' and 'uncertain' categories.
I have done 3D superposition of the available PDBs for this protein domain, and they overlap well, even at < ~20% pairwise identity - this is not surprising, and only confirms that structure is better conserved than sequence. So my idea is to set conditions such as these: if more than half of the 3 helices in the structure are missing, AND/OR the contact residues that allow physical interaction with the protein partner(s) are missing, I would classify the hit as 'uncertain' or 'false positive'. This idea is still amorphous and evolving. What do you think?
On the Google Colab Python notebook, which runs a simplified version of AlphaFold2 [or was it RoseTTAFold? - can't remember, but it may be an important distinction], 1 submission took ~10 minutes to return the results. This is super quick compared to most other methods... So it's not total computational time that is the challenge, but the impracticality of submitting 1500 entries manually, one by one... Are there any workarounds to this?
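To make the rule above concrete, here is one possible reading of it as a small function. This is only a sketch: the 'AND/OR' is interpreted here as "both criteria failing means 'false positive', exactly one failing means 'uncertain'", which is an assumption, and the name `classify_hit` and its arguments are hypothetical. How you decide that a helix or a contact residue is 'present' in a predicted structure is left out entirely.

```python
def classify_hit(helices_present, contacts_present, total_helices=3):
    """Classify a predicted domain structure using two structural criteria.

    helices_present:  number of the expected helices found in the model
    contacts_present: True if the partner-contacting residues are present
    """
    # "More than half of the helices are missing"
    too_few_helices = (total_helices - helices_present) > total_helices / 2
    if too_few_helices and not contacts_present:
        return "false positive"   # both criteria fail
    if too_few_helices or not contacts_present:
        return "uncertain"        # exactly one criterion fails
    return "true positive"        # both criteria pass


for helices, contacts in [(3, True), (1, True), (2, False), (1, False)]:
    print(helices, contacts, classify_hit(helices, contacts))
```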
Best. Charlotte
You could certainly try it - but I'm not sure if this is going to achieve the objective. If your HMM matches have E-values that are sufficiently low (1e-06 is a pretty common default), I wouldn't be hugely worried that they are false positives. If you have hits that have poor E-values, then this would suggest there isn't really much structural information available to match to, and in all likelihood, you'd get poor structure simulations anyway (so you'd just end up comparing junk to junk).
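As a starting point for separating the hits worth worrying about from the rest, a rough sketch for binning hmmsearch per-domain hits by E-value could look like this. It assumes the whitespace-delimited `--domtblout` layout as I understand it (target name in the first column, independent E-value in the 13th, envelope coordinates in the 20th and 21st), so check the column positions against the HMMER user guide for your version. The `loose` cutoff of 1e-2 and the bin names are arbitrary choices, and the example rows are made up.

```python
def bin_domain_hits(domtblout_lines, strict=1e-6, loose=1e-2):
    """Group per-domain hits by independent E-value (i-Evalue)."""
    bins = {"confident": [], "borderline": [], "weak": []}
    for line in domtblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/footer comments
        fields = line.split()
        hit = {
            "target": fields[0],
            "i_evalue": float(fields[12]),
            "env_from": int(fields[19]),
            "env_to": int(fields[20]),
        }
        if hit["i_evalue"] <= strict:
            bins["confident"].append(hit)
        elif hit["i_evalue"] <= loose:
            bins["borderline"].append(hit)
        else:
            bins["weak"].append(hit)
    return bins


# Made-up rows in the assumed domtblout layout, for illustration only.
example_rows = [
    "# target name  accession  tlen  query name ...",
    "protA - 320 MyDomain - 110 1.2e-30 105.3 0.1 1 1 3.0e-33 2.5e-30 104.2 0.1 1 110 15 120 12 125 0.95 example",
    "protB - 280 MyDomain - 110 4.0e-04 20.1 0.0 1 1 1.0e-06 8.0e-04 19.5 0.0 5 90 40 130 35 140 0.80 example",
    "protC - 150 MyDomain - 110 2.1 8.3 0.0 1 1 0.5 3.5 7.9 0.0 20 60 10 52 8 60 0.60 example",
]
bins = bin_domain_hits(example_rows)
for name, hits in bins.items():
    print(name, [h["target"] for h in hits])
```

Only the 'borderline' bin would then need any structural follow-up, which also cuts down the number of predictions to run.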
Without knowing what the proteins are, I can't really say whether this sounds like a reasonable set of criteria or not. If you have some pre-existing biological insight as to what the 'important' features of these proteins are, then sure, you can probably draw up some criteria like that. I would be inclined to use (or at least start with) some more generalist and objective metrics of structure similarity, for example TM-score.
Joe - will look into your suggestions, thank you very much :)
BTW, for pairwise comparison of structures, is TM-score currently the best option? Is it computationally intensive to install and run locally? If yes, are there 'lighter' alternatives for a quick-and-dirty initial analysis of my predicted structures? TIA!
TMscore is a single C++ program that compiles into a single file:
It takes about a second to run for a pair of structures.
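For scripting it over many pairs, a small wrapper along these lines may help. This is a sketch under assumptions: it expects a compiled `TMscore` executable on your PATH that is called as `TMscore model.pdb reference.pdb`, and it assumes the report contains a line beginning `TM-score    = 0.xxxx`. Check both against the version you build. The names `parse_tm_score` and `run_tmscore` are made up here.

```python
import re
import subprocess

# Matches a report line such as "TM-score    = 0.9563  (d0= 4.74)".
TM_RE = re.compile(r"^TM-score\s*=\s*([0-9.]+)", re.MULTILINE)


def parse_tm_score(report):
    """Pull the TM-score value out of the program's text report, or None."""
    match = TM_RE.search(report)
    return float(match.group(1)) if match else None


def run_tmscore(model_pdb, reference_pdb, executable="TMscore"):
    """Run the TMscore executable on one pair and return the parsed score."""
    result = subprocess.run(
        [executable, model_pdb, reference_pdb],
        capture_output=True, text=True, check=True,
    )
    return parse_tm_score(result.stdout)
```

Looping `run_tmscore` over each predicted model against one reference structure gives a single number per hit that can be thresholded or plotted.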
OK, that's super fast. So no software dependencies? It doesn't seem to need other software, based on a quick look at https://zhanglab.dcmb.med.umich.edu/TM-score/. TIA!
From my (albeit limited) experience, I believe TM-score to be one of the best current metrics (as it overcomes some of the limitations of RMSD). I don't recall it being particularly computationally intensive when I was running it on around 1600 proteins.
Little update, Martin confirmed that, at the moment, direct batch submission isn't possible:
Does that sound reasonable? Do you think there are hurdles to even this sort of a simple(r) approach?
PS. apologies for my delayed reply, I think Biostars email notification went to spam!
Are you sure you really need to model all 1000? Your approach of reducing the dataset through doing some clustering or something sounds sensible.
Do you have access to any local HPC or workstations you can run simulations on? During my PhD I ran simulations on many hundreds of sequences using a local I-TASSER install on a pretty standard server (32 cores, 100 GB RAM).
Which brings me to my next question - do you have to use AlphaFold? It's accurate, sure, but for many proteins for which there is already good experimental structural data, I-TASSER etc. will still perform very well (and may be faster, I don't know).
Otherwise the approach you mention sounds reasonable to me.
So with your help and that of others in this post, my analysis pipeline and strategy are decided, thank you all v. much :)
Consider using Tamarind Bio! (https://www.tamarind.bio/app) It's free for academic use and can do multiple predictions simultaneously. It can also do predictions up to 5000 amino acids for both complexes and monomers.
Thanks for the recommendation, this worked for me
You mean notebooks like https://colab.research.google.com/github/deepmind/alphafold/blob/main/notebooks/AlphaFold.ipynb and https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb? According to their own Colab page: "While accuracy will be near-identical to the full AlphaFold system on many targets, a small fraction have a large drop in accuracy due to the smaller MSA and lack of templates". This looks like one of the best possible alternatives to downloading and installing AlphaFold2 on a local system, as the databases take up about 2.2 TB of space.