Question

Expedite alphafold2 or RoseTTAFold for bulk jobs

2.9 years ago

btski ▴ 10

I'd like to preface this post by saying I do not have much experience in structural or molecular biology.

I'm interested in implementing alphafold2 (or RoseTTAFold) to generate high-confidence protein folding structures for a database of ~200,000 viral proteins, approximately 100 of which are unique. The proteins are generally short and don't exceed 1,000 AA. Running alphafold2 comprehensively for each protein is not possible. Ideally, I would like to limit database (both genetic and structural) searching as much as possible. I am considering the following protocol:

1). Run alphafold2 on a single reference protein. Retain database search output.
2). Identify other proteins within ~2% genetic similarity to reference (note - some proteins can be wildly diverse).
3). Use the reference database search output to jump-start processing proteins from step 2.

The remainder of the processing would be the same (and likely prohibitively expensive in time and resources). Is there a more efficient way to meaningfully predict protein folding (i.e., output that a structural or molecular biologist might find useful)?
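If many of the ~200,000 entries are exact duplicates of one another, collapsing them before any prediction step should shrink the workload a lot. The following is a minimal sketch of that idea, not part of the original post: `read_fasta` and `collapse_identical` are hypothetical helper names, and it assumes the proteins are stored in a plain FASTA file.

```python
from collections import defaultdict


def read_fasta(path):
    """Yield (header, sequence) pairs from a plain FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def collapse_identical(records):
    """Map each distinct sequence to the list of headers that share it."""
    groups = defaultdict(list)
    for header, seq in records:
        # Case-insensitive comparison; this is an assumption about the data.
        groups[seq.upper()].append(header)
    return dict(groups)
```

Calling `collapse_identical(read_fasta("proteins.fasta"))` (the file name is a placeholder) gives one key per distinct sequence, so only those keys would need a structure prediction.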

alphafold protein-structure • 1.1k views

ADD COMMENT • link updated 9 months ago by Ram 44k • written 2.9 years ago by btski ▴ 10

score 2 · Answer 1 · 2022-01-06

Not going to sugarcoat it - for someone without "much experience in structural or molecular biology", you may have bitten off more than you can chew. There are all kinds of logistical problems with installing and successfully running AlphaFold2, both in terms of time and resources. Assuming all that goes well, interpreting the models and deciding which ones are useful is not a straightforward task without a background in the structural biology of proteins.

One way of making your task less time-consuming is to cluster the proteins into groups ahead of time. After that is done, you should end up with far fewer groups than ~200,000. The idea is that all the proteins in a group are related in terms of function and likely structure, so it is not necessary to model all of them in order to understand their structures. In fact, picking a single representative protein per group to model is usually enough. Your criterion of treating only proteins within ~2% of the reference as relatives is most likely too stringent. In most cases, proteins of similar length that have at least 50% identity are very likely to have the same fold and very similar structures. That is sometimes true even for proteins that have only 20% identity.

Protein clustering can be done in many different ways. Again accounting for your lack of experience, you may want to try MMseqs2 because it requires only a set of sequences as input and will do all the steps automatically.
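As a rough sketch of what that could look like, the snippet below builds an `mmseqs easy-cluster` command and reads back the two-column cluster table (representative, member) that MMseqs2 writes as `<prefix>_cluster.tsv`. The 0.5 identity threshold mirrors the 50% figure above; the 0.8 coverage value, the file names, and the Python helper names are assumptions made for illustration, and MMseqs2 has to be installed separately.

```python
import subprocess
from collections import defaultdict


def build_mmseqs_command(fasta, prefix="clusterRes", tmp_dir="tmp",
                         min_seq_id=0.5, coverage=0.8):
    """Return the argument list for an `mmseqs easy-cluster` run."""
    return [
        "mmseqs", "easy-cluster", fasta, prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),  # minimum sequence identity
        "-c", str(coverage),              # minimum alignment coverage (assumed value)
    ]


def run_clustering(fasta, prefix="clusterRes", tmp_dir="tmp"):
    """Run MMseqs2 (must be on PATH) and return the cluster table path."""
    subprocess.run(build_mmseqs_command(fasta, prefix, tmp_dir), check=True)
    return f"{prefix}_cluster.tsv"


def read_clusters(tsv_path):
    """Map each representative ID to the list of its cluster members."""
    clusters = defaultdict(list)
    with open(tsv_path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            representative, member = line.split("\t")[:2]
            clusters[representative].append(member)
    return dict(clusters)
```

After `read_clusters(run_clustering("proteins.fasta"))`, the dictionary keys are the representatives, which would be the only sequences passed on to AlphaFold2 or RoseTTAFold. Lowering `min_seq_id` gives fewer, broader clusters at the cost of grouping more divergent proteins together.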

score 0 · Answer 2 · 2024-02-12

9 months ago

Jasper • 0

I've found using AlphaFold and Afcluster to be helpful for a similar task. I've been using the online tool https://www.tamarind.bio/ for large-scale AlphaFold runs; it can do many predictions in parallel, which makes the process much more efficient.

Are you affiliated with this company?

Are you working in a coordinated manner with Sherry to promote Tamarind Bio?

ADD REPLY • link 9 months ago by Ram 44k