First of all, I want to strongly recommend that you not do what you are planning. I have been building HHblits-like databases of PDB structures on a monthly basis since 2005. Back then there were other tools to gather and align members, but eventually I switched the whole thing to HHblits. This database has over 100,000 HMMs and gains 300-400 new members each month. Just a monthly update is a fairly large undertaking that requires a lot of computer time and a fair amount of RAM. I can't imagine doing it from scratch on anything smaller than a super-cluster, and it would still take many months. Besides, HHsuite already has such a database based on PDB structures and clustered at 70% identity:
http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/
The latest version is from Nov 17th, 2021, so it isn't even a week old. Please don't take offense, but I can't imagine that you would do a better job at it than the HHsuite authors, or that you could dedicate more resources to it than they already do.
If you still want to go through with this - again, I don't think you should - you may want to consider a different order of steps. To your question #1, I don't think you need to download the whole PDB database of ~180,000 protein structure files, because there is huge redundancy among protein structures. There are ways to download all protein sequences of PDB entries without downloading the structures:
https://ftp.wwpdb.org/pub/pdb/derived_data/
You want the file pdb_seqres.txt. Once you download it, I suggest you remove the redundancy at the sequence level before doing anything with structures. When that is done, you will be left with only a relatively small number of structures to download and process. Keep in mind that "small" is very relative here, because tens of thousands of structures is still a large number.
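For concreteness, here is a minimal sketch of that step in shell, assuming you have wget and MMseqs2 installed (cd-hit would do the job just as well); the 70% identity cutoff mirrors what PDB70 uses:

# Download the combined FASTA of all PDB sequences (no structures needed);
# the file may be gzipped on the server, hence the gunzip step
wget https://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt.gz
gunzip pdb_seqres.txt.gz

# Cluster at 70% sequence identity; one representative per cluster
# ends up in pdbnr_rep_seq.fasta
mmseqs easy-cluster pdb_seqres.txt pdbnr tmp --min-seq-id 0.7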
As to your question #2, PDB structures in most cases contain cross-references to UniProt accessions, though I don't know of an automated pipeline for extracting them. If you look at my favorite structure, you will see after scrolling down that it corresponds to a specific UniProt entry. That information is likely to be present in both PDB and CIF files, and it is simply a matter of parsing it out once you settle on a reasonable number of structures (see the sketch below). My question to you is: why would you want to ignore the mutants and link them to non-mutated UniProt entries? What ultimately matters is the protein sequence in the structure itself, because that is the only thing that can be used for modeling.
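In PDB-format files those cross-references live in DBREF records (mmCIF files have the equivalent _struct_ref category), so a crude extraction can be a one-liner. A sketch, assuming PDB-format files in the current directory and relying on a naive whitespace split that works for typical records:

# Print PDB ID, chain, and UniProt accession from DBREF records;
# field 6 is the reference database name, field 7 the accession
awk '$1 == "DBREF" && $6 == "UNP" {print $2, $3, $7}' *.pdb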
Hello!
Many thanks for your advice and detailed feedback. I was not aware that database preparation for HHblits is so time-consuming and computationally expensive.
I am not very familiar with MSAs and HH-suite. From the HH-suite wiki page, custom library preparation seems a little complicated and tricky to me. I first considered using PDB70 directly, but here are my issues and what I am trying to do:
I have a list of target proteins for which I would like to search for homologous proteins with available 3D structures. I will not model my target proteins; at this first stage I am only interested in a sequence similarity search. I can use PDB70 to search for homologs of my target sequences, but in that case I have to modify the database to remove the target proteins themselves. Thus, (my new) NPDB70 = PDB70 - target proteins.
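Concretely, the search itself would look something like this (target.fasta is a placeholder for one of my sequences, and the pdb70 database prefix is my guess based on the file names in the download):

# Search one query against PDB70; -n sets the number of search iterations
hhblits -i target.fasta -d pdb70/pdb70 -o target.hhr -n 2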
For this, I downloaded PDB70 from the link you shared with me. However, I am a little confused about the content of the different files. For example, I checked the file pdb_filter.dat for one of my target proteins and its PDB ID was there. However, when I checked the other files (db_cs219.ffindex, db_hhm.ffindex and db_a3m.ffindex), the PDB ID was not there. Also, the number of lines in these three files is not the same as in pdb_filter.dat. Do you have an idea why it is like this?
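For reference, this is how I checked, assuming (as it appears locally) that each .ffindex file is plain text with an entry name, offset, and length per line; 1XYZ stands in for one of my target PDB IDs:

# Count index entries, then look for a given PDB ID among the entry names
wc -l db_a3m.ffindex
awk '{print $1}' db_a3m.ffindex | grep -i '1XYZ'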
I understand that the a3m files contain the MSA for each protein sequence in the database, and that the hhm files are the hidden Markov model representations of those MSAs. However, I am a little confused about the information the cs219 file contains. Do you know anything about this?
Following the wiki, to remove my target proteins from the database I would run:
ffindex_modify -s -u -f files.dat <db>_a3m.ffindex
The same command applies to hhm.ffindex and cs219.ffindex. This deletes the entries from the ffindex files; however, the underlying data is still in the ffdata files. This way HHblits and HHsuite won't be able to use them. According to the wiki page, if we want to get rid of them in the ffdata files as well, we may rebuild the database. My question is the following: since these entries were used to generate the various MSAs and HMM profiles in the library, if I build the same database with and without these entries, I would expect different final MSAs and HMM profiles, right? If that is the case, then it is not enough in my case to just remove the entries; I also need to rebuild the database. Is that correct? Many thanks in advance for your help.
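For completeness, here is exactly what I plan to run, assuming all three index files share the same prefix and that files.dat lists one entry name per line:

# Sort (-s) each index, then unlink (-u) every entry listed in files.dat
db=pdb70   # assumed prefix; substitute whatever <db> is locally
for ext in a3m hhm cs219; do
    ffindex_modify -s -u -f files.dat ${db}_${ext}.ffindex
done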