Create a custom sequence database for HHblits from the PDB
3.0 years ago
sizeineb • 0

Hello,

I am interested in creating a custom sequence database from the PDB for use with HHblits. Since I am not familiar with this tool, I have some questions:

1- The custom sequence database should contain only sequences of proteins for which a 3D structure is available. The HH-suite user guide uses rsync to download the structures. However, that command will download all entries in the PDB, not only the ones corresponding to protein structures but also the ones containing only nucleic acids, right? If this is the case, is there a similarly easy way to download only the protein files in CIF format?

2- The next step is to generate the protein sequences. The guide uses cif2fasta.py for this. Since proteins in the PDB may contain mutations and missing parts, is there a way to obtain the FASTA sequences of the downloaded proteins as they are in the UniProt database?

Many thanks in advance for your help.

HH-suite PDB MSA FASTA hhblit • 1.8k views
3.0 years ago
Mensur Dlakic ★ 28k

First of all, I want to recommend strongly that you not do what you are planning. I have been building HHblits-like databases of PDB structures on a monthly basis since 2005. Back then there were other tools to gather and align members, but eventually I switched the whole thing to HHblits. This database has over 100,000 HMMs and gets 300-400 new members each month. Just a monthly update is a fairly large undertaking that requires a lot of computer time and a fairly large amount of RAM. I can't imagine doing it from scratch on anything smaller than a super-cluster, and it would still take many months. Besides, HHsuite already has such a database based on PDB structures and clustered at 70% identity:

http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/

The latest version is from Nov 17th, 2021 so it isn't even a week old. Please don't take offense, but I can't imagine that you would do a better job at it than HHsuite authors, or that you can dedicate more resources to it than what they already do.

If you still want to go through with this - again, I don't think you should - you may want to consider a different order of steps. To your question #1, I don't think you need to download the whole PDB database, which would mean looking at ~180,000 files that are protein structures, because there is huge redundancy among protein structures. There are ways to download all protein sequences of PDB entries without downloading the structures.

https://ftp.wwpdb.org/pub/pdb/derived_data/

You want the file pdb_seqres.txt. Once you download it, I suggest you remove the redundancy at a sequence level before doing anything with structures. When that is done, it will give you only a relatively small number of structures to download and process. Keep in mind that this is very relative, because tens of thousands of structures is still a large number.
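As a rough illustration of that first filtering step, here is a minimal Python sketch. It assumes pdb_seqres.txt is FASTA-like, with headers of the form ">101m_A mol:protein length:154  MYOGLOBIN" and nucleic acid chains marked mol:na. It removes only exact duplicate sequences; real redundancy removal at, say, 70% identity would need a clustering tool, which is not shown. The function names and the sample records are made up for illustration.

```python
def read_seqres(lines):
    """Yield (header, sequence) pairs from pdb_seqres.txt-style lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def unique_protein_chains(lines, min_length=1):
    """Map each distinct protein sequence to the first chain ID carrying it.

    Chains not tagged mol:protein (assumed header convention) are skipped.
    """
    representatives = {}
    for header, seq in read_seqres(lines):
        fields = header.split()
        if "mol:protein" not in fields or len(seq) < min_length:
            continue
        representatives.setdefault(seq, fields[0])
    return representatives


# Made-up sample records, shortened for illustration only.
sample = [
    ">101m_A mol:protein length:8  MYOGLOBIN",
    "MVLSEGEW",
    ">1xyz_B mol:protein length:8  MYOGLOBIN",
    "MVLSEGEW",
    ">1abc_C mol:na length:4  DNA",
    "ACGT",
]
reps = unique_protein_chains(sample)
print(reps)

# On the real file (after decompressing it, if it was downloaded gzipped):
# with open("pdb_seqres.txt") as fh:
#     reps = unique_protein_chains(fh)
```

The chain IDs left in `reps` would then be the candidates for clustering and, later, for structure download.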

As to your question #2, PDB structures in most cases contain links to UniProt numbers, though I don't know of an automatic way to extract them. If you look at my favorite structure, you will see after scrolling down that this structure corresponds to this UniProt entry. That information is likely to be present both in PDB and CIF files and is simply a matter of parsing it out once you settle on a reasonable number of structures. My question to you is why would you want to ignore the mutants and link them to non-mutated UniProt entries? What matters ultimately is the protein sequence in the structure itself, because that is the only thing that can be used for modeling.
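For what the parsing could look like, here is a small Python sketch for legacy PDB-format files. It assumes the UniProt cross-reference sits in fixed-column DBREF records with the database field set to UNP, and it ignores the DBREF1/DBREF2 variants used for longer accessions. mmCIF files carry the same kind of information in a different layout that is not handled here. The function name and the sample record are illustrative only.

```python
def uniprot_accessions(pdb_text):
    """Return {chain ID: [UniProt accessions]} from DBREF records.

    Uses the fixed PDB-format columns: chain ID in column 13, database
    name in columns 27-32, accession in columns 34-41.
    """
    mapping = {}
    for line in pdb_text.splitlines():
        # "DBREF " with a trailing space excludes DBREF1 / DBREF2 records.
        if not line.startswith("DBREF "):
            continue
        if line[26:32].strip() != "UNP":
            continue
        chain = line[12]
        accession = line[33:41].strip()
        mapping.setdefault(chain, []).append(accession)
    return mapping


# Illustrative DBREF record assembled field by field so the columns line up.
sample_record = (
    f"DBREF  {'101M':4} {'A':1} {0:4d}  {153:4d}  "
    f"{'UNP':6} {'P02185':8} {'MYG_PHYMC':12} {1:5d}  {154:5d}"
)
chain_to_uniprot = uniprot_accessions(sample_record)
print(chain_to_uniprot)
```

A chain can map to more than one accession (for example in fusion constructs), which is why the values are lists.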


Hello!

Many thanks for your advice and detailed feedback. I was not aware that database preparation for HHblits is that time-consuming and computationally expensive.

I am not very familiar with MSAs and HH-suite. From the wiki page, custom library preparation seems a little complicated and tricky for me. I first considered using PDB70 directly, but here are my issues and what I am trying to do:

  • I have a list of target proteins for which I would like to search for homologous proteins with available 3D structures. I will not model my target proteins. I am only interested in sequence similarity search at a first stage. I can use PDB70 to search for homologous proteins to my target sequences, but in this case, I will have to modify the database to remove the target proteins. Thus, (my new) NPDB70 = PDB70 - target proteins.

  • For this, I downloaded the PDB70 from the link you shared with me. However, I am a little bit confused about the content of the different files. For example, I checked the file pdb_filter.dat for one of my target proteins and the PDB ID was there. However, when I checked the other files (db_cs219.ffindex, db_hhm.ffindex and db_a3m.ffindex), the PDB ID was not there. Also, the number of lines in these three files is not the same as in pdb_filter.dat. Do you have an idea why this is?

  • I understand that the a3m files contain the MSA for each protein sequence in the database, and the hhm files are hidden Markov model representations of each MSA. However, I am a little bit confused about what information the cs219 files contain. Do you know about this?

  • Now, if I want to modify the existing PDB70 and generate my NPDB70, I can do so by removing file entries from the ffindex files using ffindex_modify: ffindex_modify -s -u -f files.dat <db>_a3m.ffindex. The same command applies to hhm.ffindex and cs219.ffindex. This deletes the file entries from the ffindex files; the files themselves remain in the ffdata file, but HHblits and the rest of HH-suite won’t be able to use them. According to the wiki page, if we want to get rid of them in the ffdata file, we may rebuild the databases. My question is the following: since these entries were used to generate the different MSAs and HMM profiles in the library, if I built the same database with and without these entries, I would expect different final MSAs and HMM profiles, right? If so, it is not enough in my case to just remove the entries; I also need to rebuild the database. Is that correct?

  • Regarding retrieving FASTA sequences from UniProt, here is my way of thinking: as I said before, I am interested in sequence similarity search to identify homologous proteins with available 3D structures for my target proteins. In this case I will use the evolutionary information from the MSA. Thus, it is important to use the canonical sequences of the proteins that are in the PDB rather than their sequences from the experimental structures (which may carry engineered mutations). Unless I am missing something, this approach makes sense to me. Can I have your opinion about this, please?
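To make the removal step more concrete, here is a minimal Python sketch of how the files.dat list could be assembled. It assumes each .ffindex line holds three tab-separated fields (entry name, offset, length), that entry names look like 1ABC_A (PDB ID, underscore, chain), and that ffindex_modify reads one entry name per line from the file given with -f. None of this has been checked against the actual PDB70 files, and the function name and sample lines are hypothetical.

```python
def entries_for_targets(ffindex_lines, target_pdb_ids):
    """Return ffindex entry names whose PDB ID is in target_pdb_ids.

    Matching is case-insensitive and compares only the part of the
    entry name before the first underscore (assumed to be the PDB ID).
    """
    targets = {pdb_id.lower() for pdb_id in target_pdb_ids}
    hits = []
    for line in ffindex_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip lines that do not look like index records
        name = fields[0]
        if name.split("_", 1)[0].lower() in targets:
            hits.append(name)
    return hits


# Made-up index lines for illustration.
sample_index = [
    "1ABC_A\t0\t1200\n",
    "2XYZ_B\t1200\t900\n",
    "1ABC_C\t2100\t700\n",
]
to_remove = entries_for_targets(sample_index, ["1abc"])
files_dat = "\n".join(to_remove) + "\n"
print(files_dat, end="")

# With the real index, the same list could be written to files.dat:
# with open("db_a3m.ffindex") as fh:
#     to_remove = entries_for_targets(fh, my_target_ids)
```

An empty result for a target would simply mean no entry name matched. Since PDB70 is clustered at 70% identity, one possible explanation (an assumption here) is that the target is not the representative of its cluster.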

Many thanks in advance for your help.
