Filtering AA sequences of PDB structures by sequence identity
1
1
Entering edit mode
2.5 years ago

Dear all,

I came across a problem that I thought would be easier to solve. What I want to do is accessing the PDB and download structures associated with a certain protein family. I then want to align those structures. So far so easy. BUT of course for some members of the protein family there is several structures, of which I want only one. It doesn't matter which one, but it has to be a unique set of members of the protein family. I am not posting my code here because this is not where I need help. I would need advise to a approach that is feasible and biologically acceptable, if that makes sense.

My approach so far:

  • Query PDB for structures associated with protein family. This gives me a list of PDB IDs
  • Download Fasta files associated with PDB IDs
  • Construct MSA using Clustalo
  • calculate pairwise sequence identity for each pair of sequences in the alignment --> yields a numpy array with 0 --> 1 for sequence identity between pairs
  • filter array by cutoff set to 0.95 (yes some members of the family have very high sequence similarity) and delete everything higher than 0.95 sequence identity
  • extract indices from array to use as unique set

Here is the problem: I significantly reduce the number of sequences, but I end up with some duplicates, because sometimes the sequences of the PDB entries of the same protein have different length etc., leading to lower sequence identity. Meaning I miss those.

If anyone has a smart idea to filter a list of PDBs to obtain one unique structure for every protein, please don't hesitate to post it here. Any input is highly welcome!

identity python sequence • 1.1k views
ADD COMMENT
0
Entering edit mode

Couple of thoughts that may be in some way useful:

  1. Can you change your alignment process/filter your sequences such that you require at least one gap/mismatch so you never obtain an exact duplicate? (Not sure if this will affect your downstream task). It sounds like %ID isn't sufficient alone for this.
  2. Maybe clustering via CD-HIT or doing reciprocal alignments would be a better alternative? The perfect substring matches should identify as high scoring in at least one 'direction' of the search?
  3. Do some pre-filtering set manipulations to obtain just the longest sequence for any given PDB ID, so you don't have any duplicate PDB IDs with shorter seqs?
ADD REPLY
2
Entering edit mode
2.5 years ago
Mensur Dlakic ★ 28k

I think your solution is sound in general, but it doesn't strike me as very efficient. As you discovered, it is also prone to some issues due to different lengths of otherwise similar structures.

I suggest you start with all FASTA sequences extracted from PDB structures. Those are in the following file (current as of July 1st, 2022):

https://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt

You can use that file to perform a BLAST search from a given sequence or a set of sequences, which will give you a set of PDB ID matches. If you already know all the IDs, you can skip this part.

People are continuously culling PDB sequences using various quality and sequence identity criteria. Those datasets can be found on PISCES server:

http://dunbrack.fccc.edu/pisces/

If you click on "Download precompiled ..." link, there will be many files to choose from based on sequence identity you wish and various quality criteria.

Let's take this file:

cullpdb_pc50.0_res0.0-1.5_noBrks_len40-10000_R0.25_Xray_d2022_06_19_chains4865.fasta

This means all protein chains with length 40-10000 residues (len40-10000), without any breaks (noBrks), resolution 0-1.5 angstroms (res0.0-1.5), R factor =< 0.25 (R0.25), and with at most 50% identity (pc50.0). There are 4865 such protein chains, and you can compare the list of IDs obtained in the first step to this group, which will automatically guarantee in this case that your protein sequences are at most 50% identical, plus all the other quality criteria. It is up to you to select a group of sequences that best matches your needs.

ADD COMMENT

Login before adding your answer.

Traffic: 1930 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6