Question

Representative, non-homologous, non-redundant protein database

0

2.9 years ago

ldpubsec ▴ 80

Hello, I'm examining properties of protein secondary structures, and for the analysis and for presenting the results I'd like to know:

What non-redundant (or non-homologous) protein database would you recommend?

I've obtained a list of PDB IDs from PISCES (https://dunbrack.fccc.edu/pisces/) and retrieved the protein details from UniProt, but I'm not sure whether this is the best approach.

Thank you in advance

non-homologous representable non-redundant database proteins • 1.8k views

ADD COMMENT • link updated 10 weeks ago by pduda • 0 • written 2.9 years ago by ldpubsec ▴ 80

score 1 · Answer 1 · 2021-12-11

1

2.9 years ago

lieven.sterck 15k

UniRef

There are various datasets in that one; read the intro to check which one is most appropriate for your case.

ADD COMMENT • link 2.9 years ago by lieven.sterck 15k

0

Thank you! So I've extracted UniProtKB IDs from UniRef and used them as a filter on the UniProtKB set of reviewed, experimentally verified proteins with the highest annotation score. Almost none of the proteins (only a few) were filtered out. To me it makes sense that experimentally verified proteins in UniProtKB would be mostly unique, since nobody wants to spend time and money on homologs. But is that really the case? The same thing happened with the PISCES filter... or am I doing something wrong? Thanks.

ADD REPLY • link 2.9 years ago by ldpubsec ▴ 80

1

Since you're after non-redundant things, I would have gone for UniRef100 (or UniRef90, depending on how stringent/lenient you want to be).

UniProtKB will indeed be fairly non-redundant as well (it could/will still contain homologous sequences, though).

ADD REPLY • link 2.9 years ago by lieven.sterck 15k

0

If I understand it right, 90% means that sequences that are at least 90% identical are clustered together (i.e. counted as "the same"), so using UniRef50 should enforce more variety (less redundancy) than UniRef90. If this is right, what would be the reason for choosing UniRef90 rather than UniRef50? (E.g. RaptorX uses or recommends, as far as I understand, UniRef30 https://uniclust.mmseqs.com/ from https://github.com/j3xugit/RaptorX-3DModeling/ .) Thanks.

ADD REPLY • link 2.9 years ago by ldpubsec ▴ 80

score 1 · Answer 2 · 2021-12-11

It depends whether you are interested only in protein sequences for which PDB structures are available, or protein sequences in general. Your mention of PISCES makes me think the former.

I haven't looked at PISCES in at least 4-5 years, so I am not sure if they fixed the issue I will write about below. Their list of non-redundant sequences does not always deliver on the presumed sequence identity cutoffs. When you select, for example, a 40% identity cutoff, no pair of sequences remaining in the list should share more than 40% identity. In my experience that didn't use to be the case, and a good number of redundant sequences were still left. I have looked at other papers that have manually curated their non-redundant lists, and some of them are better than others.
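For anyone who wants to spot-check a downloaded list against its stated cutoff, here is a minimal sketch. It is not how PISCES computes identity: it assumes a plain Needleman-Wunsch global alignment with a made-up scoring scheme (match +1, mismatch -1, gap -1) and defines identity as identical columns divided by alignment length. The function names are my own, and the all-vs-all loop is only practical for a small sample of sequences.

```python
from itertools import combinations


def global_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Identity from a simple Needleman-Wunsch alignment.

    Assumption: identity = identical aligned columns / total alignment columns.
    Other tools use other definitions, so numbers will not match them exactly.
    """
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back one optimal path, counting identical columns
    i, j = n, m
    identical = 0
    columns = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            identical += a[i - 1] == b[j - 1]
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        columns += 1
    return identical / columns if columns else 0.0


def pairs_above_cutoff(seqs, cutoff):
    """Return (id1, id2, identity) for every pair whose identity exceeds cutoff.

    seqs is a dict mapping sequence IDs to sequences; cutoff is a fraction (0.4 = 40%).
    """
    hits = []
    for (id1, s1), (id2, s2) in combinations(seqs.items(), 2):
        ident = global_identity(s1, s2)
        if ident > cutoff:
            hits.append((id1, id2, ident))
    return hits
```

If `pairs_above_cutoff` returns anything for a list that is supposed to be culled at that cutoff, those pairs are worth inspecting with a proper alignment tool before trusting the list.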

In the end, I made my own non-redundant lists by using PDBFINDER, but that program is poorly supported and very slow, so I don't necessarily recommend it.

It is not that difficult to make a non-redundant list based on protein sequences - simply go here and download pdb_seqres.txt. Make sure only to use the sequences that have mol:protein in their headers. Depending on the redundancy cutoff desired, either CD-HIT or MMseqs2 will easily cut down that file to something like 5000-10000 sequences. The problem comes from chain breaks, residue modifications, missing residues in general, or poorly resolved structures - none of these are recorded in those simple sequence files. It takes a long time to do this right, so you may want to pick a reputable recent paper that shares sequences used for training and testing, and simply re-use their datasets.
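The filtering step can be sketched roughly as follows. This assumes the usual pdb_seqres.txt layout, where each header looks like `>101m_A mol:protein length:154  MYOGLOBIN` and the sequence follows on the next line(s). The function names, the `min_length` option, and the file paths are my own illustrative choices, not part of any tool.

```python
def iter_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def protein_records(lines, min_length=0):
    """Keep only records whose header carries the mol:protein field.

    min_length is a hypothetical extra filter for dropping very short
    peptides; set it to 0 to keep everything.
    """
    for header, seq in iter_fasta(lines):
        if "mol:protein" in header.split() and len(seq) >= min_length:
            yield header, seq


def write_filtered(in_path, out_path, min_length=0):
    """Write the protein-only subset of in_path to out_path; return the record count."""
    count = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in protein_records(src, min_length):
            dst.write(f">{header}\n{seq}\n")
            count += 1
    return count
```

Calling something like `write_filtered("pdb_seqres.txt", "pdb_protein.fasta")` would give a protein-only FASTA file that can then be passed to CD-HIT or MMseqs2 for clustering at the desired identity cutoff. Note that this does nothing about chain breaks, missing residues, or resolution, which still need to be handled separately.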