Representative, non-homologous, non-redundant protein database
2.9 years ago
ldpubsec ▴ 80

Hello, I'm examining properties of protein secondary structures. For the analysis and for presenting the results:

What non-redundant (or non-homologous) protein database would you recommend?

I've obtained a list of PDB IDs from PISCES (https://dunbrack.fccc.edu/pisces/) and obtained the protein details from UniProt; however, I'm not sure whether this is the best way.

Thank you in advance

non-homologous representable non-redundant database proteins • 1.8k views
2.9 years ago

UniRef

There are various datasets in that one; read the intro to check which one is most appropriate for your case.


Thank you! So I've extracted UniProtKB IDs from UniRef and used them as a filter for the UniProtKB set of reviewed, experimentally verified proteins with the highest annotation score. Actually, almost none (only a few) of the proteins were filtered out. To me, it makes sense that experimentally verified proteins in UniProtKB would be unique, so that time and money are not wasted on homologs. But is that really the case? The same also happened with the PISCES filter... or am I doing something wrong? Thank you.


Since you're after non-redundant things, I would have gone for UniRef100 (or UniRef90, depending on how stringent/lenient you want to be).

UniProtKB will indeed be fairly non-redundant as well (it could/will still contain homologous sequences though).


If I understand it right, 90% means that sequences that are at least 90% identical are clustered together (i.e. counted as "the same"), so using UniRef50 should enforce more variety (less redundancy) than UniRef90. If this is right, what would be the reason for choosing UniRef90 rather than UniRef50? (E.g. RaptorX uses or recommends, as far as I understand, UniRef30 https://uniclust.mmseqs.com/ from https://github.com/j3xugit/RaptorX-3DModeling/ ). Thank you.
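To make the threshold intuition concrete, here is a toy sketch of greedy clustering at two identity cutoffs. It is only an illustration: the sequences are made up, `identity` is a naive position-by-position comparison that assumes equal-length sequences, and `greedy_cluster` is a hypothetical helper, not how UniRef, CD-HIT, or MMseqs2 actually compute their clusters.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Fraction of matching positions; toy measure for equal-length sequences only."""
    if len(a) != len(b):
        raise ValueError("toy identity needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs: List[str], threshold: float) -> Dict[str, List[str]]:
    """Assign each sequence to the first representative it matches at >= threshold."""
    clusters: Dict[str, List[str]] = {}
    for seq in seqs:
        for rep in clusters:
            if identity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            # No representative was similar enough: seq starts its own cluster.
            clusters[seq] = [seq]
    return clusters


# Made-up sequences, chosen only to show the effect of the cutoff.
toy = ["AAAAAAAAAA", "AAAAAAAAAC", "AAAAACCCCC", "CCCCCCCCCC"]
clusters_90 = greedy_cluster(toy, 0.9)
clusters_50 = greedy_cluster(toy, 0.5)
print(len(clusters_90), len(clusters_50))
```

With these toy inputs the lower cutoff merges more sequences into each cluster, so fewer representatives remain, which matches the reading that a 50% set is less redundant (and smaller) than a 90% set.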

2.9 years ago
Mensur Dlakic ★ 28k

It depends whether you are interested only in protein sequences for which PDB structures are available, or protein sequences in general. Your mention of PISCES makes me think the former.

I haven't looked at PISCES in at least 4-5 years, so I am not sure if they fixed the issue I will write about below. Their list of non-redundant sequences does not always deliver on the presumed sequence identity cutoffs. When you select, for example, a 40% identity cutoff, none of the sequences that remain in the list should have higher than 40% identity. In my experience that didn't use to be the case, and they would still have a good number of redundant sequences left. I have looked at other papers that have manually curated their non-redundant lists, and some of them are better than others.

In the end, I made my own non-redundant lists by using PDBFINDER, but that program is poorly supported and very slow, so I don't necessarily recommend it.

It is not that difficult to make a non-redundant list based on protein sequences - simply go here and download pdb_seqres.txt. Make sure only to use the sequences that have mol:protein in their headers. Depending on the redundancy cutoff desired, either CD-HIT or MMseqs2 will easily cut down that file to something like 5000-10000 sequences. The problem comes from chain breaks, residue modifications, missing residues in general, or poorly resolved structures - none of these are recorded in those simple sequence files. It takes a long time to do this right, so you may want to pick a reputable recent paper that shares sequences used for training and testing, and simply re-use their datasets.
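As a rough sketch of the first step described above, the following Python keeps only the records whose header carries the `mol:protein` token. It assumes headers of the form `>101m_A mol:protein length:154  MYOGLOBIN`; the function name `iter_protein_records` and the example records are made up for illustration.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def iter_protein_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) for FASTA records tagged mol:protein."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None and "mol:protein" in header.split():
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # Emit the final record.
    if header is not None and "mol:protein" in header.split():
        yield header, "".join(chunks)


# Made-up example records; a real run would iterate over open("pdb_seqres.txt").
example = [
    ">101m_A mol:protein length:5  MYOGLOBIN",
    "MVLSE",
    ">1abc_B mol:na length:4  DNA",
    "ACGT",
]
records = list(iter_protein_records(example))
print(records)
```

The surviving records can then be written back out as FASTA and passed to CD-HIT or MMseqs2. This does nothing about chain breaks, modified residues, or missing residues, which, as noted above, are not recorded in the sequence file.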


Thank you for the information. Yes, I'm working with secondary structures of well-annotated proteins, which I obtain from UniProt. I extracted the sequences and filtered them with CD-HIT at 0.90; this filtered out more sequences than PISCES or UniRef50 did and also led to better results. Now I'm trying MMseqs2, as lieven.sterck also mentioned. What values should I use for --min-seq-id and -c? I'll also check the training papers, that sounds like a great idea. Does any particular article come to mind? Thank you.
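For reference, a minimal sketch of how the two options fit into an MMseqs2 clustering call, built from Python. The default values 0.3 and 0.8 are placeholders for illustration only, not a recommendation from this thread, and the file names and the helper `build_mmseqs_cluster_cmd` are hypothetical. `--min-seq-id` sets the minimum sequence identity and `-c` the minimum alignment coverage, both as fractions between 0 and 1.

```python
from typing import List


def build_mmseqs_cluster_cmd(
    fasta: str,
    out_prefix: str,
    tmp_dir: str,
    min_seq_id: float = 0.3,  # placeholder value, not a recommendation
    coverage: float = 0.8,    # placeholder value, not a recommendation
) -> List[str]:
    """Build the argument list for `mmseqs easy-cluster`."""
    for name, value in (("min_seq_id", min_seq_id), ("coverage", coverage)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a fraction between 0 and 1")
    return [
        "mmseqs", "easy-cluster", fasta, out_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),
        "-c", str(coverage),
    ]


cmd = build_mmseqs_cluster_cmd("proteins.fasta", "clusterRes", "tmp")
print(" ".join(cmd))
# To actually run it (requires mmseqs on PATH):
# import subprocess; subprocess.run(cmd, check=True)
```

Which cutoffs are appropriate depends on how much redundancy the downstream analysis can tolerate, so the values are best chosen to match the dataset being compared against.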
