Using SwissProt vs NR database for PSSM generation
4 months ago
jet • 0

Is it possible to use SwissProt as a substitute for the NR database for PSSM generation? That is, given a protein sequence, I want to generate a PSSM using PSI-BLAST. I am asking because I am trying to develop a solution that can ideally run locally on my machine without a compute cluster, so the smaller the database, the better.

However, I have two concerns:

  1. Is SwissProt too small for PSSM generation? SwissProt has about 500k sequences, while NR has about 500M.
  2. Will a PSSM generated from SwissProt be biased in any way?
PSSM SwissProt NR
4 months ago
  1. Ideally you want a non-redundant sequence set for this task, which is why NR is the default choice for many applications. A smaller alternative is UniRef50, which still takes about 25 GB (see https://www.biostars.org/p/9591499/#9591570). If computing resources are the limitation, you could even run PSI-BLAST remotely (see https://www.biostars.org/p/44096/), though each query will then take some time.

  2. SwissProt coverage is uneven, because some organisms are curated much more deeply than others. For instance, among plants, it contains ~16K reviewed proteins from the model species Arabidopsis thaliana but only 46 from Brachypodium distachyon, another model species. This will taxonomically bias sequence searches against SwissProt. However, according to https://www.uniprot.org/help/redundancy, SwissProt contains "one record per gene in one species", so it is non-redundant in that sense. That is probably why some tools can use it to compute PSSMs; see for instance https://toolkit.tuebingen.mpg.de/tools/hmmer.
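
To check how uneven the coverage is in your own copy of SwissProt, you can count entries per organism from the FASTA headers. This is a rough sketch that assumes UniProt-style headers, where the organism name sits in an `OS=` field followed by `OX=`; the function name `count_organisms` and the example headers are made up for illustration.

```python
import re
from collections import Counter

# Assumes UniProt-style FASTA headers, e.g.
# >sp|ACCESSION|ENTRY_NAME Description OS=Organism name OX=taxid ...
_OS_FIELD = re.compile(r"\bOS=(.+?)\s+OX=")


def count_organisms(fasta_lines):
    """Count sequences per organism from an iterable of FASTA lines."""
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        match = _OS_FIELD.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Illustrative headers only; the accessions are not real entries.
    example = [
        ">sp|X00001|AAA_ARATH Example protein A OS=Arabidopsis thaliana OX=3702 PE=1 SV=1",
        "MSEQ",
        ">sp|X00002|BBB_ARATH Example protein B OS=Arabidopsis thaliana OX=3702 PE=1 SV=1",
        "MSEQ",
        ">sp|X00003|CCC_BRADI Example protein C OS=Brachypodium distachyon OX=15368 PE=2 SV=1",
        "MSEQ",
    ]
    for organism, n in count_organisms(example).most_common():
        print(organism, n)
```

In practice you would pass an open file handle for the SwissProt FASTA file instead of the example list.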

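For the local run itself, a minimal sketch of calling PSI-BLAST against SwissProt and writing an ASCII PSSM could look like the following. It assumes BLAST+ is installed with `psiblast` on the PATH and that a protein BLAST database named `swissprot` is available locally; the file names, the helper `build_psiblast_command`, and the parameter values (3 iterations, inclusion threshold 0.001, 4 threads) are placeholders, not recommendations.

```python
import shlex
import subprocess


def build_psiblast_command(query_fasta, db="swissprot", pssm_out="query.pssm",
                           iterations=3, inclusion_ethresh=0.001, threads=4):
    """Build a psiblast command line that writes an ASCII PSSM.

    All default values are assumptions for illustration only.
    """
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", db,                                  # e.g. swissprot, or a UniRef50/NR database
        "-num_iterations", str(iterations),
        "-inclusion_ethresh", str(inclusion_ethresh),
        "-num_threads", str(threads),
        "-out_ascii_pssm", pssm_out,
    ]


def run_psiblast(query_fasta, **kwargs):
    """Run psiblast; raises CalledProcessError if the search fails."""
    cmd = build_psiblast_command(query_fasta, **kwargs)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_psiblast_command("query.fasta")))
```

Swapping `db` between SwissProt and a larger database for a handful of test proteins is one way to see how much the resulting PSSMs differ for your use case.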