Question

How to BLAST more than 5 sequences against UniProt database?

0

7 months ago

Riq ▴ 50

I have more than 100 sequences that I want to BLAST against the UniProt database. However, the web version of BLAST (https://www.uniprot.org/blast) limits the number of sequences to 5. Is there a programmatic way to BLAST against the UniProt database, or another approach for searching many sequences at once?

BLAST Protein UniProt • 1.4k views

ADD COMMENT • link 7 months ago by Riq ▴ 50

1

If you are willing to forgo the TrEMBL part, you could use Swiss-Prot in the NCBI protein web BLAST.

ADD REPLY • link 7 months ago by GenoMax 147k

0

Thanks! This is a fast way to check whether the manually annotated entries are sufficient as a database.

ADD REPLY • link 7 months ago by Riq ▴ 50

1

7 months ago

Elisabeth Gasteiger ★ 2.4k

If you cannot use local BLAST as suggested above (which may be an excellent alternative, be it with UniProtKB, UniRef or just UniProtKB/Swiss-Prot), UniProt recommends programmatic access as described in this EBI help page: https://ebi-biows.gitdocs.ebi.ac.uk/documentation/webservices/

Go to the "Sequence similarity search" section and select NCBI BLAST+.
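As a rough illustration of what such programmatic access could look like, here is a minimal sketch using curl. The endpoint layout (`run`, `status`, `result`), the parameter names, and the database name `uniprotkb_swissprot` are assumptions based on the EBI Job Dispatcher REST interface and should be checked against the help page above. The helper names `split_fasta` and `blast_one` and the variables `EMAIL` and `QUERY` are hypothetical. The sketch also assumes one sequence per job, so the multi-FASTA file is split first.

```shell
#!/bin/sh
# Sketch only: endpoint paths, parameter names and the database name are
# assumptions; verify them against the EBI web services documentation.
BASE="https://www.ebi.ac.uk/Tools/services/rest/ncbiblast"

# Split a multi-FASTA file ($1) into one file per sequence inside directory $2
# (seq_1.fasta, seq_2.fasta, ...).
split_fasta() {
    awk -v dir="$2" '
        /^>/ { if (out) close(out); out = sprintf("%s/seq_%d.fasta", dir, ++n) }
        out  { print > out }
    ' "$1"
}

# Submit one sequence file, poll until the job stops waiting or running,
# then save the plain-text report next to the input file.
blast_one() {
    job=$(curl -s "$BASE/run" \
        --data-urlencode "email=$EMAIL" \
        --data-urlencode "program=blastp" \
        --data-urlencode "stype=protein" \
        --data-urlencode "database=uniprotkb_swissprot" \
        --data-urlencode "sequence@$1") || return 1
    while :; do
        status=$(curl -s "$BASE/status/$job")
        case $status in
            QUEUED|PENDING|RUNNING) sleep 10 ;;   # be polite: wait between polls
            *) break ;;
        esac
    done
    [ "$status" = "FINISHED" ] && curl -s "$BASE/result/$job/out" > "$1.blast.txt"
}

# Nothing is submitted unless the caller sets EMAIL and QUERY.
if [ -n "${EMAIL:-}" ] && [ -s "${QUERY:-}" ]; then
    workdir=$(mktemp -d)
    split_fasta "$QUERY" "$workdir"
    for f in "$workdir"/seq_*.fasta; do blast_one "$f"; done
fi
```

Under these assumptions it would be run as `EMAIL=you@example.org QUERY=input.fasta sh script.sh`. For a few hundred sequences it is worth submitting jobs sequentially, as done here, rather than flooding the service.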

ADD COMMENT • link 7 months ago by Elisabeth Gasteiger ★ 2.4k

score 4 · Accepted Answer · 2024-04-02

4

7 months ago

b.contreras.moreira ▴ 310

For any number of query sequences, if you have room on your hard drive and are willing to use the command line, you can download UniRef FASTA files from https://ftp.uniprot.org/pub/databases/uniprot/current_release/uniref, format them with makeblastdb, and run BLASTP/BLASTX against them with something like:

makeblastdb -in uniref50.fasta -dbtype prot
blastp -query input.fasta -db /path/to/uniref50.fasta

Note that UniRef sets contain clusters of UniProt and UniParc sequences; read more at https://www.uniprot.org/help/uniref. They are available with 100%, 90% or 50% sequence identity cutoffs, and the corresponding compressed FASTA files take 99GB, 43GB and 12GB respectively.
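A slightly fuller sketch of the same workflow, in case tabular output is more convenient for 100+ queries. The file names, the E-value threshold, the thread count and the helper names `run_search` and `best_hits` are illustrative assumptions, not recommendations. The downloaded `.fasta.gz` file is assumed to have been decompressed already.

```shell
#!/bin/sh
# Sketch only: file names and thresholds below are illustrative assumptions.
DB_FASTA="${DB_FASTA:-uniref50.fasta}"   # decompressed UniRef FASTA
QUERY="${QUERY:-input.fasta}"            # multi-FASTA with all query proteins
OUT="${OUT:-hits.tsv}"                   # tabular BLAST output

# Build the database once, then search all queries in a single run.
run_search() {
    makeblastdb -in "$DB_FASTA" -dbtype prot &&
    blastp -query "$QUERY" -db "$DB_FASTA" \
        -evalue 1e-5 -max_target_seqs 10 -num_threads 4 \
        -outfmt "6 qseqid sseqid pident qcovs evalue bitscore" \
        -out "$OUT"
}

# Keep only the highest-bit-score hit per query (column 6 = bitscore).
best_hits() {
    sort -t "$(printf '\t')" -k1,1 -k6,6nr "$1" | awk -F '\t' '!seen[$1]++'
}

# Run only when BLAST+ and both input files are actually present.
if command -v blastp >/dev/null 2>&1 && [ -s "$DB_FASTA" ] && [ -s "$QUERY" ]; then
    run_search && best_hits "$OUT"
fi
```

The custom `-outfmt 6` column list keeps percent identity, query coverage, E-value and bit score in one row per hit, which makes it easy to filter or to pick a top hit per query afterwards.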

ADD COMMENT • link 7 months ago by b.contreras.moreira ▴ 310

1

@b.contreras.moreira The uncompressed UniRef100 FASTA file is 943 GB, which is far beyond my local computer's storage, and the uncompressed UniRef50 FASTA file is 24.4 GB. I realized that my sequences are all supposed to be human proteins, so I downloaded the human proteome (Swiss-Prot + TrEMBL) from UniProt and ran BLAST locally, which is a more efficient approach.

ADD REPLY • link 7 months ago by Riq ▴ 50

0

Thanks, it is very helpful. Will there be a difference in the final output (E-value, percent identity, query coverage) if UniRef50 is used instead of UniRef100, other than faster sequence similarity searches?

ADD REPLY • link 7 months ago by Riq ▴ 50

0

Sequences within a UniRef50 cluster share at least 50% sequence identity with the cluster's seed sequence, so each cluster groups together a broader range of protein sequences, including more distant homologs. UniRef100 clusters contain only sequences that are 100% identical, so they are much narrower and represent the closest possible relatives. Hence, the output will change depending on which UniRef set you are using.

ADD REPLY • link 7 months ago by atharvakarkare14 ▴ 40

0

Will there be a difference in the final output (E-value, percent identity, query coverage) if UniRef50 is used instead of UniRef100

Yes, since the database content is going to be different, as shown by the size differences noted above. BLAST depends on the database contents to generate statistics and alignments.
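To make the dependence on database size concrete, here is the standard Karlin-Altschul relation between bit score and E-value, written as a sketch. The symbols are introduced here for illustration: $m$ is the effective query length, $n$ the effective total length of the database, and $S'$ the bit score of an alignment.

```latex
E = m \, n \, 2^{-S'}
\qquad\Longrightarrow\qquad
\frac{E_{\mathrm{UniRef100}}}{E_{\mathrm{UniRef50}}}
  = \frac{n_{\mathrm{UniRef100}}}{n_{\mathrm{UniRef50}}}
\quad \text{(same query, same } S'\text{)}
```

In other words, the same alignment with the same bit score gets a proportionally larger E-value in the larger database. Percent identity and query coverage of a given query-subject pair do not depend on database size, but the list of subjects reported can differ because UniRef50 keeps only one representative per cluster.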

ADD REPLY • link 7 months ago by GenoMax 147k