Question

Download all bacterial proteins from the same family

0

9.5 years ago

boudica5 • 0

Hello, I need to build a database containing all protein sequences that belong to the same Pfam family. I used blastp to retrieve them, but I would like to know whether this can be done in one step, instead of endlessly downloading every sequence that aligns to the query. Thank you very much in advance.

blast • 2.9k views

ADD COMMENT • link updated 9.5 years ago by Andrzej Zielezinski 11k • written 9.5 years ago by boudica5 • 0

1

If you ran BLAST on the NCBI website, you can download all matching sequences (in a variety of formats): scroll down to the Descriptions section of the BLAST results page, select any (or all) hits, then click the Download button and choose the format you need.

ADD REPLY • link 9.5 years ago by GenoMax 153k

0

Thank you! The problem is that the sequences matching my query are not all of the sequences belonging to the family. If I pick a different query and BLAST it, about 100 new sequences appear alongside those already obtained with the previous query. I do not know how to include ONLY the new ones and skip those already found by the first search.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.5 years ago by boudica5 • 0

0

Are you referring to PSI-BLAST by any chance? If so, you may want to try DELTA-BLAST. It can save you time and also help avoid false positives.

ADD REPLY • link 9.5 years ago by GenoMax 153k

0

No, it is blastp with default settings. When I submit a biochemically characterized protein sequence, it returns a total of 6500 sequences, and when I submit a different, also characterized, sequence it returns 6700. So if I merge the 6500 and the 6700, there are a lot of duplicates that I need to remove.

ADD REPLY • link 9.5 years ago by boudica5 • 0
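
The merge-and-deduplicate step described above can be sketched in Python. This is a minimal illustration, not a recommended pipeline: it assumes both hit sets were downloaded as FASTA and that the first whitespace-delimited token of each header is a stable sequence identifier. The function names `read_fasta` and `merge_unique` are made up for this example.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_unique(*fasta_sources: Iterable[str]) -> List[Tuple[str, str]]:
    """Merge FASTA sources, keeping the first record seen for each ID.

    Assumption: the ID is the first whitespace-delimited token of the header.
    """
    seen: Dict[str, Tuple[str, str]] = {}
    for source in fasta_sources:
        for header, seq in read_fasta(source):
            tokens = header.split()
            seq_id = tokens[0] if tokens else ""
            if seq_id not in seen:
                seen[seq_id] = (header, seq)
    # dicts preserve insertion order, so records come out in first-seen order
    return list(seen.values())
```

To use it, open the two downloaded hit files (for instance, hypothetical `hits_query1.fasta` and `hits_query2.fasta`), pass both file handles to `merge_unique`, and write each `(header, sequence)` pair back out as `>header` followed by the sequence. Deduplicating on the identifier rather than on the sequence keeps identical proteins from different organisms as separate entries; whether that is desirable depends on the analysis.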

Ram · Answer 1 · 2016-02-03

4

9.5 years ago

Andrzej Zielezinski 11k

If you need to retrieve all proteins for only one or a few Pfam families:

Go to Pfam.
Click on Browse.
From the list of families, choose the one you are interested in, for example R3H (PF01424).
Click on Species.
Click on the Tree tab.
Select Eukaryota, Bacteria or whatever you want.
Click Download sequence as FASTA format.

If you want to get all proteins for all Pfam families:

Go to Pfam FTP.
Click current_release.
Download two files: uniprot_sprot.dat.gz and uniprot_trembl.dat.gz.
Unpack the files and parse UniProt protein records (using Python or Perl or whatever you want) to retrieve Pfam families they belong to.
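
As a rough sketch of that parsing step in Python: the UniProt flat-file format puts accessions on `AC` lines, the taxonomic lineage on `OC` lines, Pfam cross-references on lines such as `DR   Pfam; PF01424; R3H; 1.`, and ends each record with `//`. The function names and the optional `taxon` filter below are illustrative choices, not part of any library.

```python
import gzip
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set


def pfam_members(lines: Iterable[str],
                 taxon: Optional[str] = None) -> Dict[str, Set[str]]:
    """Map each Pfam accession to the primary UniProt accessions that carry it.

    If `taxon` is given (e.g. "Bacteria"), only records whose OC lineage
    contains that exact name are counted.
    """
    families: Dict[str, Set[str]] = defaultdict(set)
    accession: Optional[str] = None
    pfam_ids: List[str] = []
    lineage: List[str] = []
    for line in lines:
        if line.startswith("AC   ") and accession is None:
            # The first accession on the first AC line is the primary one.
            accession = line[5:].split(";")[0].strip()
        elif line.startswith("OC   "):
            lineage.extend(t.strip(" .\n") for t in line[5:].split(";"))
        elif line.startswith("DR   Pfam;"):
            pfam_ids.append(line.split(";")[1].strip())
        elif line.startswith("//"):
            # End of record: store it, then reset the per-record state.
            if accession is not None and (taxon is None or taxon in lineage):
                for pfam_id in pfam_ids:
                    families[pfam_id].add(accession)
            accession, pfam_ids, lineage = None, [], []
    return dict(families)


def pfam_members_from_file(path: str,
                           taxon: Optional[str] = None) -> Dict[str, Set[str]]:
    """Run pfam_members directly on a gzipped .dat file, line by line."""
    with gzip.open(path, "rt") as handle:
        return pfam_members(handle, taxon)
```

Reading the gzipped file line by line avoids unpacking it to disk. For the TrEMBL file the resulting dictionary may still be too large to hold comfortably in memory, in which case it is probably better to keep only the family of interest.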

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 9.5 years ago by Andrzej Zielezinski 11k

0

I appreciate your response. With the first option you gave me I obtained a total of 450 sequences, whereas at NCBI almost 6500 sequences match or contain the domain of the Pfam family. It is that full set of sequences I am interested in. Any suggestions? Thank you!

ADD REPLY • link 9.5 years ago by boudica5 • 0

1

NCBI is going to have a number of redundant entries, so don't rely on that number.

For the example @a.zielezinski posted above, I see 805 sequences once all bacteria are selected, but these are only the domain regions, not full-length proteins. If you want full-length protein sequences, you will need to do some additional work.

ADD REPLY • link 9.5 years ago by GenoMax 153k

0

I do not clearly understand how Pfam works. Applying @a.zielezinski's example to the family I'm interested in, I found short sequences that do not include the domain that characterizes the family. If it does not take too much of your time, I would be glad to hear some pointers that would help me get full-length protein sequences.

ADD REPLY • link 9.5 years ago by boudica5 • 0

1

I am assuming that you have downloaded the file of short sequences by following @a.zielezinski's directions and that it is named motif.fa (adjust file names as needed).

Extract the UniProt IDs:

$ grep "^>" motif.fa | awk -F ">" '{print $2}' | awk -F "/" '{print $1}' > uniprot_ID_you_want

Grab the UniProt FASTA-format sequence files from http://www.uniprot.org/downloads (get both TrEMBL and Swiss-Prot).
Download the faSomeRecords utility from Jim Kent (Linux link included, but an OS X version is also available): http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64.v287/faSomeRecords
```
$ chmod u+x faSomeRecords
```
Use the faSomeRecords utility to extract the full-length sequences for the IDs you want from the two UniProt files.

NOTE: This may or may not work, since the TrEMBL file is large (26 GB). If it fails, some other option would be needed.

$ ./faSomeRecords uniprot_ID_you_want uniprot_trembl.fasta sequence_you_need_trembl
$ ./faSomeRecords uniprot_ID_you_want uniprot_sprot.fasta sequence_you_need_sprot
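
If the TrEMBL file turns out to be too large for that approach, one possible fallback is to stream through it in Python and keep only the wanted records. The sketch below mirrors the grep/awk step and the extraction step above. It rests on two assumptions that should be checked against the actual files: that the domain FASTA headers look like `>NAME_SPECIES/start-end`, and that UniProt FASTA headers look like `>tr|ACCESSION|NAME_SPECIES description`. The function names are made up for this example.

```python
import gzip
from typing import IO, Iterable, Iterator, Set


def ids_from_domain_fasta(lines: Iterable[str]) -> Set[str]:
    """Collect the text between '>' and the first '/' of every header line."""
    ids: Set[str] = set()
    for line in lines:
        if line.startswith(">"):
            ids.add(line[1:].split("/")[0].strip())
    return ids


def filter_fasta(lines: Iterable[str], wanted: Set[str]) -> Iterator[str]:
    """Yield the lines of every record whose header matches a wanted ID.

    A header matches when any '|'-separated field of its first token is in
    `wanted`, so ">sp|P12345|NAME_SPECIES ..." matches either "P12345" or
    "NAME_SPECIES" (an assumption about the header layout).
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            fields = tokens[0].split("|") if tokens else []
            keep = any(field in wanted for field in fields)
        if keep:
            yield line


def open_text(path: str) -> IO[str]:
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)
```

A typical run would read the IDs with `ids_from_domain_fasta(open_text("motif.fa"))`, then write every line yielded by `filter_fasta(open_text("uniprot_trembl.fasta"), wanted)` to an output file, and repeat for `uniprot_sprot.fasta`. Only one line is held in memory at a time, so the size of the TrEMBL file affects the run time but not the memory use.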

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.5 years ago by GenoMax 153k