How to retrieve sets of protein sequences?
6.1 years ago
Learner ▴ 280

I have a set of protein accession numbers and want to retrieve their sequences programmatically. Is there any way to do that from UniProt?

There is a page describing this on UniProt itself, but to be honest I could not understand it.

Any comment would be appreciated.

Thanks

uniprot • 5.8k views
ADD COMMENT
0

There is a page describing this on UniProt itself, but to be honest I could not understand it.

Please link to the page and specify exactly what you do not understand.

ADD REPLY
4
6.1 years ago
vkkodali_ncbi ★ 3.8k

You can get it from UniProt directly using curl as follows:

$ cat uniprot_ids.txt 
P00750
P00751
P00752

$ for acc in `cat uniprot_ids.txt` ; do curl -s "https://www.uniprot.org/uniprot/$acc.fasta" ; done > uniprot_seqs.fasta

But if you choose to go with Entrez Direct, then I suggest the following command:

$ cat uniprot_ids.txt | epost -db protein | efetch -db protein -format fasta > uniprot_seqs.fasta
ADD COMMENT
1

The curl command from vkkodali_ncbi for downloading from UniProt no longer works. I used the following to download the sequences instead:

time for acc in `cat id.list`; 
do 
curl "https://rest.uniprot.org/uniprotkb/$acc.fasta"; 
done > uniprot_seqs.fasta

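Note: for some proteins given by entry name, such as K2C1_HUMAN, curl does not download the FASTA, apparently because the request is redirected to the primary accession (curl only follows redirects when run with -L). The link changes from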

https://rest.uniprot.org/uniprotkb/K2C1_HUMAN.fasta

to

https://rest.uniprot.org/uniprotkb/P04264.fasta?from=K2C1_HUMAN
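
A minimal sketch of the loop above that also copes with this kind of redirect, assuming the rest.uniprot.org URL pattern quoted in this comment is still valid. The -L option tells curl to follow redirects, and -f makes it return a non-zero status on HTTP errors so that failed IDs can be logged. The function name fetch_uniprot_fasta and the file failed_ids.txt are hypothetical.

```shell
# Sketch: download one FASTA record per ID, following redirects (-L)
# and recording IDs that could not be fetched (-f fails on HTTP errors).
# fetch_uniprot_fasta is a hypothetical helper name, not a UniProt tool.
fetch_uniprot_fasta() {
    local ids_file="$1"
    local out_fasta="$2"
    local failed_file="$3"
    # Start with empty output files.
    : > "$out_fasta"
    : > "$failed_file"
    # Strip Windows carriage returns, then read one ID per line.
    tr -d '\r' < "$ids_file" | while IFS= read -r acc || [ -n "$acc" ]; do
        # Skip blank lines.
        [ -z "$acc" ] && continue
        if ! curl -sfL "https://rest.uniprot.org/uniprotkb/${acc}.fasta" \
                < /dev/null >> "$out_fasta"; then
            echo "$acc" >> "$failed_file"
        fi
    done
}

# Usage (needs network access):
# fetch_uniprot_fasta id.list uniprot_seqs.fasta failed_ids.txt
```

Any ID left in failed_ids.txt can then be checked by hand on the UniProt website.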
ADD REPLY
0

@vkkodali How can I know which protein sequences were downloaded and which were not? Also, should I install epost and efetch? The command gives me an error.

ADD REPLY
0

How can I know which protein sequences were downloaded and which were not?

You will have to use a unix tool like grep to check which IDs are in the fasta file and which ones are missing.
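
A minimal sketch of such a grep check, assuming UniProt-style FASTA headers of the form ">sp|P00750|TPA_HUMAN ...", where the accession sits between two pipe characters. The function name report_missing_ids is hypothetical.

```shell
# Sketch: print every ID from a list that has no matching record in a FASTA file.
# Assumes headers like ">sp|P00750|TPA_HUMAN ..."; sequence lines never contain "|".
# report_missing_ids is a hypothetical helper name.
report_missing_ids() {
    local ids_file="$1"
    local fasta="$2"
    while IFS= read -r id || [ -n "$id" ]; do
        # Skip blank lines.
        [ -z "$id" ] && continue
        # -F: fixed string, -q: quiet. Print the ID only when it is absent.
        grep -qF "|${id}|" "$fasta" || echo "$id"
    done < "$ids_file"
}

# Usage:
# report_missing_ids uniprot_ids.txt uniprot_seqs.fasta
```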

should I install epost and efetch ?

If you have followed the instructions at http://bit.ly/entrez-direct then you should have access to all 9 of the Entrez Direct tools. Make sure you have the edirect tools in your path (these are the last two export commands in the installation instructions). You will have to log out of the terminal and log back in for that to take effect.

Finally, if you see errors related to API keys, see https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/ to create your own API key and use it with the command line tools by setting an environment variable NCBI_API_KEY as follows at the bash prompt:

export NCBI_API_KEY=12345
ADD REPLY
0

@vkkodali Actually, UniProt is what I am using and what I am interested in. However, it is killing me. For example, check these IDs: P21333 O43707 P68363

You can see that they exist in UniProt, but when I use your approach I get nothing. Do you think the format of the .txt file could matter?

Also, is it possible to check the whole ID list against the FASTA file with grep?

ADD REPLY
0

I am able to download the FASTA sequences for all three of those IDs without an issue. How did you create the uniprot_ids.txt file? If you have created it on Windows then you may want to run dos2unix on that file to make sure it's not the line endings that are mucking things up.
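
If dos2unix is not installed, a rough equivalent can be sketched with standard tools. The helper names has_crlf and strip_cr and the output file name are hypothetical.

```shell
# Sketch: detect and remove Windows (CRLF) line endings without dos2unix.
# has_crlf and strip_cr are hypothetical helper names.
has_crlf() {
    # Succeeds if any line in the file contains a carriage return.
    grep -q "$(printf '\r')" "$1"
}

strip_cr() {
    # Write a copy of the input file ($1) to $2 with carriage returns removed.
    tr -d '\r' < "$1" > "$2"
}

# Usage:
# has_crlf uniprot_ids.txt && strip_cr uniprot_ids.txt uniprot_ids.unix.txt
```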

To check whether all of the IDs from uniprot_ids.txt are present in the uniprot_seqs.fasta file, you can run the following command:

comm -23 <(sort uniprot_ids.txt ) <(grep '^>' uniprot_seqs.fasta | cut -f2 -d '|' | sort )

All accessions that are present in your original list and not in the FASTA file will be returned.

ADD REPLY
0

@vkkodali I found where the issue was. Your solution is the best solution.

I think if I want to extract the IDs from the FASTA file, I can simply do the following, right?

grep '^>' uniprot_seqs.fasta | cut -f2 -d '|' | sort
ADD REPLY
0

If you just want to extract the IDs and do not care about their order, you can skip the sort at the end. The IDs will then be returned in the same order they appear in the FASTA file.

ADD REPLY
0

Running this from my command line:

for acc in `cat uniprot_ids.txt` ; do curl -s "https://www.uniprot.org/uniprot/$acc.fasta" ; done > uniprot_seqs.fasta

does not actually do anything. I have the file saved exactly as listed, and the terminal just stays in the same state.

ADD REPLY
0

What do you mean it does not do anything? Please try it without the -s parameter for curl. I just tried it and I see 3 FASTA sequences in the uniprot_seqs.fasta file.

ADD REPLY
0

It was not executing because I am using the fish shell. I needed to wrap the loop in a bash command for it to run:

bash -c 'for acc in `cat protein_ids.txt` ; do curl -s "https://www.uniprot.org/uniprot/$acc.fasta"; done > uniprot_seqs.fasta'
ADD REPLY
1

Yup, that will do it. I never used fish myself but I have seen issues like these come up when I tried to use bash loops in cshell. Glad to hear that it's working for you now.

ADD REPLY
1
6.1 years ago
GenoMax 148k

The simplest solution may be to use NCBI's Unix utilities (Entrez Direct). Pass in your IDs (the ones in the example below are UniProt IDs) one at a time or as a batch, as follows.

$ efetch -db protein -id "P00750,P00751,P00752" -format fasta
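
If the IDs are stored one per line in a file, the comma-separated string for -id can be built from it. This is only a sketch: the helper name ids_to_csv is hypothetical, and the file names are taken from the other answer in this thread.

```shell
# Sketch: join IDs from a file into the comma-separated form passed to -id.
# ids_to_csv is a hypothetical helper name.
ids_to_csv() {
    # Drop carriage returns and blank lines, then join the lines with commas.
    tr -d '\r' < "$1" | grep -v '^$' | paste -sd, -
}

# Usage (requires Entrez Direct to be installed):
# efetch -db protein -id "$(ids_to_csv uniprot_ids.txt)" -format fasta > uniprot_seqs.fasta
```

For long lists, the epost | efetch pipeline shown in the other answer may be a better fit than one very long -id argument.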
ADD COMMENT
0
6.1 years ago
piyushjo ▴ 710

Ensembl biomart.

It accepts many kinds of ID: RefSeq, HGNC, or just a gene name.

http://useast.ensembl.org/biomart/martview/bfc7092adedc70231fd4027a5b8eaaed

  1. Choose data set (Ensembl v94)
  2. Filters: Gene name, ensemble id, refseq id or anything
  3. Attributes: In features choose "gene id" or "gene name" and then in sequences choose "peptide"
  4. hit results
ADD COMMENT
0

It's Ensembl, there's no e at the end.

ADD REPLY
