How to download all sequences of a list of proteins for a particular organism
1
0
Entering edit mode
7.1 years ago
traviata ▴ 20

For every organism here https://www.ncbi.nlm.nih.gov/genome/browse/reference/#, I would like to download the protein sequence of all genes whose /product matches "RNA polymerase subunit" or "ribosomal protein" and put all sequences in a single fasta proteins.fa.

How can I do this using Entrez Direct utilities in terminal? I understand that I can parse the table in the link for the RefSeq IDs, but from there, I'm not sure where to go.

efetch esearch entrez direct • 3.4k views
ADD COMMENT
6
Entering edit mode
7.1 years ago

You can use this Python script that I wrote just now (only tested on Python 2.7).

import sys
import argparse
from Bio import Entrez

parser = argparse.ArgumentParser(description='Searches for protein sequences in the Title Word field ([TITL]) based on any provided key terms.\nSee here for further details: http://cbsu.tc.cornell.edu/resources/seq_comp/pb607_introductory/entrez/ncbi_entrez.html')
parser.add_argument('-e', action='store', dest='EmailAddress', required=True, help='Entrez requires your email address.')
parser.add_argument('-t', action='store', dest='SearchTerm', required=True, help='Requires a search term (wrap in double quotes).')

arguments = parser.parse_args()

Entrez.email = arguments.EmailAddress

SearchTerm = arguments.SearchTerm

#LookupCommand = "refseq[FILTER] AND txid9606[Organism] AND " + SearchTerm + "[TITL]"
LookupCommand = "refseq[FILTER] AND " + SearchTerm + "[TITL]"

handle = Entrez.esearch(db='protein', term=LookupCommand)

results = Entrez.read(handle)

handle.close()

#Lookup the FASTA sequence for each protein by its GeneInfo Identifier (GI) number
for gi in results['IdList']:
    handle = Entrez.efetch(db='protein', id=gi, rettype='fasta')

    print handle.read()

    handle.close()

Execute it with

python ProteinFASTASearchByFASTATitle.py -e Me@MyEmail.com -t "RNA polymerase subunit" > protein.fa

python ProteinFASTASearchByFASTATitle.py -e Me@MyEmail.com -t "ribosomal protein" >> protein.fa

sed -i '/^$/d' protein.fa

head protein.fa
>WP_098657443.1 RNA polymerase subunit sigma-70 [Bacillus toyonensis]
MNQSYSSLNRDESLTRTINLGTTARSIGPLVKPEDENFEVKEIWNYKVLSKQESLNLFRRYKHGEKDLRE
YLFHVNIGLVLSIARKYKKKHPEIEFDDLVQEGNEGMLRAIEDFDPDLGYCFSTYAYCWIKKSMLGFICK
KKSGPFKIPNYVNQFNVKYVEIEDKYLQMHNRIPTVEEVVKELDVTREKVVRHNVYYNWVTTMTLDIDTI
NEDIGILNSFCNDNSAIPSTNEMIMEDLNYEIWIIFDEVLNPKQKMVLNLCFGLLDGEIHLHKEIAKALM
ITTERVSQLKDEAINRLKKCDYKDEIFNLLHAKLKVMDELNMA
>WP_098657164.1 RNA polymerase subunit sigma-70 [Bacillus toyonensis]
MKPATFTETVVLYEGMIVNQIKKLSIYQDHEEYYQCGLIGLWYAYERYEEGKGSFPAYAVITVRGYILER
LKKECIMQERYVCVGEYDEQFESEETGMRAQDFMSVLNKRERHIISERFFVGKKMGEIACEMGMTYYQVR
WIYRQALEKMRDSVKG

The final sed command just deletes empty lines, which the entrez fetch command produces.

This script searches for your term in the [TITL] field in Entrez, which will contain the product name (see here: http://cbsu.tc.cornell.edu/resources/seq_comp/pb607_introductory/entrez/ncbi_entrez.html ). If you want just human sequences, then un-comment the #LookupCommand = "refseq[FILTER] AND txid9606[Organism] AND " + SearchTerm + "[TITL]" line in the script, and comment out the other line beneath it.

Kevin

ADD COMMENT

Login before adding your answer.

Traffic: 1809 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6