This Python script extracts FASTA protein records for IDs of the form Q66LE6, Q9UKV3, etc. It assumes that everything is human. You can also search using names such as 2ABD, ACINU, etc., but that returns more hits.
Only tested on Python 2.7.
import argparse
from Bio import Entrez

parser = argparse.ArgumentParser(description='Searches for a human protein sequence by any provided ID or accession number.')
parser.add_argument('-f', action='store', dest='SearchTerms', required=True, help='The column number containing the search terms in the provided file (starting at 1).')
parser.add_argument('-e', action='store', dest='EmailAddress', required=True, help='Entrez requires your email address.')
parser.add_argument('InputFile', help='Input file')
arguments = parser.parse_args()

Entrez.email = arguments.EmailAddress
# Convert the 1-based column number to a 0-based index
iSearchTerm_Col = int(arguments.SearchTerms) - 1

with open(arguments.InputFile, 'r') as InputFile:
    for line in InputFile:
        fields = line.split()
        # Skip blank lines and lines that do not have the requested column
        if len(fields) <= iSearchTerm_Col:
            continue
        LookupTerm = fields[iSearchTerm_Col]
        # Restrict the search to human (taxon 9606) RefSeq proteins
        LookupCommand = 'refseq[FILTER] AND txid9606[Organism] AND {}'.format(LookupTerm)
        handle = Entrez.esearch(db='protein', term=LookupCommand)
        results = Entrez.read(handle)
        handle.close()
        # Look up the FASTA sequence for each protein by its GeneInfo Identifier (GI) number
        for gi in results['IdList']:
            handle = Entrez.efetch(db='protein', id=gi, rettype='fasta')
            print(handle.read())
            handle.close()
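
The two per-line steps of the script (picking a column and building the search term) can also be pulled into small functions, which makes them easy to check without contacting NCBI. This is only a sketch: the names extract_term and build_query and the taxid parameter are hypothetical and not part of the script above.

```python
def extract_term(line, column_index):
    """Return the whitespace-separated field at column_index, or None if it is missing."""
    fields = line.split()
    if column_index < 0 or column_index >= len(fields):
        return None
    return fields[column_index]


def build_query(term, taxid=9606):
    """Build an Entrez search term restricted to RefSeq proteins of one organism."""
    return 'refseq[FILTER] AND txid{}[Organism] AND {}'.format(taxid, term)


# Blank lines yield None and are dropped before any query is built
lines = ['Q66LE6\n', '\n', 'Q9UKV3 extra\n']
terms = [extract_term(line, 0) for line in lines]
queries = [build_query(term) for term in terms if term is not None]
```

Each string in queries can then be passed as the term argument of Entrez.esearch, exactly as the script does.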
Execute it as follows: python ProteinSearch.py -f 1 -e myemail@gmail.com proteinsearch.list
proteinsearch.list contains a single list of your IDs:
Q66LE6
Q9UKV3
...
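
Because terms such as 2ABD or ACINU return more hits than accessions such as Q66LE6, it may help to flag them before searching. The sketch below uses the accession pattern published in the UniProt documentation; the helper names looks_like_accession and split_terms are hypothetical.

```python
import re

# Accession pattern as given in the UniProt documentation, anchored to the whole term
UNIPROT_ACCESSION = re.compile(
    r'^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$'
)


def looks_like_accession(term):
    """Return True if term has the shape of a UniProt accession."""
    return UNIPROT_ACCESSION.match(term) is not None


def split_terms(terms):
    """Separate accession-shaped terms from everything else, preserving order."""
    accessions = [t for t in terms if looks_like_accession(t)]
    others = [t for t in terms if not looks_like_accession(t)]
    return accessions, others


accessions, others = split_terms(['Q66LE6', 'Q9UKV3', '2ABD', 'ACINU'])
```

The check only looks at the shape of the term; it does not confirm that the accession exists.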
Example output:
>NP_060931.2 serine/threonine-protein phosphatase 2A 55 kDa regulatory subunit B delta isoform isoform a [Homo sapiens]
MAGAGGGGCPAGGNDFQWCFSQVKGAIDEDVAEADIISTVEFNYSGDLLATGDKGGRVVIFQREQENKSR
PHSRGEYNVYSTFQSHEPEFDYLKSLEIEEKINKIRWLPQQNAAHFLLSTNDKTIKLWKISERDKRAEGY
NLKDEDGRLRDPFRITALRVPILKPMDLMVEASPRRIFANAHTYHINSISVNSDHETYLSADDLRINLWH
LEITDRSFNIVDIKPANMEELTEVITAAEFHPHQCNVFVYSSSKGTIRLCDMRSSALCDRHSKFFEEPED
PSSRSFFSEIISSISDVKFSHSGRYMMTRDYLSVKVWDLNMESRPVETHQVHEYLRSKLCSLYENDCIFD
KFECCWNGSDSAIMTGSYNNFFRMFDRDTRRDVTLEASRESSKPRASLKPRKVCTGGKRRKDEISVDSLD
FNKKILHTAWHPVDNVIAVAATNNLYIFQDKIN
>NP_001158286.1 apoptotic chromatin condensation inducer in the nucleus isoform 2 [Homo sapiens]
MWRRKHPRTSGGTRGVLSGNRGVEYGSGRGHLGTFEGRWRKLPKMPEAVGTDPSTSRKMAELEEVTLDGK
PLQALRVTDLKAALEQRGLAKSGQKSALVKRLKGALMLENLQKHSTPHAAFQPNSQIGEEMSQNSFIKQY
LEKQQELLRQRLEREAREAAELEEASAESEDEMIHPEGVASLLPPDFQSSLERPELELSRHSPRKSSSIS
EEKGDSDDEKPRKGERRSSRVRQARAAKLSEGSQPAEEEEDQETPSRNLRVRADRNLKTEEEEEEEEEEE
EDDEEEEGDDEGQKSREAPILKEFKEEGEEIPRVKPEEMMDERPKTRSQEQEVLERGGRFTRSQEEARKS
HLARQQQEKEMKTTSPLEEEEREIKSSQGLKEKSKSPSPPRLTEDRKKASLVALPEQTASEEETPPPLLT
KEASSPPPHPQLHSEEEIEPMEGPAPPVLIQLSPPNTDADTRELLVSQHTVQLVGGLSPLSSPSDTKAES
PAEKVPEESVLPLVQKSTLADYSAQKDLEPESDRSAQPLPLKIEELALAKGITEECLKQPSLEQKEGRRA
SHTLLPSHRLKQSADSSSSRSSSSSSSSSRSRSRSPDSSGSRSHSPLRSKQRDVAQARTHANPRGRPKMG
SRSTSESRSRSRSRSRSASSNSRKSLSPGVSRDSSTSYTETKDPSSGQEVATPPVPQLQVCEPKERTSTS
SSSVQARRLSQPESAEKHVTQRLQPERGSPKKCEAEEAEPPAATQPQTSETQTSHLPESERIHHTVEEKE
EVTMDTSENRPENDVPEPPMPIADQVSNDDRPEGSVEDEEKKESSLPKSFKRKISVVSTKGVPAGNSDTE
GGQPGRKRRWGASTATTQKKPSISITTESLKEAVVDLHADDSRISEDETERNGDDGTHDKGLKICRTVTQ
VVPAEGQENGQREEEEEEKEPEAEPPVPPQVSVEVALPPPAEHEVKKVTLGDTLTRRSISQQKSGVSITI
DDPVRTAQVPSPPRGKISNIVHISNLVRPFTLGQLKELLGRTGTLVEEAFWIDKIKSHCFVTYSTVEEAV
ATRTALHGVKWPQSNPKFLCADYAEQDELDYHRGLLVDRPSETKTEEQGIPRPLHPPPPPPVQPPQHPRA
EQREQERAVREQWAEREREMERRERTRSEREWDRDKVREGPRSRSRSRDRRRKERAKSKEKKSEKKEKAQ
EEPPAKLLDDLFRKTKAAPCIYWLPLTDSQIVQKEAERAERAKEREKRRKEQEEEEQKEREKEAERERNR
QLEREKRREHSRERDRERERERERDRGDRDRDRERDRERGRERDRRDTKRHSRSRSRSTPVRDRGGRR
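
If you want to work with the printed records rather than just save them, the FASTA text can be split into header and sequence pairs. This is a minimal pure-Python sketch; parse_fasta is a hypothetical helper, and the two records in the example are made-up placeholders, not real NCBI entries.

```python
def parse_fasta(text):
    """Split FASTA text into a list of (header, sequence) pairs."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith('>'):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, ''.join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, ''.join(chunks)))
    return records


# Placeholder records for illustration only
example = """>NP_000000.0 hypothetical protein [Homo sapiens]
MAGAGG
GCPAGG

>NP_000001.0 another hypothetical protein [Homo sapiens]
MWRRKH
"""
records = parse_fasta(example)
```

Biopython's Bio.SeqIO module can do the same job if you prefer to stay within the library the script already uses.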
Hello Jason, can you give more details? Do you want the protein sequences from PDB, NCBI, ENSEMBL, UNIPROT, ... ? Several databases have APIs that let you extract data using an ID as the entry point. ;)
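
As one illustration of such an API, a UniProt accession can be turned into a URL that returns the record in FASTA format. This is only a sketch: the URL pattern is my assumption about the UniProt REST service and should be checked against its current documentation, and the function names are hypothetical.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

# Assumed URL pattern for the UniProt REST service; verify before relying on it
UNIPROT_FASTA_URL = 'https://rest.uniprot.org/uniprotkb/{}.fasta'


def uniprot_fasta_url(accession):
    """Build the FASTA download URL for a single accession."""
    return UNIPROT_FASTA_URL.format(accession)


def fetch_uniprot_fasta(accession):
    """Download the FASTA text for one accession (needs network access)."""
    response = urlopen(uniprot_fasta_url(accession))
    try:
        return response.read().decode('utf-8')
    finally:
        response.close()


url = uniprot_fasta_url('Q66LE6')
```

Calling fetch_uniprot_fasta('Q66LE6') would perform the actual download; it is not called here so that the example runs offline.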