Hard to automatically download NCBI protein IDs
3 months ago
The ▴ 180

Hello,

I have a few NCBI protein IDs that are really hard to download with Batch Entrez or NCBI "datasets", for example:

MBN1612259.1 MBF9061340.1

Though they seem to exist in good shape:

https://www.ncbi.nlm.nih.gov/protein/MBN1612259.1 https://www.ncbi.nlm.nih.gov/protein/MBF9061340.1

However, when I try to retrieve them with Batch Entrez, I get the following error:

Error Message

Can anyone tell me how to download such sequences automatically?

Thanks in advance

3 months ago
Mensur Dlakic ★ 28k

These commands work:

efetch -db protein -id MBN1612259.1 -format fasta
efetch -db protein -id MBF9061340.1 -format fasta
>MBN1612259.1 MAG: carbamoyl-phosphate synthase large subunit [Polyangiaceae bacterium]
MPRRDDIQKILLIGSGPIVIGQACEFDYSGTQGAKALVGLGYDVVLVNSNPATIMTDPELVRRTYVEPLE
VNTLAAIVARERPDALLPTLGGQTALNLALELHESGVLSQHGVQLIGAQVEAIRKAEDRQLFKDAMAHAG
LECPKSGYARSGEEARDIAVLTGYPLILRPSFTLGGAGGSVVDGPEQLEERVQWALAQSPTREVLVEESV
IGWKEFELEVMRDRADNFIVVCSIENIDPMGVHTGDSITVAPAMTLTDREYQCLRDASRAVMHEIGVETG
GSNVQFAVDPKTGRVLVIEMNPRVSRSSALASKATGYPIAKVAAKLAVGFTLDELENDITGTSAAFEPTI
DYIVVKWPRFAFEKFPGSDPRLGTQMKSVGEVMSIARTFPQALQKAARSLETGKDGLTSLFGRIDYVSMA
AQRTDKRDLALEPPPAARPRPSRAPGLRDEMARALRAIVGTPTAERLFHVADAIRLGVSIDELAQLTGID
SWFLGQIDRIVQHERVLREAPELDRVLLWESKRLGFSDRQIARLRDTDEAAVRLQREQAGIGTVYQRVDT
CAAEFVARTPYLYSSYETDTESEVSERRKVIILGGGPNRIGQGIEFDYCCCHAVFALRELGYETVMVNCN
PETVSTDYDTSDRLYFEPLTLEDVLAVCKEEASRGELVGVIVQFGGQTPLKLAVPLERAGVRLLGTSADA
IDRAEDRERFDALLNKLGLLRPRAGIAKSLDEARWVVGDIGYPVLVRPSYVLGGRAMMICWSDEELDAYV
GLALEAAQDEDSSPTLLIDEFLKHAIEVDVDCIADGHRAVVGGVMQHIEEAGIHSGDSTCVLPPHSLEPS
VVAEIEAQAKALALELGVVGLMNTQFAVKDGRVYVLEVNPRASRTVPFVSKATGRPLAKIAAQLMVGKTL
DELELRDLPQPTHVAVKESVLPFAKFPGVDTILGPEMRSTGEVMGVATSMPLAFSKSLLSAGSKLPARGR
AFISVTDEDKPAACYVAMHLRNVGFTVVATDGTADALTRARIPAVRINKVRQGSPHIVDAIKSGSIQLVI
NTAREASAIRDSYAIRRHAVLGNIPYFTTMSAALAVVESLEAHTLLDSRSVPVRSVQEWHARARRAARPE
>MBF9061340.1 Na(+)-translocating NADH-quinone reductase subunit A, partial [Rhodobacterales bacterium HKCCSP123]
ERPAAILVMGCDTRPLAPHPAEALAGREEALSRGLQALTGLTEGPVFLCDDAVRPLGVTVPGVRIVATVA
RHPQGLAGIRIAALCPAEIDHPVWDLDAEDVADLGDLLATGHLPQRRVVRVAGPALTETRLVTCQIGADT
RGLSYGAIRPGPHVILAGSPVDGRPAHWLGPRDRQVTVMDAAPRAAAPHWFLAALTRSSRPRPLIPSAAV
MQAAGGAFPAMAMLRALGAGDDETALKLGALSLLEEDLALVDYVTGGRPRAAELLRALLDRTAAEAGQ

Thank you very much. For efetch, is there a way to pass a file with multiple IDs as input?

You could use epost for multiple inputs, but for some reason it does not appear to work with the MAG IDs you have. So write a for loop that passes the IDs to efetch one at a time.
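A minimal sketch of such a loop, assuming EDirect's efetch is on your PATH and the IDs sit one per line in a text file. The function name fetch_fasta_list and the FETCH_DELAY variable are made up for illustration. The pause between requests is there because NCBI asks for no more than about 3 requests per second without an API key.

```shell
# Fetch protein FASTA records one accession at a time.
# Assumes EDirect's `efetch` is installed; `fetch_fasta_list` and
# FETCH_DELAY are hypothetical names used only in this sketch.
fetch_fasta_list() {
    local id_file out_file id
    id_file="$1"    # text file with one accession per line
    out_file="$2"   # FASTA output, truncated at the start of each run

    : > "$out_file"
    while IFS= read -r id || [ -n "$id" ]; do
        # Drop stray whitespace and carriage returns, then skip blank lines.
        id=$(printf '%s' "$id" | tr -d '[:space:]')
        [ -z "$id" ] && continue
        # </dev/null stops efetch from swallowing the rest of the ID list.
        if ! efetch -db protein -id "$id" -format fasta < /dev/null >> "$out_file"; then
            echo "failed: $id" >&2
        fi
        # Pause between requests to stay under NCBI's rate limit.
        sleep "${FETCH_DELAY:-0.4}"
    done < "$id_file"
}
```

With the IDs in a file called lst, you would call it as `fetch_fasta_list lst sequences.fasta`. Accessions that fail are reported on stderr so you can retry them later.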

Is there any limit on how much can be downloaded with efetch in one session?

How many IDs do you have? As long as the number is not in the thousands, you should be fine. But if you have a ton of them, consider an alternative method such as extracting them from the nr database.
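For the nr route, here is a rough sketch that assumes you already have a local, preformatted copy of nr (it is very large) and BLAST+ installed. The function name extract_from_blastdb is made up for illustration. blastdbcmd can pull every accession listed in a file in a single call, so no per-ID requests go to NCBI at all. Whether a given MAG accession is present depends on how recent your copy of nr is.

```shell
# Pull many accessions from a local BLAST protein database in one call.
# Assumes BLAST+ (`blastdbcmd`) is installed and the database is local;
# `extract_from_blastdb` is a hypothetical helper name.
extract_from_blastdb() {
    local db id_file out_file
    db="$1"         # database name or path, e.g. nr
    id_file="$2"    # text file with one accession per line
    out_file="$3"   # FASTA output (blastdbcmd's default output format)

    blastdbcmd -db "$db" -dbtype prot -entry_batch "$id_file" -out "$out_file"
}
```

With the same lst file, the call would be `extract_from_blastdb nr lst sequences.fasta`.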

If you make a file called lst that contains all protein IDs, this command will save all FASTA sequences in sequences.fasta.

cat lst | xargs -I {} efetch -db protein -id {} -format fasta >> sequences.fasta

3 months ago
josev.die ▴ 70

If you are familiar with R:

# Dependencies
library(refseqR)

# Get the protein sequences for multiple IDs.
accession <- c("MBN1612259.1", "MBF9061340.1")
my_aa <- refseq_AAseq(accession)

# Write a FASTA file.
Biostrings::writeXStringSet(x = my_aa, filepath = "mypath/aa_result")

Then you can just read the file.
