Download all peptide sequences from NCBI in fasta format?
5.0 years ago
Tom ▴ 50

I want to download all the peptide sequences in the NCBI protein database in FASTA format (i.e. a > followed by the peptide name, then the peptide sequence on the following lines). I saw there is a MeSH term describing what a peptide is here, but I can't work out how to incorporate it.

I wrote this:

import Bio
from Bio import Entrez

Entrez.email = 'test@gmail.com'
handle = Entrez.esearch(db="protein", term="peptide")
record = handle.read()
out_handle = open('myfasta.fasta', 'w')
out_handle.write(record.rstrip('\n'))

But it only writes 995 IDs to the file, not the sequences. Could someone show me where I'm going wrong?

biopython • 1.6k views

genomax appears to have answered. You may also find a couple of my Python scripts of some use for this work that you are doing: https://github.com/kevinblighe/PythonScripts

5.0 years ago
GenoMax 147k

Using Entrez Direct (EDirect) one can do something like this:

$ esearch -db protein -query "peptide" | efetch -format fasta | grep ">" | head -10
>QGT67293.1 RepA leader peptide Tap (plasmid) [Klebsiella pneumoniae]
>QGT67288.1 RepA leader peptide Tap (plasmid) [Klebsiella pneumoniae]
>QGT67085.1 thr operon leader peptide [Klebsiella pneumoniae]
>QGT67083.1 leu operon leader peptide [Klebsiella pneumoniae]
>QGT67062.1 peptide antibiotic transporter SbmA [Klebsiella pneumoniae]
>QGT66988.1 pyrroloquinoline quinone precursor peptide PqqA [Klebsiella pneumoniae]
>QGT66961.1 phenylalanyl--tRNA ligase operon leader peptide [Klebsiella pneumoniae]
>QGT66959.1 peptide chain release factor N(5)-glutamine methyltransferase [Klebsiella pneumoniae]
>QGT66942.1 his operon leader peptide [Klebsiella pneumoniae]
>QGT66735.1 ilv operon leader peptide [Klebsiella pneumoniae]

Remove the grep ">" | head -10 to get the actual sequences.

This may just get you sequences that have the word "peptide" in their <Title> field. I am not sure if that is what you ultimately want.

Using biopython is probably not the right choice of tool here since you are going to get hundreds of thousands of sequences.
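If you do still want to script this in Python, the reason the original script wrote only IDs is that esearch only runs the search; the sequences come from a separate efetch call. Below is a minimal sketch of that two-step pattern using the E-utilities URLs directly with the standard library. The function names, the batch size, the delay, and the `limit` cap are my own assumptions for illustration, not anything NCBI prescribes, and for a result set this large EDirect is likely still the more practical route.

```python
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, email):
    """URL for a search that stores its hits on NCBI's history server."""
    params = {"db": "protein", "term": term, "usehistory": "y",
              "retmax": 0, "email": email}
    return f"{EUTILS}/esearch.fcgi?{urllib.parse.urlencode(params)}"


def efetch_url(webenv, query_key, retstart, retmax, email):
    """URL for one batch of FASTA records from a stored search."""
    params = {"db": "protein", "WebEnv": webenv, "query_key": query_key,
              "rettype": "fasta", "retmode": "text",
              "retstart": retstart, "retmax": retmax, "email": email}
    return f"{EUTILS}/efetch.fcgi?{urllib.parse.urlencode(params)}"


def parse_esearch(xml_text):
    """Return (count, webenv, query_key) from an esearch XML reply."""
    root = ET.fromstring(xml_text)
    return (int(root.findtext("Count")),
            root.findtext("WebEnv"),
            root.findtext("QueryKey"))


def download_fasta(term, out_path, email, batch=500, limit=2000):
    """Search, then fetch up to `limit` records in batches (needs network).

    `batch`, `limit` and the sleep below are illustrative choices.
    """
    with urllib.request.urlopen(esearch_url(term, email)) as resp:
        count, webenv, key = parse_esearch(resp.read().decode())
    total = min(count, limit)
    with open(out_path, "w") as out:
        for start in range(0, total, batch):
            size = min(batch, total - start)
            url = efetch_url(webenv, key, start, size, email)
            with urllib.request.urlopen(url) as resp:
                out.write(resp.read().decode())
            time.sleep(0.4)  # stay under NCBI's request rate limit
    return total
```

Something like `download_fasta("peptide", "myfasta.fasta", "you@example.org")` would then write FASTA records rather than an ID list; the email address is a placeholder.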

You could also download the FASTA file for the nr BLAST database from NCBI and parse out what you need.
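As a rough sketch of what "parse out things you need" could look like, the snippet below streams a FASTA file and keeps records whose header mentions a keyword, optionally with a length cap. The helper names `read_fasta` and `filter_fasta`, the default keyword, and the file name in the usage note are assumptions made for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_fasta(lines, keyword="peptide", max_len=None):
    """Yield records whose header contains `keyword` (case-insensitive).

    If `max_len` is given, also require len(sequence) <= max_len.
    """
    keyword = keyword.lower()
    for header, seq in read_fasta(lines):
        if keyword in header.lower() and (max_len is None or len(seq) <= max_len):
            yield header, seq
```

For a compressed download, one could pass `gzip.open("nr.gz", "rt")` (a hypothetical local file name) as `lines` and write each kept record back out as `>header` followed by the sequence. Because it reads one line at a time, memory use stays small even for a very large file.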


This is fantastic, thank you.


On your earlier point about the number of sequences: is it possible to add a filter so that only FASTA sequences below a maximum length are pulled down? I can see what you mean, some entries just say "peptide" in the header but are full proteins. I only want to make a test set, so pulling out the shorter sequences based on this criterion is fine. Let me know if you think this is a completely separate question.

Update: am trying this:

esearch -db protein -query "peptide AND 1:100[SLEN]" | efetch -format fasta >> ncbi_slen.fasta


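Try this to get peptides that are shorter than 30 AA (the awk test is $2 < 30). The head -15 limits how many records are examined before the length filter, so remove it to get more.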

$ esearch -db protein -query "peptide" | esummary | xtract -pattern DocumentSummary -element Caption,Slen | head -15 | awk -F ' ' '{if ($2 < 30) {print $1}}'| xargs -n 1 sh -c 'efetch -db protein -id $0 -format fasta' 
>QGT67293.1 RepA leader peptide Tap (plasmid) [Klebsiella pneumoniae]
MLRKLQAQFLCHSLLLCNISAGSGD
>QGT67288.1 RepA leader peptide Tap (plasmid) [Klebsiella pneumoniae]
MLRKLQAQFLCHSLLLCNISAGSGD
>QGT67085.1 thr operon leader peptide [Klebsiella pneumoniae]
MNRIGMITTIITTTITTGNGAG
>QGT67083.1 leu operon leader peptide [Klebsiella pneumoniae]
MIRTARITSLLLLNACHLRGRLLGDVQR
>QGT66988.1 pyrroloquinoline quinone precursor peptide PqqA [Klebsiella pneumoniae]
MWKKPAFIDLRLGLEVTLYISNR
>QGT66961.1 phenylalanyl--tRNA ligase operon leader peptide [Klebsiella pneumoniae]
MNAAIFRFFFYFST
>QGT66942.1 his operon leader peptide [Klebsiella pneumoniae]
MNRVQFKHHHHHHHPD
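
The pipeline above calls efetch once per accession. Since efetch accepts a comma-separated list for -id, the same length filter could feed fewer, larger requests. Here is a small Python sketch of the filter-then-batch step, assuming input lines of "accession, whitespace, length" as printed by the xtract stage; the function names and the batch size of 100 are illustrative assumptions.

```python
def short_accessions(lines, max_len=30):
    """Accessions whose length (column 2) is below max_len, like the awk step."""
    keep = []
    for line in lines:
        fields = line.split()
        # Skip blank or malformed lines instead of failing on them.
        if len(fields) >= 2 and fields[1].isdigit() and int(fields[1]) < max_len:
            keep.append(fields[0])
    return keep


def batches(ids, size=100):
    """Comma-joined groups of IDs, one string per efetch call."""
    return [",".join(ids[i:i + size]) for i in range(0, len(ids), size)]


# Illustrative input: two lengths match sequences shown above,
# the third row is a made-up longer protein.
summary = ["QGT67293\t25", "EXAMPLE01\t406", "QGT66961\t14"]
for id_list in batches(short_accessions(summary)):
    print(f"efetch -db protein -id {id_list} -format fasta")
```

Each printed line is one efetch command covering a whole batch rather than a single record.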