Question

Download all peptide sequences from NCBI in fasta format?

5.0 years ago

Tom ▴ 50

I want to download all the peptide sequences in the NCBI protein database in FASTA format (i.e. a > and the peptide name, followed by the peptide sequence). I saw there is a MeSH term describing what a peptide is here, but I can't work out how to incorporate it.

I wrote this:

import Bio
from Bio import Entrez

Entrez.email = 'test@gmail.com'
handle = Entrez.esearch(db="protein", term="peptide")
record = handle.read()
out_handle = open('myfasta.fasta', 'w')
out_handle.write(record.rstrip('\n'))

but it only prints out 995 IDs and writes no sequences to the file. I'm wondering if someone could demonstrate where I'm going wrong.

biopython • 1.6k views

ADD COMMENT • link updated 4.8 years ago by Biostar 20 • written 5.0 years ago by Tom ▴ 50

genomax appears to have answered. You may also find a couple of my Python scripts of some use for this work that you are doing: https://github.com/kevinblighe/PythonScripts

ADD REPLY • link 4.8 years ago by Kevin Blighe 88k

score 2 · Answer 1 · 2019-12-10

5.0 years ago

GenoMax 147k

Using EntrezDirect one can do something like this:

$ esearch -db protein -query "peptide" | efetch -format fasta | grep ">" | head -10
>QGT67293.1 RepA leader peptide Tap (plasmid) [Klebsiella pneumoniae]
>QGT67288.1 RepA leader peptide Tap (plasmid) [Klebsiella pneumoniae]
>QGT67085.1 thr operon leader peptide [Klebsiella pneumoniae]
>QGT67083.1 leu operon leader peptide [Klebsiella pneumoniae]
>QGT67062.1 peptide antibiotic transporter SbmA [Klebsiella pneumoniae]
>QGT66988.1 pyrroloquinoline quinone precursor peptide PqqA [Klebsiella pneumoniae]
>QGT66961.1 phenylalanyl--tRNA ligase operon leader peptide [Klebsiella pneumoniae]
>QGT66959.1 peptide chain release factor N(5)-glutamine methyltransferase [Klebsiella pneumoniae]
>QGT66942.1 his operon leader peptide [Klebsiella pneumoniae]
>QGT66735.1 ilv operon leader peptide [Klebsiella pneumoniae]

Remove the grep ">" | head -10 to get the actual sequences.

This may just get you sequences that have the word peptide in their <Title> field. I am not sure if that is what you ultimately want.

Using biopython is probably not the right choice of tool here since you are going to get hundreds of thousands of sequences.
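
For a small test set, though, the Biopython script in the question is close to working. Entrez.esearch only returns record IDs, so the sequences have to be requested with a second Entrez.efetch call using rettype="fasta". Below is a minimal sketch, assuming Biopython is installed; the names batch_starts and download_fasta, the batch size, and the max_records cap are illustrative choices and not anything NCBI prescribes.

```python
def batch_starts(total, batch_size):
    """Return the retstart offsets needed to page through `total` records."""
    return list(range(0, total, batch_size))


def download_fasta(term, out_path, email, max_records=1000, batch_size=200):
    """Search the protein database and write up to max_records hits as FASTA.

    Returns the number of records requested from NCBI.
    """
    # Imported here so the paging helper above works without Biopython.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address

    # usehistory="y" keeps the result set on the NCBI server, so the
    # sequences can be fetched in batches without resending the ID list.
    handle = Entrez.esearch(db="protein", term=term, usehistory="y")
    search = Entrez.read(handle)
    handle.close()

    total = min(int(search["Count"]), max_records)
    with open(out_path, "w") as out_handle:
        for start in batch_starts(total, batch_size):
            fetch_handle = Entrez.efetch(
                db="protein",
                rettype="fasta",
                retmode="text",
                retstart=start,
                retmax=min(batch_size, total - start),
                webenv=search["WebEnv"],
                query_key=search["QueryKey"],
            )
            out_handle.write(fetch_handle.read())
            fetch_handle.close()
    return total
```

A call such as download_fasta("peptide", "myfasta.fasta", "you@example.org", max_records=500) would then write the first 500 hits to myfasta.fasta. Raising max_records to the full hit count is possible in principle, but with hundreds of thousands of records the EntrezDirect pipeline above is likely to be the more practical route.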

You could also get the fasta file for nr blast database from NCBI and parse out things you need.
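
Parsing a large local FASTA file does not need much code. Here is a minimal sketch using only the Python standard library, assuming an uncompressed FASTA file; the function names, the keyword test on the header, and the 100-residue cutoff are illustrative assumptions.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_fasta(lines, keyword="peptide", max_len=100):
    """Keep records with `keyword` in the header and at most max_len residues."""
    for header, seq in read_fasta(lines):
        if keyword.lower() in header.lower() and len(seq) <= max_len:
            yield header, seq


def write_filtered(in_path, out_path, keyword="peptide", max_len=100):
    """Stream in_path, write matching records to out_path, return the count."""
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in filter_fasta(src, keyword, max_len):
            dst.write(f">{header}\n{seq}\n")
            kept += 1
    return kept
```

Because the file is read line by line and records are yielded one at a time, memory use stays small even for a very large input.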

ADD COMMENT • link 5.0 years ago by GenoMax 147k

This is fantastic, thank you.

ADD REPLY • link 5.0 years ago by Tom ▴ 50

Just on your earlier point about the number of sequences: is it possible to add a filter to only pull down FASTA sequences below a maximum length? I can see what you're saying, some just say peptide in the header but are full proteins. I just want to make a test set, so pulling out the shorter sequences based on this criterion is fine. But let me know if you think this is a completely separate question.

Update: am trying this:

esearch -db protein -query "peptide AND 1:100[SLEN]" | efetch -format fasta >> ncbi_slen.fasta

ADD REPLY • link 5.0 years ago by Tom ▴ 50

Try this to get peptides shorter than 30 AA (the awk test is $2 < 30). Remove head -15 to get more.

$ esearch -db protein -query "peptide" | esummary | xtract -pattern DocumentSummary -element Caption,Slen | head -15 | awk -F ' ' '{if ($2 < 30) {print $1}}'| xargs -n 1 sh -c 'efetch -db protein -id $0 -format fasta' 
>QGT67293.1 RepA leader peptide Tap (plasmid) [Klebsiella pneumoniae]
MLRKLQAQFLCHSLLLCNISAGSGD
>QGT67288.1 RepA leader peptide Tap (plasmid) [Klebsiella pneumoniae]
MLRKLQAQFLCHSLLLCNISAGSGD
>QGT67085.1 thr operon leader peptide [Klebsiella pneumoniae]
MNRIGMITTIITTTITTGNGAG
>QGT67083.1 leu operon leader peptide [Klebsiella pneumoniae]
MIRTARITSLLLLNACHLRGRLLGDVQR
>QGT66988.1 pyrroloquinoline quinone precursor peptide PqqA [Klebsiella pneumoniae]
MWKKPAFIDLRLGLEVTLYISNR
>QGT66961.1 phenylalanyl--tRNA ligase operon leader peptide [Klebsiella pneumoniae]
MNAAIFRFFFYFST
>QGT66942.1 his operon leader peptide [Klebsiella pneumoniae]
MNRVQFKHHHHHHHPD
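
Another option may be to let the search itself restrict the length, using a range on the sequence length field as in the [SLEN] attempt above, instead of fetching summaries and filtering them locally. Below is a small sketch of building such a query string. length_limited_query is a hypothetical helper, and the range is assumed to be inclusive at both ends, so 1:29 is used to match the "shorter than 30" test in the awk step.

```python
def length_limited_query(keyword, min_len=1, max_len=30):
    """Build an Entrez query restricted to a sequence-length range."""
    if min_len < 1 or max_len < min_len:
        raise ValueError("need 1 <= min_len <= max_len")
    # Entrez range syntax is low:high[FIELD]; SLEN is the sequence length
    # field of the protein database.
    return f"{keyword} AND {min_len}:{max_len}[SLEN]"


print(length_limited_query("peptide", 1, 29))
# prints: peptide AND 1:29[SLEN]
```

The resulting string can be passed to esearch -query on the command line or as the term argument of Biopython's Entrez.esearch.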

ADD REPLY • link 5.0 years ago by GenoMax 147k