I would like to extract all proteins, using a list of nucleotide accession numbers as input. For example, considering the list.txt with the following accessions:
Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.
However, for the four accessions in your list, nearly 24000 proteins are returned. Downloading that many proteins using efetch can quickly become a time-consuming process. If you are doing this for entire chromosomes, you may be better off with the following three-step approach:
1. Use efetch with the parameter -format acc to download a list of protein accessions.
2. Download the entire protein datasets for the organisms of interest from the NCBI FTP site.
3. Use a different program, such as seqkit, to extract the specific protein accessions of interest.
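The three steps above could be sketched roughly as follows. This is only an outline under some assumptions: the function names are made up for illustration, Entrez Direct, curl and seqkit are assumed to be installed, and the input file is assumed to hold one nucleotide accession per line.

```shell
# Sketch of the three-step approach; all function names are hypothetical.

# Step 1: list the protein accessions linked to the nucleotide accessions
# in the file given as $1 (one accession per line). Needs Entrez Direct.
list_protein_accs() {
    epost -db nuccore -format acc < "$1" \
        | elink -target protein \
        | efetch -format acc
}

# Step 2: download a whole-assembly protein FASTA (URL given as $1)
# into the current directory, keeping the remote file name.
download_proteins() {
    curl -fsSLO "$1"
}

# Step 3: keep only the FASTA records whose IDs are listed in $1.
# $2 is the (optionally gzipped) protein FASTA, $3 the output file.
extract_proteins() {
    seqkit grep -f "$1" -o "$3" "$2"
}

# Example usage, commented out so that sourcing this file runs nothing:
# list_protein_accs list.txt > protein_accs.txt
# download_proteins https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/295/GCF_000002295.2_MonDom5/GCF_000002295.2_MonDom5_protein.faa.gz
# extract_proteins protein_accs.txt GCF_000002295.2_MonDom5_protein.faa.gz selected.faa
```

The URL in the usage comment is just the example assembly mentioned elsewhere in this thread; substitute the assembly that your accessions actually belong to.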
Thanks for the reply. Your solution works well, but it outputs tons of proteins with no link to the chromosome. I would like to link the extracted proteins to the chromosome. Any solution?
Which solution are you talking about: the one using efetch, or the one where you download from the FTP path?
If you download the entire protein.faa.gz file(s) from FTP, there is another file ending in feature_table.txt.gz in the same path. It should have information about which chromosome each protein is annotated on.
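As a rough illustration of using the feature table, the sketch below pulls out a chromosome-to-protein mapping with awk. It assumes the table is tab-delimited with a header line that contains columns named `genomic_accession` and `product_accession`, and that protein rows have `CDS` in the first column; please check the header of your own file before relying on this. The function name is hypothetical.

```shell
# Hypothetical helper: print "<genomic_accession> <tab> <product_accession>"
# for every CDS row of a feature table. Column positions are looked up from
# the header line by name (assumed names) instead of being hard-coded.
feature_table_to_map() {
    awk -F'\t' '
        NR == 1 {
            sub(/^# /, "", $1)                 # header starts with "# "
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        $1 == "CDS" && $(col["product_accession"]) != "" {
            print $(col["genomic_accession"]) "\t" $(col["product_accession"])
        }
    ' "$@"
}

# Example usage (reads stdin when no file is given):
# gzip -dc GCF_000002295.2_MonDom5_feature_table.txt.gz | feature_table_to_map > chrom_to_protein.tsv
```

The first column of the output can then be filtered against your list of nucleotide accessions to keep only the chromosomes you care about.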
If you want to do this with the esearch/efetch method instead, you'd have to skip the epost step and run the query separately for each accession using a bash loop, as shown below:
for acc in $(cat accs.txt); do
    esearch -db nuccore -query "${acc}" \
    | elink -target protein \
    | efetch -format acc \
    | sed "s/^/${acc}\t/"
done
This will produce a tab-delimited file with <chromosome> <tab> <protein_acc> fields.
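Once that two-column file exists, it can be queried with standard tools. The helpers below are a minimal sketch, with hypothetical function names, assuming the loop output was saved to a tab-delimited file with the nucleotide accession in column 1 and the protein accession in column 2.

```shell
# Hypothetical helper: print the protein accessions recorded for one
# nucleotide accession ($1) in the mapping file ($2).
proteins_on() {
    awk -F'\t' -v chrom="$1" '$1 == chrom { print $2 }' "$2"
}

# Hypothetical helper: print "<nucleotide_acc> <tab> <protein count>"
# for every nucleotide accession in the mapping file ($1), sorted.
count_per_chromosome() {
    awk -F'\t' '{ n[$1]++ } END { for (c in n) print c "\t" n[c] }' "$1" | sort
}
```

The output of `proteins_on` is a plain list of IDs, so it can be saved to a file and used as the pattern file for a tool such as seqkit.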
See Retrieve The Fasta Nucleic Sequences Of A List Of Ncbi Accession Number Of Proteins
If you get the assembly accession number and download the protein.faa, and extract the relevant ones, would it work for you?
For example:
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/295/GCF_000002295.2_MonDom5/
GCF_000002295.2_MonDom5_protein.faa.gz