Trying to retrieve set of protein sequences based on gene IDs
2.4 years ago
JONATHAN • 0

Hello all,

I have a large dataset of NCBI gene IDs, and need to retrieve their corresponding protein sequences. Are there any data mining tools that can easily do this? Thanks in advance.

gene NCBI protein • 2.1k views

Two quick points:

Point 1: Ensembl is pretty good for this as well, and you can also use the UCSC Table Browser. Point 2: keep in mind that a gene ID does not identify a single sequence, because of alternative splicing. You need to specify both a gene and a transcript isoform in order to describe something unique.

2.3 years ago
MirianT_NCBI ▴ 770

Hi,
You can use NCBI Datasets. For example, let's say you have a text file with five NCBI Gene IDs (gene_ids.txt):

672
7157
7124
348
7422

You can use this list as input for datasets and download a gene data package. By default, the package includes the protein sequences, as well as transcript and gene sequences (plus metadata). If you want to restrict it to protein only, you can use this command:

datasets download gene gene-id --inputfile gene_ids.txt --exclude-rna --exclude-gene --filename genes.zip

After you unzip the file (I unzipped it to the folder gene_list), you can find all protein isoforms in the file protein.faa. Here's the folder structure:

gene_list/
|-- README.md
`-- ncbi_dataset
    `-- data
        |-- data_report.jsonl
        |-- data_table.tsv
        |-- dataset_catalog.json
        `-- protein.faa

2 directories, 5 files
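
As a quick sanity check on the downloaded package, the records in protein.faa can be counted and their accessions listed. This is a minimal sketch, assuming a POSIX shell with grep and awk; the function names count_fasta_records and list_fasta_accessions are made up for illustration and are not part of the datasets tool.

```shell
# Count FASTA records in a file (each record starts with a '>' header line).
count_fasta_records() {
    grep -c '^>' "$1"
}

# Print the accession of each record: the first token of the header, without '>'.
list_fasta_accessions() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}
```

For example, after `unzip genes.zip -d gene_list`, running `count_fasta_records gene_list/ncbi_dataset/data/protein.faa` should report how many protein isoforms were returned for the gene list. Expect more records than gene IDs when genes have several isoforms.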

Let me know if you have any questions. :)

2.4 years ago
GenoMax 148k

Using Entrez Direct (EDirect), this is simple to do.

If you have gene IDs that are numeric (sequence truncated for space):

$ esearch -db gene -query 945768 | elink -target protein | efetch -format fasta
>NP_415777.1 tryptophan synthase subunit beta [Escherichia coli str. K-12 substr. MG1655]
MTTLLNPYFGEFGGMYVPQILMPALRQLEEAFVSAQKDPEFQAQFNDLLKNYAGRPTALTKCQNITAGTN
>AAC74343.1 tryptophan synthase subunit beta [Escherichia coli str. K-12 substr. MG1655]
MTTLLNPYFGEFGGMYVPQILMPALRQLEEAFVSAQKDPEFQAQFNDLLKNYAGRPTALTKCQNITAGTN
TTLYLKREDLLHGGAHKTNQVLGQALLAKRMGKTEIIAETGAGQHGVASALASALLGLKCRIYMGAKDVE

In case you have accession numbers of proteins (put one ID per line in a file):

$ more id
ABA43103.1
ABA43104.1
ABA43105.1

$ epost -db protein -input id | efetch -format fasta
>ABA43105.1 nonstructural protein, partial [Norovirus Hu/GI/N9/2003/Irl]
DRNLLPEFVNDDGV
>ABA43104.1 TrpB, partial [Kitasatospora aureofaciens]
NNVLGQALLTRRMGKTRIIAETGAGQHGVATATACALFGFDCTIYMGEVDTERQALNVARMRMLGAEVIA
VKSGSRTLKDAINEAFRDWVANVDSTHYLFGTVAGPHPFPMMVRDFHRIIGVEARQQVLDRTGRLPDAVV
ACVGGGSNAIG
>ABA43103.1 TrpB, partial [Streptomyces lydicus]
NNVLGKALLTKRMGKTRVIAETGAGQHGVATATACALFGLECTIYMGEIDTQRQALNVARMRMLGAEVIA
VKSGSRTLKDAINEAFRDWVANVDRTHYLFGTVAGPHPFPALVRDFHRVIGVEARRQLLERAGRLPDAAL
ACVGGGSNAIG
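
For a whole file of gene IDs, the first command can be wrapped in a loop. The sketch below is one possible way to do it, assuming EDirect is installed and that the file holds one numeric gene ID per line; the function names clean_id_list and fetch_proteins_for_gene_ids are made up for illustration. It strips stray whitespace, blank lines and duplicate IDs first, since those are easy to introduce when building the list by hand.

```shell
# Normalize an ID list: drop carriage returns, trim spaces and tabs,
# skip blank lines, and keep only the first occurrence of each ID.
clean_id_list() {
    tr -d '\r' < "$1" | awk '{ gsub(/^[ \t]+|[ \t]+$/, "") } NF && !seen[$0]++'
}

# For each gene ID in the file, print the linked protein records as FASTA.
# esearch reads from /dev/null so it cannot consume the loop's input.
fetch_proteins_for_gene_ids() {
    clean_id_list "$1" | while read -r gene_id; do
        esearch -db gene -query "$gene_id" </dev/null |
            elink -target protein |
            efetch -format fasta
    done
}
```

A call such as `fetch_proteins_for_gene_ids gene_ids.txt > proteins.fasta` would then collect everything in one file. The loop queries one gene at a time, so it will be slow for very large lists.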
2.1 years ago
josev.die ▴ 70

For R users:

# Dependencies
library(rentrez)

Define some functions:

get_protids_from_geneids <- function(gene_ids) {
  # Link the gene IDs to the protein database.
  protein_elink <- rentrez::entrez_link(dbfrom = "gene", id = gene_ids, db = "protein")

  # Return the linked protein IDs (RefSeq subset).
  protein_elink$links$gene_protein_refseq
}

make_aa_fasta <- function(prot_ids, nameFile) {
  # Fetch each protein and append it to the multi-FASTA file "<nameFile>.fasta".
  sapply(prot_ids, function(x) {
    protein_esummary <- rentrez::entrez_summary(db = "protein", id = x)
    protein_fasta <- rentrez::entrez_fetch(db = "protein", id = protein_esummary$uid, rettype = "fasta")
    write(protein_fasta, file = paste(nameFile, ".fasta", sep = ""), append = TRUE)
  })
}

Following MirianT_NCBI's example, load the NCBI Gene IDs into a vector:

# Define a vector with gene ids
gene_ids <- c('672', '7157', '7124', '348', '7422')

# Get the protein ids 
prot_ids <- get_protids_from_geneids(gene_ids)
#length(prot_ids)

# Make the amino acid fasta file 
make_aa_fasta(prot_ids, "my_proteins")

Keep length(prot_ids) < 450 and it will work.
