Hello all,
I have a large dataset of NCBI gene IDs, and need to retrieve their corresponding protein sequences. Are there any data mining tools that can easily do this? Thanks in advance.
Hi,
You can use NCBI Datasets. For example, let's say you have a text file with five NCBI Gene IDs (gene_ids.txt):
672
7157
7124
348
7422
You can use this list as input for datasets and download a gene data package. By default, the package includes the protein sequences, as well as transcript and gene sequences (plus metadata). If you want to restrict it to protein only, you can use this command:
datasets download gene gene-id --inputfile gene_ids.txt --exclude-rna --exclude-gene --filename genes.zip
After you unzip the file (I unzipped it to the folder gene_list), you can find all protein isoforms in the file protein.faa. Here's the folder structure:
gene_list/
|-- README.md
`-- ncbi_dataset
    `-- data
        |-- data_report.jsonl
        |-- data_table.tsv
        |-- dataset_catalog.json
        `-- protein.faa
2 directories, 5 files
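As a quick sanity check on the downloaded package, you could count the records in protein.faa and list their accessions. This is only a minimal sketch: the function names are made up for illustration, and it assumes a plain FASTA file where every record starts with a ">" header line.

```shell
#!/bin/sh
# Print the number of FASTA records (header lines starting with ">") in a file.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Print the accession of each record, i.e. the first
# whitespace-delimited token on the header line, without the ">".
list_fasta_accessions() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}

# Example usage, assuming the package was unzipped into gene_list/:
# count_fasta_records gene_list/ncbi_dataset/data/protein.faa
# list_fasta_accessions gene_list/ncbi_dataset/data/protein.faa
```

Since the file holds all isoforms, the record count will usually be larger than the number of gene IDs you started with.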
Let me know if you have any questions. :)
Using EntrezDirect, this is simple to do:
If you have numeric gene IDs (sequences truncated for space):
$ esearch -db gene -query 945768 | elink -target protein | efetch -format fasta
>NP_415777.1 tryptophan synthase subunit beta [Escherichia coli str. K-12 substr. MG1655]
MTTLLNPYFGEFGGMYVPQILMPALRQLEEAFVSAQKDPEFQAQFNDLLKNYAGRPTALTKCQNITAGTN
>AAC74343.1 tryptophan synthase subunit beta [Escherichia coli str. K-12 substr. MG1655]
MTTLLNPYFGEFGGMYVPQILMPALRQLEEAFVSAQKDPEFQAQFNDLLKNYAGRPTALTKCQNITAGTN
TTLYLKREDLLHGGAHKTNQVLGQALLAKRMGKTEIIAETGAGQHGVASALASALLGLKCRIYMGAKDVE
In case you have accession numbers of proteins (put one ID per line in a file):
$ more id
ABA43103.1
ABA43104.1
ABA43105.1
$ epost -db protein -input id | efetch -format fasta
>ABA43105.1 nonstructural protein, partial [Norovirus Hu/GI/N9/2003/Irl]
DRNLLPEFVNDDGV
>ABA43104.1 TrpB, partial [Kitasatospora aureofaciens]
NNVLGQALLTRRMGKTRIIAETGAGQHGVATATACALFGFDCTIYMGEVDTERQALNVARMRMLGAEVIA
VKSGSRTLKDAINEAFRDWVANVDSTHYLFGTVAGPHPFPMMVRDFHRIIGVEARQQVLDRTGRLPDAVV
ACVGGGSNAIG
>ABA43103.1 TrpB, partial [Streptomyces lydicus]
NNVLGKALLTKRMGKTRVIAETGAGQHGVATATACALFGLECTIYMGEIDTQRQALNVARMRMLGAEVIA
VKSGSRTLKDAINEAFRDWVANVDRTHYLFGTVAGPHPFPALVRDFHRVIGVEARRQLLERAGRLPDAAL
ACVGGGSNAIG
For R users:
#Dependencies
library(rentrez)
Define some functions:
get_protids_from_geneids <- function(gene_ids) {
  # Link the gene IDs to records in the protein database.
  protein_elink <- rentrez::entrez_link(dbfrom = "gene", id = gene_ids, db = "protein")
  # Keep only the RefSeq protein IDs.
  protein_ids <- protein_elink$links$gene_protein_refseq
  protein_ids
}
make_aa_fasta <- function(prot_ids, nameFile) {
  # Build a multi-FASTA file by fetching each protein ID in turn.
  # Records are appended, so delete any existing file before re-running.
  sapply(prot_ids, function(x) {
    protein_esummary <- rentrez::entrez_summary(db = "protein", id = x)
    protein_fasta <- rentrez::entrez_fetch(db = "protein", id = protein_esummary$uid, rettype = "fasta")
    # Append the amino acid sequence to "<nameFile>.fasta".
    write(protein_fasta, file = paste(nameFile, ".fasta", sep = ""), append = TRUE)
  })
}
Following the MirianT_NCBI example, load the NCBI Gene IDs into a vector:
# Define a vector with gene ids
gene_ids <- c('672', '7157', '7124', '348', '7422')
# Get the protein ids
prot_ids <- get_protids_from_geneids(gene_ids)
#length(prot_ids)
# Make the amino acid fasta file
make_aa_fasta(prot_ids, "my_proteins")
Keep length(prot_ids) < 450 and it will work.
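For a larger dataset, one way to stay under a limit like this is to work through the IDs in batches. Here is a minimal shell sketch of the idea for the EntrezDirect route; the function name, file names, and the batch size of 400 are assumptions for illustration, and I have not checked whether the same 450 limit applies outside of rentrez.

```shell
#!/bin/sh
# Split a file of IDs (one per line) into batch files of at most N lines.
# Output files are named OUT_PREFIXaa, OUT_PREFIXab, and so on.
# Usage: split_ids IDS_FILE BATCH_SIZE OUT_PREFIX
split_ids() {
    split -l "$2" "$1" "$3"
}

# Hypothetical usage with EntrezDirect, one request per batch
# (protein_ids.txt and my_proteins.fasta are placeholder names):
# split_ids protein_ids.txt 400 batch_
# for f in batch_*; do
#     epost -db protein -input "$f" | efetch -format fasta >> my_proteins.fasta
#     sleep 1   # short pause between requests
# done
```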
Two quick points:
Point 1: Ensembl is pretty good for this as well, and you can also use the UCSC Table Browser. Point 2: keep in mind that a gene ID alone does not identify a single protein, because alternative splicing can produce several isoforms from one gene. To describe something unique, you need to specify both a gene and a transcript isoform.
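If you need exactly one protein per gene, one common (and fairly arbitrary) convention is to keep the longest isoform. The sketch below shows that with awk. It assumes each FASTA header carries a tag of the form [GeneID=1234], which is how I remember the headers in the NCBI Datasets protein.faa file; check your own file first. The function name is hypothetical.

```shell
#!/bin/sh
# Keep only the longest protein per gene from a multi-FASTA file.
# ASSUMPTION: each header contains a tag like [GeneID=1234];
# records without such a tag are skipped.
# Sequences are written unwrapped (one line per record), and the
# order of genes in the output is not guaranteed.
longest_isoform_per_gene() {
    awk '
        # Store the current record if it is the longest seen for its gene.
        function keep_best() {
            if (gene != "" && length(seq) > best_len[gene]) {
                best_len[gene] = length(seq)
                best_rec[gene] = header "\n" seq
            }
        }
        /^>/ {
            keep_best()
            header = $0; seq = ""; gene = ""
            # Extract the digits between "[GeneID=" and "]".
            if (match($0, /\[GeneID=[0-9]+\]/))
                gene = substr($0, RSTART + 8, RLENGTH - 9)
            next
        }
        { seq = seq $0 }
        END {
            keep_best()
            for (g in best_rec) print best_rec[g]
        }
    ' "$1"
}

# Example usage (paths are placeholders):
# longest_isoform_per_gene gene_list/ncbi_dataset/data/protein.faa > longest.faa
```

Longest is not the same as most biologically relevant, so if a specific isoform matters for your analysis, select it by accession instead.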