The NCBI database is installed on a server at our institute. I'd like to retrieve an mRNA sequence from a gene name passed as an argument on the command line. What is the right command line? Thanks
By "ncbi libraries" are you referring to pre-formatted blast databases?
That said, you are better off using the datasets method mentioned below or the EntrezDirect answer here.
Using EntrezDirect:
$ esearch -db gene -query "ADAMTS13 [GENE] AND Homo sapiens [ORGN] AND ALIVE [PROP]" | elink -db gene -target nuccore -name gene_nuccore_refseqrna | efetch -format fasta_cds_na
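Since the question asks for the gene name to be passed as an argument, the pipeline above can be wrapped in a small shell function. This is only a sketch: the function names `build_gene_query` and `fetch_transcripts` are made up for illustration, and it assumes EntrezDirect is on your PATH and the machine has network access. Note that `fasta_cds_na` returns the coding sequences only; passing `fasta` as the third argument should return the full transcript records instead.

```shell
# Build the Entrez Gene query for a gene symbol and an organism (default: human).
build_gene_query() {
    local symbol="$1"
    local organism="${2:-Homo sapiens}"
    printf '%s [GENE] AND %s [ORGN] AND ALIVE [PROP]\n' "$symbol" "$organism"
}

# Run the same esearch | elink | efetch pipeline as above for any gene symbol.
# $1 = gene symbol, $2 = organism (optional), $3 = efetch format (optional).
fetch_transcripts() {
    local symbol="$1"
    local organism="${2:-Homo sapiens}"
    local format="${3:-fasta_cds_na}"
    esearch -db gene -query "$(build_gene_query "$symbol" "$organism")" |
        elink -db gene -target nuccore -name gene_nuccore_refseqrna |
        efetch -format "$format"
}

# Example (needs network access, so it is left commented out):
# fetch_transcripts ADAMTS13 "Homo sapiens" fasta > ADAMTS13_mrna.fa
```

Keeping the query construction in its own function makes it easy to check what will be sent to NCBI before running the network calls.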
Hi Francois,
I believe you can do this using NCBI Datasets. NCBI Datasets allows you to search for RefSeq genes by symbol, NCBI gene-id, taxon and accession and choose which files to download. Here's the link to the download instructions: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/
In your example, you want to download the mRNA sequence for the gene ADAMTS13. Using datasets, you can run the following commands:
datasets download gene symbol ADAMTS13 --include rna --filename adamts13.zip
unzip adamts13.zip -d adamts13
Archive: adamts13.zip
inflating: adamts13/README.md
inflating: adamts13/ncbi_dataset/data/rna.fna
inflating: adamts13/ncbi_dataset/data/data_report.jsonl
inflating: adamts13/ncbi_dataset/data/dataset_catalog.json
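The rna.fna file typically holds several transcript variants of the gene, so you may want to list them and pull out a single one. Here is a minimal sketch using grep and awk. The function names are hypothetical, and it assumes only that each FASTA header starts with the transcript accession as its first token.

```shell
# List the FASTA headers (one per transcript) in a file such as rna.fna.
list_transcripts() {
    grep '^>' "$1" | cut -c2-
}

# Print the single FASTA record whose accession matches $2.
# $2 may be given with or without the version suffix (NM_139026 or NM_139026.6).
extract_transcript() {
    awk -v acc="$2" '
        /^>/ {
            id = substr($1, 2)            # first header token without ">"
            base = id
            sub(/\.[0-9]+$/, "", base)    # accession without ".version"
            keep = (id == acc || base == acc)
        }
        keep                              # print lines of the matching record
    ' "$1"
}

# Example:
# list_transcripts adamts13/ncbi_dataset/data/rna.fna
# extract_transcript adamts13/ncbi_dataset/data/rna.fna NM_139026 > NM_139026.fa
```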
By default, datasets retrieves gene symbols for human. If you want to retrieve a gene symbol for another taxon, you can use the --taxon flag and provide a species-level taxonomic name.
Other options available with the --include flag are:
--include string(,string) Specify the data files to include (comma-separated).
* gene: gene sequence
* rna: transcript
* protein: amino acid sequences
* cds: nucleotide coding sequences
* 5p-utr: 5'-UTR
* 3p-utr: 3'-UTR
* product-report: gene transcript and protein locations and metadata
* none: do not retrieve any sequence files
I hope this helps. Please let me know if you have any other questions.
MirianT_NCBI the original request is to retrieve the information from local data. I assume OP probably refers to pre-formatted blast databases.
Thanks very much MirianT_NCBI, but as GenoMax rightly pointed out, NCBI datasets are already installed on my Institute's server.
What I'd like to do is retrieve the ADAMTS13 mRNA sequence from one of the directories on the server, but I don't know which one to choose or the right command to retrieve it. The directories are:
accession2taxid
dbest
genbank_viral_nr
gene
taxdump
univec
blast
genbank
genbank_viral_nt
genomes
unigene
These appear to be pre-formatted blast databases. You could use blastdbcmd, which is part of the blast+ package, to retrieve the sequence, but you would need an accession number for it to work. If you only have a handful of genes, this should not be difficult. (The sequence in the output below is truncated.)
$ blastdbcmd -db nt -entry NM_139026 -outfmt %f
>NM_139026.6 Homo sapiens ADAM metallopeptidase with thrombospondin type 1 motif 13 (ADAMTS13), transcript variant 3, mRNA
ATTCCATACTGACCAGATTCCCAGTCACCAAGGCCCCCTCTCACTCCGCTCCACTCCTCGGGCTGGCTCTCCTGAGGATG
CACCAGCGTCACCCCCGGGCAAGATGCCCTCCCCTCTGTGTGGCCGGAATCCTTGCCTGTGGCTTTCTCCTGGGCTGCT
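If you only have the gene name, one way to get the accessions locally is NCBI's gene2refseq table, which maps gene symbols to RefSeq accessions. The sketch below rests on assumptions: that the gene directory on the server mirrors NCBI's gene/DATA FTP area and contains gene2refseq.gz (the path in the example is hypothetical), that the file has the standard tab-separated layout with tax_id in column 1, the RNA accession in column 4 and the symbol in column 16, and that the function name is made up. It keeps curated mRNAs (NM_) only and strips the version so that an older local nt database still matches.

```shell
# Print the RefSeq mRNA accessions (NM_*, version stripped) for a gene symbol.
# $1 = gene2refseq or gene2refseq.gz, $2 = gene symbol,
# $3 = NCBI taxonomy ID (optional, default 9606 = human).
symbol_to_mrna_accessions() {
    local file="$1"
    local symbol="$2"
    local taxid="${3:-9606}"
    case "$file" in
        *.gz) gzip -dc "$file" ;;
        *)    cat "$file" ;;
    esac |
    awk -F '\t' -v sym="$symbol" -v tax="$taxid" '
        $1 == tax && $16 == sym && $4 ~ /^NM_/ {
            acc = $4
            sub(/\.[0-9]+$/, "", acc)   # drop ".version"
            print acc
        }
    ' | sort -u
}

# Example (paths are assumptions about the local install):
# symbol_to_mrna_accessions /bank/ncbi/gene/DATA/gene2refseq.gz ADAMTS13 > ids.txt
# blastdbcmd -db nt -entry_batch ids.txt -outfmt %f > ADAMTS13_mrna.fa
```

blastdbcmd needs to be able to find the nt database, either through the BLASTDB environment variable or by giving the full path to -db.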
Provided that you have your accession numbers, here is a possible translation of GenoMax's answer to R:
#Dependencies
library(Biostrings)
library(rentrez)
ids = c("NM_139026.6", "NM_139027.6")
# Define a function to extract the FASTA sequence (header line dropped)
get_sequence <- function(id) {
  target <- rentrez::entrez_fetch(db = "nuccore", id = id, rettype = "fasta")
  target_tidy <- strsplit(target, "\n")[[1]]
  # Line 1 is the FASTA header; join the remaining lines into one string
  paste0(target_tidy[-1], collapse = "")
}
# Run the function over your ids
my_set <- sapply(ids, get_sequence)
#Result
DNAStringSet(my_set)
DNAStringSet object of length 2:
width seq names
[1] 4306 ATTCCATACTGACCAGATTCCCAGTCACCAAGGCCCCCTCTCACTCCGCTCCACTCCTCGGGCTG...GTGGGGACTCTGGAAAAGCAGCCCCCATTTCCTCGGGTACCAATAAATAAAACATGCAGGCTGA NM_139026.6
[2] 4399 ATTCCATACTGACCAGATTCCCAGTCACCAAGGCCCCCTCTCACTCCGCTCCACTCCTCGGGCTG...GTGGGGACTCTGGAAAAGCAGCCCCCATTTCCTCGGGTACCAATAAATAAAACATGCAGGCTGA NM_139027.6
Please provide an example of a gene and the mRNA sequence that you'd like to obtain.
For example, I need to obtain the mRNA sequence of the ADAMTS13 gene. The NCBI libraries on our servers are located under a /bank/ncbi folder, which contains, among others, a genbank folder.