Question

How to retrieve an mRNA sequence from NCBI?

0

2.2 years ago

Francois Piumi ▴ 70

The NCBI database is installed on a server in our institute. I'd like to retrieve an mRNA sequence from a gene name passed as an argument on the command line. What is the right command? Thanks

NCBI mRNA • 2.5k views

ADD COMMENT • link updated 2.2 years ago by josev.die ▴ 70 • written 2.2 years ago by Francois Piumi ▴ 70

0

Please provide an example of a gene and the mRNA sequence that you'd like to obtain.

ADD REPLY • link 2.2 years ago by josev.die ▴ 70

0

For example, I need to obtain the mRNA sequence of the ADAMTS13 gene. The NCBI libraries on our servers are located under a /bank/ncbi folder which contains, among others, a genbank folder.

ADD REPLY • link 2.2 years ago by Francois Piumi ▴ 70

score 1 · Answer 1 · 2023-02-21

1

2.2 years ago

GenoMax 151k

By "ncbi libraries" are you referring to pre-formatted blast databases?

That said, you are better off using the datasets method mentioned below or the EntrezDirect answer here.

Using EntrezDirect:

$ esearch -db gene -query "ADAMTS13 [GENE] AND Homo sapiens [ORGN] AND ALIVE [PROP]" |  elink -db gene -target nuccore -name gene_nuccore_refseqrna | efetch -format fasta_cds_na
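Note that `fasta_cds_na` returns the coding sequences annotated on each record rather than the whole transcript. If the full mRNA including UTRs is wanted, a variant along these lines might work. This is only a sketch: `fetch_refseq_mrna` is a hypothetical wrapper name, and it assumes the EntrezDirect tools are installed and on the PATH.

```shell
# Hypothetical helper: fetch RefSeq mRNA records (UTRs included) for a gene symbol.
# Assumes the EntrezDirect tools esearch, elink and efetch are on PATH.
fetch_refseq_mrna() {
    gene="$1"
    organism="${2:-Homo sapiens}"   # defaults to human when no organism is given
    esearch -db gene -query "${gene} [GENE] AND ${organism} [ORGN] AND ALIVE [PROP]" |
        elink -db gene -target nuccore -name gene_nuccore_refseqrna |
        efetch -format fasta
}

# Usage: fetch_refseq_mrna ADAMTS13 "Homo sapiens" > adamts13_mrna.fasta
```

The only change from the command above is the final `-format fasta`, which asks efetch for the complete nucleotide record.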

ADD COMMENT • link 2.2 years ago by GenoMax 151k

0

Thanks GenoMax, I tried this command locally (not on our cluster) and it also works very well!

ADD REPLY • link 2.2 years ago by Francois Piumi ▴ 70

score 0 · Answer 2 · 2023-02-21

0

2.2 years ago

MirianT_NCBI ▴ 790

Hi Francois,

I believe you can do this using NCBI Datasets. NCBI Datasets allows you to search for RefSeq genes by symbol, NCBI gene-id, taxon and accession and choose which files to download. Here's the link to the download instructions: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/

In your example, you want to download the mRNA sequence for the gene ADAMTS13. Using datasets, you can run these commands:

datasets download gene symbol ADAMTS13 --include rna --filename adamts13.zip

unzip adamts13.zip -d adamts13     

Archive:  adamts13.zip
  inflating: adamts13/README.md      
  inflating: adamts13/ncbi_dataset/data/rna.fna  
  inflating: adamts13/ncbi_dataset/data/data_report.jsonl  
  inflating: adamts13/ncbi_dataset/data/dataset_catalog.json
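The extracted rna.fna may hold several transcript variants of the gene. To pull out a single one, a small awk filter could be used. This is a minimal sketch: `extract_transcript` is a hypothetical helper name, and it assumes the accession is the first whitespace-delimited token of each FASTA header.

```shell
# Hypothetical helper: print the FASTA record whose header starts with the given accession.
# Assumes the accession is the first whitespace-delimited token after ">".
extract_transcript() {
    accession="$1"
    fasta="$2"
    awk -v acc="$accession" '
        /^>/ { keep = (substr($1, 2) == acc) }   # header line: decide whether to keep this record
        keep                                     # print header and sequence lines of the kept record
    ' "$fasta"
}

# Usage: extract_transcript NM_139026.6 adamts13/ncbi_dataset/data/rna.fna
```

The match is on the full versioned accession, so `NM_139026` without the `.6` suffix would not match under this assumption.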

By default, datasets retrieves gene symbols for human. If you want to retrieve gene information for another taxon, you can use the --taxon flag and provide a species-level taxonomic name.

Other options available with the --include flag are:

      --include string(,string)   Specify the data files to include (comma-separated).
                                    * gene:           gene sequence
                                    * rna:            transcript
                                    * protein:        amino acid sequences
                                    * cds:            nucleotide coding sequences
                                    * 5p-utr:         5'-UTR
                                    * 3p-utr:         3'-UTR
                                    * product-report: gene transcript and protein locations and metadata
                                    * none:           do not retrieve any sequence files

I hope this helps. Please let me know if you have any other questions.

ADD COMMENT • link 2.2 years ago by MirianT_NCBI ▴ 790

0

MirianT_NCBI, the original request is to retrieve the information from local data. I assume the OP is probably referring to pre-formatted BLAST databases.

ADD REPLY • link 2.2 years ago by GenoMax 151k

0

Thanks very much MirianT_NCBI, but as GenoMax rightly pointed out, NCBI datasets are already installed on my Institute's server.

What I'd like to do is retrieve the ADAMTS13 mRNA sequence from one directory on the server, but I don't know which one to choose or the right command to retrieve it. The list of directories is:

accession2taxid
dbest
genbank_viral_nr
gene
taxdump
univec
blast
genbank
genbank_viral_nt
genomes
unigene

ADD REPLY • link 2.2 years ago by Francois Piumi ▴ 70

1

These appear to be pre-formatted BLAST databases. You could use blastdbcmd, which is part of the BLAST+ package, to retrieve the sequence, but you would need an accession number for it to work. If you only have two or three genes, this should not be difficult. (The sequence below is truncated.)

$ blastdbcmd -db nt -entry NM_139026 -outfmt %f
>NM_139026.6 Homo sapiens ADAM metallopeptidase with thrombospondin type 1 motif 13 (ADAMTS13), transcript variant 3, mRNA
ATTCCATACTGACCAGATTCCCAGTCACCAAGGCCCCCTCTCACTCCGCTCCACTCCTCGGGCTGGCTCTCCTGAGGATG
CACCAGCGTCACCCCCGGGCAAGATGCCCTCCCCTCTGTGTGGCCGGAATCCTTGCCTGTGGCTTTCTCCTGGGCTGCT
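
For more than a handful of accessions, blastdbcmd can also read them from a file through its `-entry_batch` option. A minimal sketch, where `fetch_accessions` and the file names are hypothetical, and which assumes blastdbcmd is on the PATH and the database can be located (for example through the BLASTDB environment variable):

```shell
# Hypothetical helper: fetch every accession listed in a file (one per line)
# from a local BLAST database and print the records as FASTA.
# Assumes blastdbcmd (BLAST+) is on PATH and the database can be found.
fetch_accessions() {
    db="$1"
    accession_file="$2"
    blastdbcmd -db "$db" -entry_batch "$accession_file" -outfmt %f
}

# Usage:
#   printf 'NM_139026\nNM_139027\n' > accessions.txt
#   fetch_accessions nt accessions.txt > adamts13_mrna.fasta
```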

ADD REPLY • link 2.2 years ago by GenoMax 151k

0

Works perfectly well! Thanks very much GenoMax !!

ADD REPLY • link 2.2 years ago by Francois Piumi ▴ 70

0

Got it. Sorry, I misunderstood your question.

ADD REPLY • link 2.2 years ago by MirianT_NCBI ▴ 790

0

No worries MirianT_NCBI, thanks very much for your help!

ADD REPLY • link 2.2 years ago by Francois Piumi ▴ 70

score 0 · Answer 3 · 2023-02-24

Provided that you have your accession numbers, here is a possible translation of GenoMax's answer to R:

# Dependencies
library(Biostrings)
library(rentrez)

ids <- c("NM_139026.6", "NM_139027.6")

# Define a function that fetches a FASTA record and returns the bare sequence
get_sequence <- function(id) {
  target <- rentrez::entrez_fetch(db = "nuccore", id = id, rettype = "fasta")
  target_tidy <- strsplit(target, "\n")[[1]]
  # Drop the header line (first element) and join the sequence lines
  paste0(target_tidy[-1], collapse = "")
}

# Run the function over your ids
my_set <- sapply(ids, get_sequence)

# Result
DNAStringSet(my_set)

DNAStringSet object of length 2:
width seq                                                                                                                                  names               
[1]  4306 ATTCCATACTGACCAGATTCCCAGTCACCAAGGCCCCCTCTCACTCCGCTCCACTCCTCGGGCTG...GTGGGGACTCTGGAAAAGCAGCCCCCATTTCCTCGGGTACCAATAAATAAAACATGCAGGCTGA NM_139026.6
[2]  4399 ATTCCATACTGACCAGATTCCCAGTCACCAAGGCCCCCTCTCACTCCGCTCCACTCCTCGGGCTG...GTGGGGACTCTGGAAAAGCAGCCCCCATTTCCTCGGGTACCAATAAATAAAACATGCAGGCTGA NM_139027.6