Question

NCBI API, Perl

0

18 months ago

alessandro.alma00 • 0

Hi! I want to use the NCBI API to retrieve the FASTA sequences for a list of lncRNAs that I'm interested in. Has anyone done this before? Is it possible? Please help me.

API NCBI • 1.3k views

ADD COMMENT • link updated 18 months ago by GenoMax 147k • written 18 months ago by alessandro.alma00 • 0

score 1 · Answer 1 · 2023-06-05

1

18 months ago

GenoMax 147k

You are likely best served by the NCBI Datasets (LINK) or EntrezDirect (LINK) command-line tools.

If you want to avoid programming altogether, use the web interface for Datasets.

ADD COMMENT • link 18 months ago by GenoMax 147k

0

That doesn't help me much. Do you know whether what I want is possible?

ADD REPLY • link 18 months ago by alessandro.alma00 • 0

1

What does not help you? You can use both of the tools above to retrieve sequences from the command line. Provide a few example accessions and I can show you how to download sequences with these tools.

ADD REPLY • link 18 months ago by GenoMax 147k

0

Sorry, I didn't mean to be rude. Thank you for helping me. To be honest, this is the first time I have used an API. I would like to retrieve the sequences of, e.g.: LOC105370256 LOC105377806 LOC107984590 LOC107984591

I don't know whether these can be used as accession names, but I would like to retrieve the sequences for a list of lncRNAs like this one. Thank you for your help, I appreciate it a lot!

ADD REPLY • link 18 months ago by alessandro.alma00 • 0

2

For completeness, using the web interface of Datasets: https://www.ncbi.nlm.nih.gov/datasets/gene/

ADD REPLY • link 18 months ago by GenoMax 147k

1

Using EntrezDirect:

$ esearch -db nuccore -query LOC105370256 | efetch -format acc | grep -v NC | xargs -n 1 sh -c 'efetch -db nuccore -id "$0" -format fasta'

will get you the following variants (only the FASTA header lines are shown here). I am removing genomic/chromosome records so that you get just the ncRNA.

>XR_007083934.1 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X13, ncRNA
>XR_007083933.1 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X12, ncRNA
>XR_007083932.1 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X11, ncRNA
>XR_007083931.1 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X10, ncRNA
>XR_007083930.1 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X9, ncRNA
>XR_007083929.1 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X8, ncRNA
>XR_007083928.1 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X7, ncRNA
>XR_007083927.1 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X14, ncRNA
>XR_007083926.1 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X5, ncRNA
>XR_007083925.1 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X4, ncRNA
>XR_007083924.1 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X3, ncRNA
>XR_007083923.1 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X2, ncRNA
>XR_007083922.1 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X1, ncRNA
>XR_942077.2 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X12, ncRNA
>XR_001749903.2 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X11, ncRNA
>XR_001749901.2 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X10, ncRNA
>XR_942075.3 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X8, ncRNA
>XR_001749900.2 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X7, ncRNA
>XR_942071.3 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X6, ncRNA
>XR_942068.3 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X5, ncRNA
>XR_942072.3 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X1, ncRNA
>XR_001749904.1 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X13, ncRNA
>XR_001749902.1 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X9, ncRNA
>XR_942067.1 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X4, ncRNA
>XR_942066.1 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X3, ncRNA
>XR_942064.1 PREDICTED: Homo sapiens uncharacterized LOC105370256 (LOC105370256), transcript variant X2, ncRNA
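
The `grep -v NC` step in the command above drops every accession that contains "NC", which removes the NC_ chromosome records. If the search ever returns other genomic records (for example NT_, NW_ or NG_ accessions), they would pass through. A stricter alternative, shown here only as a sketch, is to keep the RefSeq ncRNA transcript prefixes explicitly. The function name `keep_ncrna_accessions` is hypothetical and not part of EntrezDirect.

```shell
# keep_ncrna_accessions is a hypothetical helper name.
# Reads accessions on stdin (one per line) and keeps only RefSeq ncRNA
# transcripts: XR_ (predicted models) and NR_ (curated records).
keep_ncrna_accessions() {
    # "|| true" avoids a non-zero exit status when nothing matches
    grep -E '^(XR|NR)_[0-9]+\.[0-9]+$' || true
}

# Possible use in place of "grep -v NC":
#   esearch -db nuccore -query LOC105370256 | efetch -format acc | keep_ncrna_accessions
```

Whether you want NR_ records as well as XR_ records depends on your list; adjust the pattern accordingly.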

LOC designations are used for genes that do not yet have a final gene symbol.

Symbols beginning with LOC. When a published symbol is not available, and orthologs have not yet been determined, Gene will provide a symbol that is constructed as 'LOC' + the GeneID. This is not retained when a replacement symbol has been identified, although queries by the LOC term are still supported. In other words, a record with the symbol LOC12345 is equivalent to GeneID = 12345. So if the symbol changes, the record can still be retrieved on the web using LOC12345 as a query, or from any file using GeneID = 12345.
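
Since a LOC symbol is just 'LOC' followed by the GeneID, the conversion can be done locally without any query. A minimal sketch, assuming the input is a single symbol; `loc_to_geneid` is a hypothetical helper name:

```shell
# loc_to_geneid is a hypothetical helper name.
# Prints the numeric GeneID for a symbol of the form LOC<digits>;
# returns a non-zero status for anything else.
loc_to_geneid() {
    case "$1" in
        LOC*) ;;            # has the LOC prefix, keep going
        *) return 1 ;;      # not a LOC symbol
    esac
    id=${1#LOC}             # strip the leading "LOC"
    case "$id" in
        ''|*[!0-9]*) return 1 ;;   # empty or contains a non-digit
    esac
    printf '%s\n' "$id"
}

# Example: loc_to_geneid LOC105370256   prints 105370256
```

The resulting number can be used wherever a GeneID is expected, for example with `datasets download gene gene-id`.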

LOC105377806 and LOC107984590 do not appear to be valid IDs, so you will get an error for them.

$ esearch -db nuccore -query LOC107984591  | efetch -format acc | grep -v NC | xargs -n 1 sh -c 'efetch -db nuccore -id "$0" -format fasta' | grep "^>"
>XR_007083741.1 PREDICTED: Homo sapiens uncharacterized LOC107984591 (LOC107984591), ncRNA
>XR_001750037.2 PREDICTED: Homo sapiens uncharacterized LOC107984591 (LOC107984591), ncRNA
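
To run the same pipeline over a whole list rather than one symbol at a time, a loop along these lines should work. This is a sketch under assumptions: EntrezDirect is installed and on PATH, the list holds one symbol per line, and `fetch_ncrna_for_list` is a hypothetical helper name. The `</dev/null` redirections are a precaution so that the tools do not consume the rest of the list from standard input inside the loops.

```shell
# fetch_ncrna_for_list is a hypothetical helper name, not part of EntrezDirect.
# $1: text file with one gene symbol per line.
fetch_ncrna_for_list() {
    while IFS= read -r symbol || [ -n "$symbol" ]; do
        # skip blank lines
        [ -z "$symbol" ] && continue
        # same steps as the single-symbol command: search, list accessions,
        # drop NC_ chromosome records, then fetch each record as FASTA
        esearch -db nuccore -query "$symbol" </dev/null |
            efetch -format acc |
            grep -v NC |
            while IFS= read -r acc; do
                efetch -db nuccore -id "$acc" -format fasta </dev/null
            done
    done <"$1"
}

# Usage: fetch_ncrna_for_list list.txt > lncRNA.fasta
```

Symbols that return no records simply contribute nothing to the output, so it is worth checking afterwards which IDs are missing.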

ADD REPLY • link 18 months ago by GenoMax 147k

1

Hi, following GenoMax's suggestion, you can use NCBI Datasets. I used the IDs you posted to create a list (list.txt):

LOC105370256
LOC105377806
LOC107984590
LOC107984591

Then I used the NCBI Datasets CLI to download the FASTA sequences for those gene symbols.

datasets download gene symbol --inputfile list.txt --filename genes.zip

By default, NCBI Datasets includes transcript and protein sequences (or, in this case, only RNA sequences):

unzip genes.zip -d genes
Archive:  genes.zip
  inflating: genes/README.md         
  inflating: genes/ncbi_dataset/data/rna.fna  
  inflating: genes/ncbi_dataset/data/data_report.jsonl  
  inflating: genes/ncbi_dataset/data/dataset_catalog.json

All requested transcript sequences will be in the file rna.fna:

grep ">" genes/ncbi_dataset/data/rna.fna | head

>XR_942064.1 LOC105370256 [organism=Homo sapiens] [GeneID=105370256] [transcript=X2]
>XR_942067.1 LOC105370256 [organism=Homo sapiens] [GeneID=105370256] [transcript=X4]
>XR_001749902.1 LOC105370256 [organism=Homo sapiens] [GeneID=105370256] [transcript=X9]
>XR_001749904.1 LOC105370256 [organism=Homo sapiens] [GeneID=105370256] [transcript=X13]
>XR_001749901.2 LOC105370256 [organism=Homo sapiens] [GeneID=105370256] [transcript=X10]
>XR_001749903.2 LOC105370256 [organism=Homo sapiens] [GeneID=105370256] [transcript=X11]
>XR_001749900.2 LOC105370256 [organism=Homo sapiens] [GeneID=105370256] [transcript=X7]
>XR_942066.1 LOC105370256 [organism=Homo sapiens] [GeneID=105370256] [transcript=X3]
>XR_942068.3 LOC105370256 [organism=Homo sapiens] [GeneID=105370256] [transcript=X5]
>XR_942075.3 LOC105370256 [organism=Homo sapiens] [GeneID=105370256] [transcript=X8]

You can also retrieve the gene sequences by using the flag --include gene.

One important point is that LOC genes are part of an annotation that can be updated, and an update can result in a gene being discontinued. In this case, two of them (LOC105377806 and LOC107984590) are discontinued and we have no data for them.
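
To see which symbols from your list came back with no sequences (for example because they were discontinued), you can compare the list against the headers in rna.fna. The sketch below assumes the header layout shown above, where the gene symbol is the second whitespace-separated field; `report_missing_symbols` is a hypothetical helper name, not a Datasets command.

```shell
# report_missing_symbols is a hypothetical helper name.
# $1: list file with one symbol per line
# $2: FASTA file whose headers look like ">ACCESSION SYMBOL [key=value] ..."
# Prints every symbol from the list that has no record in the FASTA file.
report_missing_symbols() {
    awk -v fasta="$2" '
        # first file: remember the symbol (field 2) of every header line
        FILENAME == fasta { if (/^>/) seen[$2] = 1; next }
        # second file: print non-empty list entries that were never seen
        NF && !($1 in seen) { print $1 }
    ' "$2" "$1"
}

# Usage: report_missing_symbols list.txt genes/ncbi_dataset/data/rna.fna
```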

I hope this helps! :)

ADD REPLY • link 18 months ago by MirianT_NCBI ▴ 760