Question

Retrieving gene list from NCBI

0

2.9 years ago

v.berriosfarias ▴ 140

Hello, I want to retrieve a TSV file with summary text for a set of gene IDs so I can use it as input for QUAST assembly assessment.

The issue is that I was reading this: https://www.ncbi.nlm.nih.gov/books/NBK3840/#genefaq.How_to_extract_the_Summary_text

There, under the subtitle "How to extract the Summary text from records in Gene", it appears that it is convenient to use a Perl script named geneDocSum.pl, which I can download from here: https://ftp.ncbi.nih.gov/gene/tools/

But I don't understand how to use the tool when querying gene information. As an example they provide this:

geneDocSum.pl -q "has_summary[prop] AND human[orgn]" -o tab -t Name -t Summary

With the above command it is possible to retrieve all human gene records, but I don't understand the "has_summary[prop] AND human[orgn]" syntax. I want to retrieve all gene records from this alga: https://www.ncbi.nlm.nih.gov/data-hub/gene/table/taxon/63659/?gene-type=Protein-coding

So how can I do that?

Thanks for your time

NCBI gene perl • 1.2k views

ADD COMMENT • link updated 2.9 years ago by GenoMax 147k • written 2.9 years ago by v.berriosfarias ▴ 140

1

You can try this:

geneDocSum.pl -q "has_summary[prop] AND Ulva compressa [orgn]" -o tab -t Name -t Summary

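Each bracketed tag names the search field applied to the term before it: [orgn] is the organism field, so replace human with your organism of choice, and has_summary[prop] restricts the results to records that have summary text.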
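As a rough illustration of how such a query string is put together, here is a minimal sketch. The helper name build_gene_query is hypothetical (it is not part of geneDocSum.pl or any NCBI tool), and quoting the organism name as a phrase is my own choice rather than something the FAQ requires.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an NCBI tool): builds an Entrez Gene query string
# from an organism name and an optional [prop] filter.
build_gene_query() {
    local organism="$1"
    local property="${2:-}"
    # Quote the organism so a multi-word name is searched as one phrase.
    local query="\"${organism}\"[orgn]"
    if [ -n "$property" ]; then
        query="${property}[prop] AND ${query}"
    fi
    printf '%s\n' "$query"
}

# Print the query for the species from the question.
build_gene_query "Ulva compressa" has_summary

# To run the search itself (needs geneDocSum.pl and network access):
# geneDocSum.pl -q "$(build_gene_query "Ulva compressa" has_summary)" -o tab -t Name -t Summary
```

Leaving out the second argument drops the [prop] filter, which may be useful when an organism has gene records but no summaries.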

ADD REPLY • link 2.9 years ago by GenoMax 147k

score 1 · Answer 1 · 2021-12-30

UPDATE: I just noticed that you are specifically looking for the gene summary format, which is not available (yet) in NCBI Datasets, so this might not be helpful. Sorry about that.
Also, this species, Ulva compressa, doesn't have any gene summaries. The genes in the gene table you posted are from either the mitochondrion or the chloroplast. There are two genomes in the same genus: https://www.ncbi.nlm.nih.gov/labs/data-hub/taxonomy/3118/, but neither has nuclear gene annotations. Let me know if there's anything else I can help you with.

Hi,
In the link you posted, you already have a list of all genes available for the reference genome of that algal species. From that page, you can click the Download button (circled in red) and choose what you need:

Data package: includes gene, transcript and protein sequences (you choose which files you want)
Data table: downloads a TSV file with the columns currently displayed (you can add or remove columns by clicking Select Columns, circled in blue). If you only need the gene IDs, you can select only that column.

Let me know if that's the info you need. NCBI Datasets also has a command line tool that might be useful for what you need.

score 1 · Answer 2 · 2021-12-30

Using Entrez Direct (EDirect):

$ esearch -db gene -query "Ulva compressa [orgn] AND alive [prop]" | efetch -format tabular
tax_id  Org_name    GeneID  CurrentID   Status  Symbol  Aliases description other_designations  map_location    chromosome  genomic_nucleotide_accession.version    start_position_on_the_genomic_accession end_position_on_the_genomic_accession   orientation exon_count  OMIM
63659   Ulva compressa  39331125    0   live    nad1    E2297_mgp07 NADH-ubiquinone oxidoreductase chain 1  NADH-ubiquinone oxidoreductase chain 1      MT  NC_041082.1 59438   60409   plus    0
63659   Ulva compressa  39331114    0   live    rpl14   E2297_mgp21 ribsomal protein L14    ribsomal protein L14        MT  NC_041082.1 39735   40196   plus    0
63659   Ulva compressa  39331113    0   live    nad3    E2297_mgp16 NADH-ubiquinone oxidoreductase chain 3  NADH-ubiquinone oxidoreductase chain 3      MT  NC_041082.1 46473   46826   plus    0
63659   Ulva compressa  39331106    0   live    rps12   E2297_mgp06 ribosomal protein S12   ribosomal protein S12       MT  NC_041082.1 60863   61231   plus    0
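If only a few of these columns are needed, the table can be trimmed with standard tools. The sketch below assumes the real efetch output is tab-separated (the tabs appear flattened to spaces in this post) and that the column order matches the header above. The function name select_gene_columns and the file name ulva_genes.tsv are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical filter: keep GeneID, Symbol and description, which are
# columns 3, 6 and 8 in the tabular header shown above.
# cut splits on tabs by default.
select_gene_columns() {
    cut -f 3,6,8
}

# Offline demonstration on the first eight fields of one row from the output above.
printf '63659\tUlva compressa\t39331125\t0\tlive\tnad1\tE2297_mgp07\tNADH-ubiquinone oxidoreductase chain 1\n' \
    | select_gene_columns

# With EDirect installed and network access, the same filter could be applied
# to the live query (ulva_genes.tsv is an arbitrary output name):
# esearch -db gene -query "Ulva compressa [orgn] AND alive [prop]" \
#     | efetch -format tabular | select_gene_columns > ulva_genes.tsv
```

The header line passes through the same filter, so the resulting file keeps its column names.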