Getting refseq assembly accession numbers from list of species names
3
0
Entering edit mode
8 months ago
Kuldeep • 0

Hello,

I am trying to build a custom haystac database. I have a list of species. Is there any program which I can use so that I can use this list of species names as an input and get the list of refseq accession numbers of the genome assemblies of the respective species? I tried to use the "datasets summary" command after following [API for NCBI Accession ID (GenBank or RefSeq ) generation from a list of species names?]. But it did not work.

Can you please help?

Thank you :)

haystac accession refseq assembly ncbi • 888 views
ADD COMMENT
0
Entering edit mode

NCBI datasets also works but returns JSON

$ datasets summary genome taxon "Gorilla gorilla" --reference --assembly-version all

{"reports": [{"accession":"GCF_029281585.2","annotation_info":{"busco":{"busco_lineage":"primates_odb10","busco_ver":"4.1.4","complete":0.9904209,"duplicated":0.014731495,"fragmented":0.002539913,"missing":0.007039187,"single_copy":0.9756894,"total_count":"13780"},"method":"Best-placed RefSeq; Gnomon; cmsearch; tRNAscan-SE","name":"GCF_029281585.2-RS_2024_02","pipeline":"NCBI eukaryotic genome annotation pipeline","provider":"NCBI RefSeq","release_date":"2024-02-27","report_url":"https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Gorilla_gorilla_gorilla/GCF_029281585.2-RS_2024_02.html","software_version":"10.2","stats":{"gene_counts":{"non_coding":11260,"other":197,"protein_coding":22635,"pseudogene":7077,"total":41169}},"status":"Full annotation"},"assembly_info":{"assembly_level":"Complete Genome","assembly_method":"Verkko v. 1.4","assembly_name":"NHGRI_mGorGor1-v2.0_pri","assembly_status":"current","assembly_type":"haploid","bioproject_accession":"PRJNA942267","bioproject_lineage":[{"bioprojects":[{"accession":"PRJNA942267","parent_accessions":["PRJNA930782"],"title":"Gorilla gorilla gorilla Genome sequencing and assembly"},{"accession":"PRJNA930782","title":"Gorilla gorilla gorilla Umbrella project"}]}],"biosample":{"accession":"SAMN04003007","attributes":[{"name":"isolate","value":"KB3781"},{"name":"dev_stage","value":"Mature"},{"name":"sex","value":"male"},{"name":"tissue","value":"Fibroblast"},{"name":"sub_species","value":"gorilla"}],"description":{"organism":{"organism_name":"Gorilla gorilla gorilla","tax_id":9595},"title":"Gorilla Y flow-sorted DNA"},"last_updated":"2020-03-26T10:53:53.166","models":["Model organism or animal"],"owner":{"contacts":[{}],"name":"The Pennsylvania State University"},"package":"Model.organism.animal.1.0","publication_date":"2015-11-01T00:00:00.000","sample_ids":[{"label":"Sample name","value":"GgoY_KB3781"},{"db":"SRA","value":"SRS1041398"}],"status":{"status":"live","when":"2015-12-31T08:47:16"},"submission_date":"2015-08-20T12:43:26.000"},"blast_url":"https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch\u0026PROG_DEF=blastn\u0026BLAST_SPEC=GDH_GCF_029281585.2","comments":"This is the goldenpath principal haplotype containing all chromosomes. The maternal (JAQQLJ000000000) and paternal (JAQQLK000000000) haplotypes are also available.","paired_assembly":{"accession":"GCA_029281585.2","only_refseq":"chromosome MT","status":"current"},"refseq_category":"representative genome","release_date":"2024-01-08","sequencing_tech":"PacBio Sequel; Oxford Nanopore PromethION","submitter":"National Human Genome Research Institute, National Institutes of Health"},"assembly_stats":{"contig_l50":10,"contig_n50":150804130,"gc_count":"1434852044","gc_percent":40.5,"genome_coverage":"109.0x","number_of_component_sequences":25,"number_of_contigs":27,"number_of_organelles":1,"number_of_scaffolds":25,"scaffold_l50":10,"scaffold_n50":150804130,"total_number_of_chromosomes":25,"total_sequence_length":"3545834224","total_ungapped_length":"3543834224"},"current_accession":"GCF_029281585.2","organelle_info":[{"description":"Mitochondrion","submitter":"National Human Genome Research Institute, National Institutes of Health","total_seq_length":"16412"}],"organism":{"common_name":"western lowland gorilla","infraspecific_names":{"isolate":"KB3781","sex":"male"},"organism_name":"Gorilla gorilla gorilla","tax_id":9595},"paired_accession":"GCA_029281585.2","source_database":"SOURCE_DATABASE_REFSEQ","wgs_info":{"master_wgs_url":"https://www.ncbi.nlm.nih.gov/nuccore/JARGGP000000000.2","wgs_contigs_url":"https://www.ncbi.nlm.nih.gov/Traces/wgs/JARGGP02","wgs_project_accession":"JARGGP02"}}],"total_count": 1}

ADD REPLY
0
Entering edit mode

GenoMax thank you so much for your reply. The problem is I want only one reference assembly for each species and not the all versions, and I dont know how to specify that in the command using esearch. If there is any such option, then I will just create a loop for that and get all the accession numbers.

ADD REPLY
4
Entering edit mode
8 months ago
GenoMax 148k

Using EntrezDirect:

$ esearch -db assembly -query "Gorilla gorilla" | efetch -format docsum | xtract -pattern DocumentSummary -element RefSeq
GCF_029281585.2
GCF_029281585.1
GCF_008122165.1
GCF_000151905.2
GCF_000151905.1

$ esearch -db assembly -query "Gallus gallus" | efetch -format docsum | xtract -pattern DocumentSummary -element RefSeq
GCF_016700215.2
GCF_016699485.2
GCF_016700215.1
GCF_000002315.6
GCF_000002315.5
GCF_000002315.4
GCF_000002315.3
GCF_000002315.2
GCF_000002315.1
ADD COMMENT
0
Entering edit mode
8 months ago
josev.die ▴ 70

Using the rentrezpackage.

First, define a function to extract the refseq category :

refseq_cat <- function(ids) {
 sapply(ids, function(i) {
 foo = entrez_summary(db = "assembly", id = i)
 foo$refseq_category
  }, USE.NAMES = F)
} 

The code should work when the assembly is labeled as 'representative genome'.

# Load library
library(rentrez)

# Get the assembly entries for a given species 
my_assembly  = entrez_search(db = "assembly", term = "Gorilla gorilla[ORGN]")

# Get the index for the 'representative genome'  from those entries
rep_genome_idx = which(refseq_cat(my_assembly$ids) == "representative genome")


# Get the id. of the 'representative genome' (if any)  
foo = entrez_summary(db = "assembly",   id= my_assembly$ids[rep_genome_idx])
foo$assemblyaccession
[1] "GCF_029281585.2"

Idem for 'Cicer arietinum'

my_assembly  = entrez_search(db = "assembly", term = "Cicer arietinum[ORGN]")
rep_genome_idx = which(refseq_cat(my_assembly$ids) == "representative genome")
foo = entrez_summary(db = "assembly",   id= my_assembly$ids[rep_genome_idx])
foo$assemblyaccession
[1] "GCF_000331145.1"

Idem for 'Felis catus'

my_assembly  = entrez_search(db = "assembly", term = "Felis catus[ORGN]")
rep_genome_idx = which(refseq_cat(my_assembly$ids) == "representative genome")
foo = entrez_summary(db = "assembly",   id= my_assembly$ids[rep_genome_idx])
foo$assemblyaccession
[1] "GCF_018350175.1"
ADD COMMENT
0
Entering edit mode
8 months ago
MirianT_NCBI ▴ 770

Hi, You can do that using datasets, but you should use the flag --reference. That should return the current reference genome for the taxa requested. Like this:

datasets summary genome taxon "gorilla gorilla" --reference --as-json-lines | dataformat tsv genome --fields organism-name,accession

Organism Name   Assembly Accession
Gorilla gorilla gorilla GCF_029281585.2

You can loop through your list of taxa (one per line, plain text) like this:

cat taxa.list | while read TAXON; do 
   datasets summary genome taxon "$TAXON" --reference --as-json-lines | \
   dataformat tsv genome --fields organism-name,accession --elide-header; 
done

Gorilla gorilla gorilla GCF_029281585.2
Homo sapiens    GCF_000001405.40
Mustela putorius furo   GCF_011764305.1
Gallus gallus   GCF_016699485.2

Let me know if you have any other questions or if this solution doesn't work for you.

ADD COMMENT

Login before adding your answer.

Traffic: 1975 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6