Question

Getting refseq assembly accession numbers from list of species names

0

Entering edit mode

8 months ago

Kuldeep • 0

Hello,

I am trying to build a custom haystac database. I have a list of species. Is there any program which I can use so that I can use this list of species names as an input and get the list of refseq accession numbers of the genome assemblies of the respective species? I tried to use the "datasets summary" command after following [API for NCBI Accession ID (GenBank or RefSeq ) generation from a list of species names?]. But it did not work.

Can you please help?

Thank you :)

haystac accession refseq assembly ncbi • 887 views

ADD COMMENT • link updated 8 months ago by GenoMax 148k • written 8 months ago by Kuldeep • 0

0

Entering edit mode

NCBI datasets also works but returns JSON

$ datasets summary genome taxon "Gorilla gorilla" --reference --assembly-version all

{"reports": [{"accession":"GCF_029281585.2","annotation_info":{"busco":{"busco_lineage":"primates_odb10","busco_ver":"4.1.4","complete":0.9904209,"duplicated":0.014731495,"fragmented":0.002539913,"missing":0.007039187,"single_copy":0.9756894,"total_count":"13780"},"method":"Best-placed RefSeq; Gnomon; cmsearch; tRNAscan-SE","name":"GCF_029281585.2-RS_2024_02","pipeline":"NCBI eukaryotic genome annotation pipeline","provider":"NCBI RefSeq","release_date":"2024-02-27","report_url":"https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Gorilla_gorilla_gorilla/GCF_029281585.2-RS_2024_02.html","software_version":"10.2","stats":{"gene_counts":{"non_coding":11260,"other":197,"protein_coding":22635,"pseudogene":7077,"total":41169}},"status":"Full annotation"},"assembly_info":{"assembly_level":"Complete Genome","assembly_method":"Verkko v. 1.4","assembly_name":"NHGRI_mGorGor1-v2.0_pri","assembly_status":"current","assembly_type":"haploid","bioproject_accession":"PRJNA942267","bioproject_lineage":[{"bioprojects":[{"accession":"PRJNA942267","parent_accessions":["PRJNA930782"],"title":"Gorilla gorilla gorilla Genome sequencing and assembly"},{"accession":"PRJNA930782","title":"Gorilla gorilla gorilla Umbrella project"}]}],"biosample":{"accession":"SAMN04003007","attributes":[{"name":"isolate","value":"KB3781"},{"name":"dev_stage","value":"Mature"},{"name":"sex","value":"male"},{"name":"tissue","value":"Fibroblast"},{"name":"sub_species","value":"gorilla"}],"description":{"organism":{"organism_name":"Gorilla gorilla gorilla","tax_id":9595},"title":"Gorilla Y flow-sorted DNA"},"last_updated":"2020-03-26T10:53:53.166","models":["Model organism or animal"],"owner":{"contacts":[{}],"name":"The Pennsylvania State University"},"package":"Model.organism.animal.1.0","publication_date":"2015-11-01T00:00:00.000","sample_ids":[{"label":"Sample name","value":"GgoY_KB3781"},{"db":"SRA","value":"SRS1041398"}],"status":{"status":"live","when":"2015-12-31T08:47:16"},"submission_date":"2015-08-20T12:43:26.000"},"blast_url":"https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch\u0026PROG_DEF=blastn\u0026BLAST_SPEC=GDH_GCF_029281585.2","comments":"This is the goldenpath principal haplotype containing all chromosomes. The maternal (JAQQLJ000000000) and paternal (JAQQLK000000000) haplotypes are also available.","paired_assembly":{"accession":"GCA_029281585.2","only_refseq":"chromosome MT","status":"current"},"refseq_category":"representative genome","release_date":"2024-01-08","sequencing_tech":"PacBio Sequel; Oxford Nanopore PromethION","submitter":"National Human Genome Research Institute, National Institutes of Health"},"assembly_stats":{"contig_l50":10,"contig_n50":150804130,"gc_count":"1434852044","gc_percent":40.5,"genome_coverage":"109.0x","number_of_component_sequences":25,"number_of_contigs":27,"number_of_organelles":1,"number_of_scaffolds":25,"scaffold_l50":10,"scaffold_n50":150804130,"total_number_of_chromosomes":25,"total_sequence_length":"3545834224","total_ungapped_length":"3543834224"},"current_accession":"GCF_029281585.2","organelle_info":[{"description":"Mitochondrion","submitter":"National Human Genome Research Institute, National Institutes of Health","total_seq_length":"16412"}],"organism":{"common_name":"western lowland gorilla","infraspecific_names":{"isolate":"KB3781","sex":"male"},"organism_name":"Gorilla gorilla gorilla","tax_id":9595},"paired_accession":"GCA_029281585.2","source_database":"SOURCE_DATABASE_REFSEQ","wgs_info":{"master_wgs_url":"https://www.ncbi.nlm.nih.gov/nuccore/JARGGP000000000.2","wgs_contigs_url":"https://www.ncbi.nlm.nih.gov/Traces/wgs/JARGGP02","wgs_project_accession":"JARGGP02"}}],"total_count": 1}

ADD REPLY • link 8 months ago by GenoMax 148k

0

Entering edit mode

GenoMax thank you so much for your reply. The problem is I want only one reference assembly for each species and not the all versions, and I dont know how to specify that in the command using esearch. If there is any such option, then I will just create a loop for that and get all the accession numbers.

ADD REPLY • link 8 months ago by Kuldeep • 0

score 4 · Answer 1 · 2024-04-04

Using EntrezDirect:

$ esearch -db assembly -query "Gorilla gorilla" | efetch -format docsum | xtract -pattern DocumentSummary -element RefSeq
GCF_029281585.2
GCF_029281585.1
GCF_008122165.1
GCF_000151905.2
GCF_000151905.1

$ esearch -db assembly -query "Gallus gallus" | efetch -format docsum | xtract -pattern DocumentSummary -element RefSeq
GCF_016700215.2
GCF_016699485.2
GCF_016700215.1
GCF_000002315.6
GCF_000002315.5
GCF_000002315.4
GCF_000002315.3
GCF_000002315.2
GCF_000002315.1

score 0 · Answer 2 · 2024-04-07

Using the rentrezpackage.

First, define a function to extract the refseq category :

refseq_cat <- function(ids) {
 sapply(ids, function(i) {
 foo = entrez_summary(db = "assembly", id = i)
 foo$refseq_category
  }, USE.NAMES = F)
}

The code should work when the assembly is labeled as 'representative genome'.

# Load library
library(rentrez)

# Get the assembly entries for a given species 
my_assembly  = entrez_search(db = "assembly", term = "Gorilla gorilla[ORGN]")

# Get the index for the 'representative genome'  from those entries
rep_genome_idx = which(refseq_cat(my_assembly$ids) == "representative genome")


# Get the id. of the 'representative genome' (if any)  
foo = entrez_summary(db = "assembly",   id= my_assembly$ids[rep_genome_idx])
foo$assemblyaccession
[1] "GCF_029281585.2"

Idem for 'Cicer arietinum'

my_assembly  = entrez_search(db = "assembly", term = "Cicer arietinum[ORGN]")
rep_genome_idx = which(refseq_cat(my_assembly$ids) == "representative genome")
foo = entrez_summary(db = "assembly",   id= my_assembly$ids[rep_genome_idx])
foo$assemblyaccession
[1] "GCF_000331145.1"

Idem for 'Felis catus'

my_assembly  = entrez_search(db = "assembly", term = "Felis catus[ORGN]")
rep_genome_idx = which(refseq_cat(my_assembly$ids) == "representative genome")
foo = entrez_summary(db = "assembly",   id= my_assembly$ids[rep_genome_idx])
foo$assemblyaccession
[1] "GCF_018350175.1"

score 0 · Answer 3 · 2024-04-08

Hi, You can do that using datasets, but you should use the flag --reference. That should return the current reference genome for the taxa requested. Like this:

datasets summary genome taxon "gorilla gorilla" --reference --as-json-lines | dataformat tsv genome --fields organism-name,accession

Organism Name   Assembly Accession
Gorilla gorilla gorilla GCF_029281585.2

You can loop through your list of taxa (one per line, plain text) like this:

cat taxa.list | while read TAXON; do 
   datasets summary genome taxon "$TAXON" --reference --as-json-lines | \
   dataformat tsv genome --fields organism-name,accession --elide-header; 
done

Gorilla gorilla gorilla GCF_029281585.2
Homo sapiens    GCF_000001405.40
Mustela putorius furo   GCF_011764305.1
Gallus gallus   GCF_016699485.2

Let me know if you have any other questions or if this solution doesn't work for you.