Using the rentrez package.
First, define a function to extract the RefSeq category:
refseq_cat <- function(ids) {
  # Query each assembly ID and return its RefSeq category
  sapply(ids, function(i) {
    foo = entrez_summary(db = "assembly", id = i)
    foo$refseq_category
  }, USE.NAMES = FALSE)
}
The code should work when the assembly is labeled as 'representative genome'.
# Load library
library(rentrez)
# Get the assembly entries for a given species
my_assembly = entrez_search(db = "assembly", term = "Gorilla gorilla[ORGN]")
# Get the index for the 'representative genome' from those entries
rep_genome_idx = which(refseq_cat(my_assembly$ids) == "representative genome")
# Get the accession of the 'representative genome' (if any)
foo = entrez_summary(db = "assembly", id= my_assembly$ids[rep_genome_idx])
foo$assemblyaccession
[1] "GCF_029281585.2"
The same for 'Cicer arietinum':
my_assembly = entrez_search(db = "assembly", term = "Cicer arietinum[ORGN]")
rep_genome_idx = which(refseq_cat(my_assembly$ids) == "representative genome")
foo = entrez_summary(db = "assembly", id= my_assembly$ids[rep_genome_idx])
foo$assemblyaccession
[1] "GCF_000331145.1"
The same for 'Felis catus':
my_assembly = entrez_search(db = "assembly", term = "Felis catus[ORGN]")
rep_genome_idx = which(refseq_cat(my_assembly$ids) == "representative genome")
foo = entrez_summary(db = "assembly", id= my_assembly$ids[rep_genome_idx])
foo$assemblyaccession
[1] "GCF_018350175.1"
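The three runs above repeat the same steps, so they could be wrapped in a loop over species. The following is only a sketch: `pick_accession` and `rep_accession` are hypothetical helper names, not part of rentrez, and the sketch assumes that each assembly summary exposes `refseq_category` and `assemblyaccession` as in the output above. Note that `entrez_search` returns only a limited number of IDs by default, so `retmax` may need raising for species with many assemblies.

```r
# Hypothetical helper: return the accession of the first summary whose
# RefSeq category is in `categories`, or NA if none matches.
pick_accession <- function(summaries, categories = "representative genome") {
  for (s in summaries) {
    cat_i <- s$refseq_category
    if (!is.null(cat_i) && cat_i %in% categories) {
      return(s$assemblyaccession)
    }
  }
  NA_character_
}

# Hypothetical wrapper: search the assembly database for one species,
# fetch all summaries in a single request, and pick the accession.
rep_accession <- function(species, retmax = 100) {
  hits <- rentrez::entrez_search(db = "assembly",
                                 term = paste0(species, "[ORGN]"),
                                 retmax = retmax)
  if (length(hits$ids) == 0) return(NA_character_)
  summaries <- rentrez::entrez_summary(db = "assembly", id = hits$ids,
                                       always_return_list = TRUE)
  pick_accession(summaries)
}

species <- c("Gorilla gorilla", "Cicer arietinum", "Felis catus")
```

With network access, `vapply(species, rep_accession, character(1))` should then give one accession (or NA) per species. Returning NA instead of failing keeps the loop going when a species has no assembly carrying that label.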
NCBI datasets also works but returns JSON.

GenoMax, thank you so much for your reply. The problem is that I want only one reference assembly for each species, not all the versions, and I don't know how to specify that in the command using esearch. If there is such an option, I will just create a loop for that and get all the accession numbers.