I'm trying to build the correct Entrez query to retrieve information on complete eukaryotic genomes from the NCBI Genome database.
The genome browser (http://www.ncbi.nlm.nih.gov/genome/browse/) lists 185 entries when filtered for complete eukaryotic genomes.
So far I have tried these queries:
eukaryota[organism] AND complete[status] ; count = 319
eukaryota[organism] AND complete[status] AND "genome sequencing"[Project Type] ; count = 300
Any ideas on the best query for what I want, or on which query corresponds to what the browser displays?
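For reference, here is a minimal Python sketch (not BioPerl) of how counts like these can be fetched through the E-utilities ESearch endpoint. The function names are my own and purely illustrative, and the counts returned by the service today may differ from the ones quoted above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="genome"):
    """Build an ESearch URL that asks only for the number of hits."""
    params = {"db": db, "term": term, "rettype": "count"}
    return ESEARCH + "?" + urlencode(params)


def parse_count(xml_text):
    """Return the integer held in the <Count> element of an ESearch reply."""
    root = ET.fromstring(xml_text)
    return int(root.findtext("Count"))


def entrez_count(term, db="genome"):
    """Send the query to NCBI (network access needed) and return the hit count."""
    with urlopen(esearch_url(term, db)) as response:
        return parse_count(response.read())


# Example usage (hypothetical helper names, queries taken from the post):
# for term in ('eukaryota[organism] AND complete[status]',
#              'eukaryota[organism] AND complete[status] '
#              'AND "genome sequencing"[Project Type]'):
#     print(term, entrez_count(term))
```

Keeping URL building and XML parsing in separate functions makes it possible to check both without touching the network.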
I'm not convinced that the data on that page can be retrieved via Entrez.
If you follow the link to the FTP site and download the file eukaryotes.txt, you'll see a field named Status. This is where the value of 185 comes from: it is the number of rows whose Status is "Chromosomes". I opened the file in R:
euk <- read.table("eukaryotes.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE, comment.char = "", quote = "")
table(euk$Status)
#          Chromosomes              No data Scaffolds or contigs
#                  185                 1609                  722
#        SRA or Traces
#                  455
However, if you experiment with the Advanced query builder at the NCBI website, you'll find that:
The Genome database has a Status field, but "chromosomes" is not a valid value for it.
The BioProject and Assembly databases do not have a Status field.
So it may be that there is no direct relation to the Entrez databases. Or I may be wrong and it's just very difficult to formulate the query :)
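One way to check which search fields a database exposes, without clicking through the Advanced query builder, is the E-utilities EInfo endpoint. Here is a minimal Python sketch. The function names are hypothetical, and EInfo lists field names only, not the values a field accepts, so it cannot by itself tell you whether "chromosomes" is a valid Status value.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

EINFO = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi"


def einfo_url(db):
    """Build an EInfo URL describing one Entrez database."""
    return EINFO + "?" + urlencode({"db": db})


def parse_fields(xml_text):
    """Map each field's short name to its full name from an EInfo XML reply."""
    root = ET.fromstring(xml_text)
    return {
        field.findtext("Name"): field.findtext("FullName")
        for field in root.iter("Field")
    }


def database_fields(db):
    """Fetch the field list for a database from NCBI (network access needed)."""
    with urlopen(einfo_url(db)) as response:
        return parse_fields(response.read())


# Example usage: compare which databases have a field whose full name is "Status".
# for db in ("genome", "bioproject", "assembly"):
#     fields = database_fields(db)
#     print(db, "Status" in fields.values())
```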
That's right, but it feels weird that NCBI doesn't use the content of its own databases to generate this file...
I have started using that file, since it already contains most of the information I need. I'm just not very comfortable working with it without knowing how it's generated and whether it corresponds to the content of the NCBI databases.
Edit: Interestingly, "eukaryota"[organism] returns about 2100 entries, while the eukaryotes section of the genome browser lists about 2900.
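In case it helps anyone else working from the file rather than Entrez, here is a minimal Python sketch that tallies the Status column and keeps only the rows marked "Chromosomes". It assumes a tab-separated file with a header row containing a Status column, as described above. The function names are made up for illustration.

```python
import csv
from collections import Counter


def read_rows(lines):
    """Parse tab-separated lines with a header row into a list of dicts.

    Quote characters are treated as ordinary text, matching quote = "" in R.
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def status_counts(rows):
    """Count how many rows carry each Status value."""
    return Counter(row["Status"] for row in rows)


def with_status(rows, status="Chromosomes"):
    """Keep only the rows whose Status equals the requested value."""
    return [row for row in rows if row["Status"] == status]


# Example usage, assuming eukaryotes.txt has been downloaded from the FTP site:
# with open("eukaryotes.txt", newline="", encoding="utf-8") as handle:
#     rows = read_rows(handle)
# print(status_counts(rows))
# complete = with_status(rows)
```

Each kept row is a dict keyed by the file's own column headers, so the other columns can be read from it directly.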
Hello!
What kind of information do you want exactly?
Just the number of complete genomes?
No, I was trying to reproduce the genome browser output for complete eukaryotic genomes using Entrez. That's why I started comparing the numbers of complete genomes, to see whether my queries were correct. What I actually want is information such as assembly ID, taxon ID, number of loci, % GC, etc. for all complete eukaryotic genomes, using BioPerl and Entrez. The problem is: if what I get through Entrez queries differs from the genome browser's information, which one do I choose? And is there a query that would give the same output?