Question

Level of data/function redundancy between GOLD and NCBI Genome databases

1

9.9 years ago

bioinformaticsscientist ▴ 20

I hope this question is not too silly or too broad.

For many years, I have obtained genome data directly from GenBank, using the various APIs available for Perl or Python, and I have found the information there quite sufficient for whatever analysis I had in mind. For just as long, I have been aware of other resources, such as GOLD, that provide lists of genome projects, data downloads, genome maps, statistics, etc., and that are supposedly better at curating genomic data (simply because they specialize in that purpose). However, I have very little experience with those databases, and since they are still here after many years they apparently have some value, so I am wondering what I am missing by sticking with GenBank alone.

I guess my question is: what is the level of data redundancy between these databases, and, function-wise, what are the target audiences of the different genome data portals, including GOLD and others?

Thanks in advance.

database Genbank Genome GOLD • 2.2k views

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by bioinformaticsscientist ▴ 20

0

Have you... have you tried Ensembl?

ADD REPLY • link 2.7 years ago by Ram 44k

1

Thanks for the comment.

Yes, I have visited Ensembl a few times; it is kinda hard to miss something as famous as that. But, as with GOLD, it seems to me like just another place to get information on genomes, and I haven't felt the need to go there. I am not familiar with these resources, hence the question about the relationships between these databases: are the data synchronized, which one is more up to date, etc.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by bioinformaticsscientist ▴ 20

0

I was just trying to make a bad joke - the more data sources we look at, the more redundancy we'll see. As someone that works on human genetics, I stick to UCSC genome browser, and it gives me ref links to any other DB that I might be interested in.

ADD REPLY • link 2.7 years ago by Ram 44k

0

Oh... Sorry about my lack of a sense of humor (guess I was too nervous to joke, since this is like my second post here at BioStars). From the answer below and your comment, it does feel like this might be a common issue (which makes me feel better). Maybe a review paper or editorial on this topic in Genome Res is in order? :)

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by bioinformaticsscientist ▴ 20

0

That's OK, people rarely make jokes in discussion forums (humorous remarks can feel offensive when lost in translation).

This is indeed a common issue, and at times it drives a lot of folks who crave standards insane. The good part is that once you accept the redundancy and move on, you benefit from it :)

ADD REPLY • link 2.7 years ago by Ram 44k

Ram · Answer 1 · 2015-01-06

You could try searching for a species and comparing the results in different databases. For "Yersinia pestis", I find 138 sequencing projects in GOLD, which is about the same as searching NCBI BioProject for "Yersinia pestis[ORGN] AND genome sequencing[Filter]". In my experience, these are nearly the same for Bacteria, although GOLD tries to store minimum information standards that nobody fills out. You can also check NCBI Genomes (go to browse genomes) and find 225 Y. pestis genomes. This is the same table that's in ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt, which is updated daily; the extra genomes are from projects that sequence multiple strains, like 47685. For some reason, this table is missing 28 genomes that are listed in GOLD or NCBI BioProject, such as Java 9, Pestoides B, C, D, or E, and others, so I always combine the genomes table with the missing bioprojects if I need a list of sequenced strains. You can also get a list of bioprojects using Entrez Direct, as below, to compare.

esearch -db bioproject -query 'Yersinia pestis[Organism] AND genome sequencing[Filter] AND scope monoisolate[Filter]' | esummary > yp.out
# Many other tags are available in yp.out (or use efetch XML format for locus tag prefixes and more fields)
cat yp.out | xtract -pattern DocumentSummary -element Project_Id TaxId Organism_Name
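
To make the "combine genomes and missing bioprojects" step concrete, here is a minimal sketch of one way to list BioProject IDs that appear in the Entrez Direct output but not in the genome report table. The function name `missing_bioprojects` is hypothetical, and the sketch assumes (this is not stated in the post) that the report is tab-delimited with a header row containing a column named "BioProject ID", and that the xtract output has been saved to a file with the project ID in its first column.

```shell
# Sketch only: print BioProject IDs found in the Entrez Direct list but not
# in the genome report table for a given organism.
# Assumptions (not from the original post):
#   - the report is tab-delimited and its header row names a column
#     "BioProject ID" (pass a different name as the 4th argument if not)
#   - the ID list has the project ID in its first tab-separated column
missing_bioprojects() {
  report="$1"                 # e.g. prokaryotes.txt
  ids="$2"                    # e.g. saved xtract output
  organism="$3"               # e.g. "Yersinia pestis"
  col="${4:-BioProject ID}"   # header name of the project ID column

  awk -F '\t' -v org="$organism" -v col="$col" '
    FNR == NR {                       # first file: the genome report
      if (FNR == 1) {                 # header row: locate the ID column
        for (i = 1; i <= NF; i++)
          if ($i == col) c = i
        next
      }
      # remember project IDs of rows whose name starts with the organism
      if (c && index($1, org) == 1) seen[$c] = 1
      next
    }
    !($1 in seen) { print $1 }        # second file: IDs absent from report
  ' "$report" "$ids" | sort -u
}
```

For example, after saving the xtract output above to a file such as yp_ids.tsv, calling `missing_bioprojects prokaryotes.txt yp_ids.tsv "Yersinia pestis"` would print the project IDs to add by hand. Matching on the organism name prefix is a simplification; matching on TaxID may be more reliable if strain names are inconsistent.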

If your goal is to find every sequenced Y. pestis strain, then you should probably also check NCBI BioSample, where the query "Yersinia pestis[ORGN] AND biosample sra[filter]" picks up a few additional strains that are in the SRA only. I do not know of a single source that lists all ~300 sequenced strains. Other databases, like Ensembl and PATRIC, have annotated genomes only (90 and 107, respectively).