Question

NCBI genome version vs published genome version - what's 'better'?

1

6.4 years ago

Biogeek ▴ 480

For my organism of choice, a sea anemone (Exaiptasia), there are two versions of the genome available: the original published genome and the NCBI-based genome.

The NCBI version has fewer mRNAs and predicted peptides (~2000 fewer) than the original published genome files. I'm aware that when raw files are uploaded, NCBI runs its Eukaryotic Genome Annotation Pipeline (Splign, ProSplign, Gnomon) and provides an 'updated' version ('their version') of the genome. I've also noticed there are 5 'new' genes that the NCBI version has and the old genome doesn't, which I found by using grep to compare presence and absence of gene IDs.
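For anyone wanting to repeat the presence/absence check without grep, here is a minimal sketch in Python. It assumes both annotations are GFF3 files and that gene identifiers follow the same naming scheme in both (often not true between an author annotation and an NCBI one, in which case an ID mapping is needed first). The function names, the `key` parameter and the file names in the usage note are made up for illustration.

```python
import re


def gene_ids(gff_lines, key="ID"):
    """Collect one attribute (default: ID) from every 'gene' feature in GFF3 lines."""
    pattern = re.compile(r"(?:^|;)" + re.escape(key) + r"=([^;]+)")
    ids = set()
    for line in gff_lines:
        # Skip directives, comments and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        # GFF3 has 9 tab-separated columns; column 3 is the feature type.
        if len(cols) < 9 or cols[2] != "gene":
            continue
        match = pattern.search(cols[8])
        if match:
            ids.add(match.group(1))
    return ids


def compare_gene_sets(published_lines, ncbi_lines, key="ID"):
    """Return gene IDs unique to each annotation and those shared by both."""
    published = gene_ids(published_lines, key)
    ncbi = gene_ids(ncbi_lines, key)
    return {
        "only_published": published - ncbi,
        "only_ncbi": ncbi - published,
        "shared": published & ncbi,
    }
```

Usage would look something like `compare_gene_sets(open("published.gff3"), open("ncbi.gff3"))`, with the two (hypothetical) file names swapped for your own. If the IDs differ only by a prefix or live in another attribute, pass a different `key` or normalise the IDs before comparing.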

When it comes to RNA-seq alignment, I would assume it's best to use the most up-to-date NCBI version rather than the original published genome?

Sorry if this is a 'noob question' to ask.

Thanks.

genome resources ncbi genome published genome • 1.7k views

ADD COMMENT • link 6.4 years ago by Biogeek ▴ 480

1

Depends what your objective is I think. I imagine the original sequence may have included some hand-curated annotations. The NCBI reanalysis is likely on the conservative side. If there are particular features of interest that are in the original but not the NCBI one, I'd say you're well within reason to use the original.

Most people, however, are going to use what's in NCBI for analyses, especially on a large scale, so it's likely that results will be more consistent with other papers and future publications if you use the NCBI one.

In short, it depends on your priorities/questions I'd say.

ADD REPLY • link 6.4 years ago by Joe 22k

1

Thanks for your time jrj.healey.

I imagined that the NCBI version had been more conservatively reviewed and curated, with the potential for discarding useful info. Now there is a GenBank and a RefSeq version. I think I will use V1.1 GenBank (GCA_), as those are the official files that the authors of the genome paper submitted. I suspect the authors reviewed the annotation and did further redundancy removal between V1.0 and V1.1. The RefSeq version (GCF_), I'll pass on. From what I've visualised, the results do not change with respect to our experiment. It's been a worthwhile exercise delving into the differences and how NCBI reviews/curates.
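As a small aside on the GCA_/GCF_ distinction, the prefix of an NCBI assembly accession tells you which collection a download came from. The sketch below is only an illustration: the function name is hypothetical and the accession in the usage note is a made-up placeholder, not the Exaiptasia one.

```python
import re

# Assembly accessions look like GCA_#########.v (GenBank) or GCF_#########.v (RefSeq).
ASSEMBLY_ACCESSION = re.compile(r"^(GCA|GCF)_(\d{9})\.(\d+)$")


def classify_assembly(accession):
    """Return (collection, numeric part, version) for an assembly accession."""
    match = ASSEMBLY_ACCESSION.match(accession.strip())
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    prefix, number, version = match.groups()
    collection = "GenBank" if prefix == "GCA" else "RefSeq"
    return collection, number, int(version)
```

For example, `classify_assembly("GCA_000000000.1")` reports a GenBank assembly at version 1, which is handy for recording exactly which files went into an analysis.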

ADD REPLY • link 6.4 years ago by Biogeek ▴ 480

0

If there is a RefSeq version then you should use that. RefSeq entries are manually curated and should represent the best possible information available.

ADD REPLY • link 6.4 years ago by GenoMax 150k

0

genomax, I agree, BUT what if this species has many clade-specific genes? Surely RefSeq will become a limiting database/step?

ADD REPLY • link 6.4 years ago by Biogeek ▴ 480

0

If that is the case, do you think even the GenBank record is going to be sufficient? You may then have to do a de novo assembly of anything from your sample that does not map to the GenBank/RefSeq assembly, to see if there are real/additional genes in your genome.
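To make the "anything that does not map" step concrete, here is a rough sketch that pulls unmapped reads out of a text SAM file and writes them as FASTQ for a de novo assembler. It relies only on the SAM flag bit 0x4 (segment unmapped); the function names are hypothetical, and it deliberately ignores details such as keeping read pairs together. In practice a dedicated tool (for example `samtools view -f 4`) would usually do this on BAM files.

```python
UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped"


def unmapped_reads(sam_lines):
    """Yield (name, sequence, quality) for each unmapped record in SAM text lines."""
    for line in sam_lines:
        # Header lines start with '@'.
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        # A SAM alignment line has at least 11 mandatory fields.
        if len(fields) < 11:
            continue
        name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
        if flag & UNMAPPED:
            yield name, seq, qual


def to_fastq(records):
    """Format (name, sequence, quality) tuples as FASTQ text."""
    return "".join(f"@{name}\n{seq}\n+\n{qual}\n" for name, seq, qual in records)
```

The FASTQ text returned by `to_fastq(unmapped_reads(handle))` can then be written to a file and handed to whichever assembler you prefer.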

ADD REPLY • link 6.4 years ago by GenoMax 150k

1

It is not a 'noob question'; it is a real problem. It's good that you realise that. Most people do not pay attention to it. They don't realise that different annotations exist and just take the default one, which is the NCBI one.

That being said, it's really hard to compare annotations and say which one is the best. You must be an expert in the annotation domain to understand in detail what has been done by the group that published the annotation (if they describe it in sufficient detail), how the NCBI pipeline works, and what data it used. Even knowing the pipelines/approaches in detail, in some cases it can still be very hard to guess which annotation is the best. In some cases NCBI is better, in others not. A colleague of mine was part of a project where they did a huge amount of work to manually check all the genes of a yeast annotation one by one (consortium/jamboree), but once it was submitted, NCBI did an automatic annotation. In the end, people use the NCBI annotation, while the annotation that was manually curated is much more trustworthy.

I would say that usually the published version is slightly better, because experts on the species or on specific gene families have often looked at the annotation in detail and provided feedback, while NCBI does this in a more automated way that may not take some peculiarities into account.

On the other hand, the first genome version could also have been done by a freshly recruited Ph.D. who launched a pipeline without knowing what he was really doing; the result was good enough to answer their scientific question and to publish a paper. But the annotation could be worse than one done by a good pipeline such as the one used at NCBI.

ADD REPLY • link 6.4 years ago by Juke34 9.2k

0

I would say that usually the published version is slightly better, because experts on the species or on specific gene families have often looked at the annotation in detail and provided feedback, while NCBI does this in a more automated way that may not take some peculiarities into account.

If the genome was sequenced by a consortium of labs (or at least more than one lab that works on the organism) then that may be true. But as you rightly point out, if it was done by

a freshly recruited Ph.D. who launched a pipeline without knowing what he was really doing

then NCBI's version may be better (since at some level someone must have looked at the results before releasing them into the database).

ADD REPLY • link 6.4 years ago by GenoMax 150k

0

Dear @Juke-34, I know the correct RefSeq file is cds_from_genomic.fna.gz, but if there is no RefSeq version, what is the preferable file with nucleotide CDS from the GenBank assembly? Just the _genomic.fna.gz one? Thanks

ADD REPLY • link 5.5 years ago by dr3lostorage • 0