RefSeq and Genbank
3.0 years ago
Julia • 0

What is the difference between GenBank and RefSeq?

refseq genbank • 3.4k views
3.0 years ago
GenoMax 147k

GenBank - https://www.ncbi.nlm.nih.gov/genbank/

GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences.

RefSeq - https://www.ncbi.nlm.nih.gov/refseq/about/

The Reference Sequence (RefSeq) collection provides a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins.

There is an FAQ entry that directly answers the parent question of this thread. It can be found here.


"The Reference Sequence (RefSeq) collection provides a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins."

I think the above is quite buzzword-heavy and does not explain what RefSeq actually is. What do "comprehensive" and "integrated" really mean? I am not sure.

In a nutshell, I believe that RefSeq curators take entries submitted to GenBank and designate one GenBank entry as the representative sequence for each organism (this is the non-redundant part). They probably pick a representative sequence that is well annotated. But frankly, I could be wrong; I am assuming this is what happens.

For example, for the millions of SARS-CoV-2 sequences out there, there is only one "reference" sequence, and it has both a GenBank ID and a RefSeq ID for the exact same sequence.
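As a rough illustration of the two kinds of IDs, the accession prefix usually tells you which database a record belongs to. The sketch below assumes the usual NCBI conventions: RefSeq sequence accessions start with two letters and an underscore, and assemblies use GCA_ for GenBank and GCF_ for RefSeq. `classify_accession` is a hypothetical helper written for this post, not an NCBI tool, and NC_045512.2 / MN908947.3 are (to my knowledge) the RefSeq and GenBank accessions of the SARS-CoV-2 reference genome.

```shell
# Hypothetical helper: guess the source database from the accession prefix.
# This is a heuristic based on NCBI naming conventions, not a lookup.
classify_accession() {
    case "$1" in
        GCF_*)        echo "RefSeq assembly" ;;
        GCA_*)        echo "GenBank assembly" ;;
        [A-Z][A-Z]_*) echo "RefSeq sequence record" ;;
        *)            echo "GenBank sequence record" ;;
    esac
}

classify_accession NC_045512.2   # RefSeq-style prefix (two letters + underscore)
classify_accession MN908947.3    # GenBank-style accession (no underscore)
```

The prefix only tells you where a record lives; it says nothing about whether the underlying sequence is identical between the two records.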


The RefSeq curation process is described at the link above. The main point is that RefSeq records are manually curated. These records are owned by NCBI, so NCBI has complete control over their content, as opposed to GenBank entries, which are owned by their submitters (and these can and do contain errors at times).

RefSeq records are derived from publicly available sequence data; varying levels of validation, additional annotation, and manual curation are applied to the RefSeq record.

RefSeq genomes are a separate section. Prokaryotic RefSeq genomes are described on this page.

There are many GenBank genomes for SARS-CoV-2 but only a couple are designated as RefSeq (GCF* accession, listing truncated to save space):

$ esearch -db assembly -query "SARS-CoV-2" | esummary | xtract -pattern DocumentSummary -element Id,FtpPath_GenBank,FtpPath_RefSeq

15960888    ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/937/895/GCA_009937895.1_ASM993789v1
15960868    ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/937/885/GCA_009937885.1_ASM993788v1
15851418    ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/858/895/GCA_009858895.3_ASM985889v3  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3
15793598    ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/858/895/GCA_009858895.2_ASM985889v2
15778738    ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/858/895/GCA_009858895.1_ASM985889v1  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.1_ASM985889v1
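To pull out only the RefSeq-designated rows from a listing like the one above, a small awk filter is enough. This is a minimal sketch: it assumes the last whitespace-separated column holds the RefSeq FTP path when one exists, and `refseq_only` is a hypothetical helper name. The two sample rows are copied from the listing above.

```shell
# Hypothetical helper: keep only rows whose last column is a RefSeq (GCF) path,
# and print the assembly Id together with that path.
refseq_only() {
    awk 'index($NF, "/GCF/") { print $1 "\t" $NF }'
}

# Demo on two rows from the listing: one GenBank-only, one with a RefSeq path.
refseq_only <<'EOF'
15960868    ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/937/885/GCA_009937885.1_ASM993788v1
15851418    ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/858/895/GCA_009858895.3_ASM985889v3  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3
EOF
```

In practice you would append `| refseq_only` to the esearch/esummary/xtract pipeline instead of using the inline sample.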