Question

map refseq to identical genbank

0

5.1 years ago

cmo ▴ 90

Some databases provide Genbank coordinates. Others provide RefSeq coordinates. I am looking for a table of pairwise associations between identical Genbank and Refseq records.

The ultimate goal: given, say, a BED file with tens of thousands of annotations across thousands of GenBank genomes, I would like to replace the GenBank accession ID on each line with the identical RefSeq accession ID, provided a corresponding RefSeq ID exists.

Yes, RefSeq is a curated subset of GenBank that has been copied, so the records are technically distinct. However, the RefSeq genome pages on NCBI provide a link to "Identical Genbank Sequence". For example, the RefSeq genome page for E. coli MG1655 provides a link to "Identical Genbank Sequence".

ncbi refseq genbank assembly • 2.9k views

ADD COMMENT • link 5.1 years ago by cmo ▴ 90

2

5.1 years ago

GenoMax 147k

Using Entrez Direct:

$ esearch -db nuccore -query "NC_000913" | efetch -format docsum | xtract -pattern DocumentSummary -element Caption Title AssemblyAcc
NC_000913       Escherichia coli str. K-12 substr. MG1655, complete genome      U00096

ADD COMMENT • link 5.1 years ago by GenoMax 147k

0

good idea, but this will not scale to thousands of accessions, as indicated in the question.

ADD REPLY • link 5.1 years ago by cmo ▴ 90

1

5.1 years ago

ctseto ▴ 310

FastANI GenBank vs RefSeq? Though I imagine NCBI has the structured relationships encoded somewhere, which would save a bunch of computer cycles.

ADD COMMENT • link 5.1 years ago by ctseto ▴ 310

0

Yes, I imagined the relationship was encoded somewhere. Interesting idea, though.

ADD REPLY • link 5.1 years ago by cmo ▴ 90

0

Looks like NCBI already did this?

wget https://ftp.ncbi.nih.gov/genomes/ASSEMBLY_REPORTS/ANI_report_bacteria.txt

head ANI_report_bacteria.txt

genbank-accession     refseq-accession        annot-date      taxid   species-taxid   organism-name   species-name    assembly-name   ANI-species-name        ANI-type-assembly       ANI-type-category       Typestrain-ANI  ANI-QCoverage   ANI-SCoverage   ANI-status      Submitted-species-name  Submitted-type-assembly Submitted-type-category Submitted-ANI   Submitted-QCoverage     Submitted-SCoverage     contig-count    genome-length   contig-N50      contig-L50      species-asm-count       species-avg-cds-count

GCA_000006625.1 GCF_000006625.1 2017/04/06      273119  134821  Ureaplasma parvum serovar 3 str. ATCC 700970    Ureaplasma parvum       ASM662v1        Ureaplasma parvum       GCA_000019345.1 type    99.9918 99.99   99.99   species-match   Ureaplasma parvum serovar 3 str. ATCC 700970    GCA_000019345.1 type    99.9918 99.99   99.99   1.00    751719.00       751719  1       13      590.308

Looks like columns 1 and 2 are the GenBank and RefSeq assembly accessions, which might be what you need?

wget https://ftp.ncbi.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt

head assembly_summary_genbank.txt

See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
assembly_accession    bioproject      biosample       wgs_master      refseq_category taxid   species_taxid   organism_name   infraspecific_name      isolate version_status  assembly_level  release_type    genome_rep   seq_rel_date    asm_name        submitter       gbrs_paired_asm paired_asm_comp ftp_path        excluded_from_refseq    relation_to_type_material
GCA_000001215.4 PRJNA13812      SAMN02803731            reference genome        7227    7227    Drosophila melanogaster                 latest  Chromosome      Major   Full    2014/08/01      Release 6 plus ISO1 MT       The FlyBase Consortium/Berkeley Drosophila Genome Project/Celera Genomics       *GCF_000001215.4* identical       ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/215/GCA_000001215.4_Release_6_plus_ISO1_MT

GCA_000001405.28        PRJNA31257                      reference genome        9606    9606    Homo sapiens                    latest  Chromosome      Patch   Full    2019/02/28      GRCh38.p13      Genome Reference Consortium  *GCF_000001405.39*        different       ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.28_GRCh38.p13

ADD REPLY • link updated 5.1 years ago by GenoMax 147k • written 5.1 years ago by ctseto ▴ 310

1

5.1 years ago

vkkodali_ncbi ★ 3.8k

For the "Identical Genbank Sequence" link, you can use edirect as shown below:

elink -db nucleotide -id NC_000913.3 -target nucleotide -name nuccore_nuccore_rsgb | efetch -format acc
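
If the list is in the low thousands, one way to run this per accession is a plain shell loop. This is only a sketch: the function name `refseq_to_genbank` and the `PAUSE` variable are made up for illustration, the one-second pause is a conservative guess at staying within NCBI's request limits, and the `< /dev/null` is there because EDirect commands may otherwise consume the accession list from standard input.

```shell
# Sketch: print "RefSeq<TAB>GenBank" for each RefSeq accession read from stdin.
# refseq_to_genbank and PAUSE are illustrative names, not part of EDirect.
refseq_to_genbank() {
    pause="${PAUSE:-1}"   # seconds to wait between accessions (assumed value)
    while IFS= read -r acc || [ -n "$acc" ]; do
        [ -n "$acc" ] || continue   # skip blank lines
        # Same elink/efetch call as above; stdin is detached so the loop's
        # input is not swallowed by the EDirect command.
        gb=$(elink -db nucleotide -id "$acc" -target nucleotide \
                   -name nuccore_nuccore_rsgb < /dev/null |
             efetch -format acc)
        # Report "na" when no identical GenBank record comes back.
        printf '%s\t%s\n' "$acc" "${gb:-na}"
        sleep "$pause"
    done
}
```

For example, `refseq_to_genbank < refseq_accessions.txt > refseq_genbank.tsv`, where `refseq_accessions.txt` is a hypothetical file with one RefSeq accession per line.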

ADD COMMENT • link 5.1 years ago by vkkodali_ncbi ★ 3.8k

0

good idea, but this will not scale to thousands of accessions, as indicated in the question.

ADD REPLY • link 5.1 years ago by cmo ▴ 90

1

This should be fine for a few thousand accessions. What is the scale here? Tens of thousands? Hundreds of thousands? And scope? Bacteria only, higher eukaryotes, etc?

The NCBI Genomes FTP path has an assembly_report.txt file for each RefSeq assembly that contains RefSeq and GenBank mapping. It may make more sense to download all of the assembly_report.txt files first from FTP, concatenate them and make your own mapping database locally.
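
A rough sketch of that local mapping, in case it helps. It assumes each assembly_report.txt is tab-delimited with `#`-prefixed comment and header lines, and that columns 5, 6 and 7 are GenBank-Accn, Relationship and RefSeq-Accn, with `=` in the Relationship column marking identical sequences; check that against the header of a real report before relying on it. The function names are only illustrative.

```shell
# Sketch: build "GenBank<TAB>RefSeq" pairs from one or more assembly_report.txt
# files. Column positions (5, 6, 7) and the "=" marker are assumptions.
build_accession_map() {
    awk -F '\t' '
        /^#/ { next }                                   # skip comments and header
        $6 == "=" && $5 != "na" && $7 != "na" { print $5 "\t" $7 }
    ' "$@" | sort -u
}

# Sketch: rewrite column 1 of a BED file using the map built above.
# Lines whose accession has no identical RefSeq partner pass through unchanged.
bed_genbank_to_refseq() {
    map="$1"
    bed="$2"
    awk -F '\t' -v OFS='\t' '
        FILENAME == ARGV[1] { refseq[$1] = $2; next }   # first file: load the map
        ($1 in refseq)      { $1 = refseq[$1] }         # BED: swap paired accessions
        { print }
    ' "$map" "$bed"
}
```

Usage would be something like `build_accession_map reports/*_assembly_report.txt > gb2rs.tsv` followed by `bed_genbank_to_refseq gb2rs.tsv annotations.bed > annotations.refseq.bed` (all file names hypothetical).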

ADD REPLY • link 5.1 years ago by vkkodali_ncbi ★ 3.8k

score 3 · Accepted Answer · 2019-10-15

This information is in the /ASSEMBLY_REPORTS/ directory on the Genomes FTP site:

ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/

I contacted NCBI, and NLM Support (nlm-support@nlm.nih.gov) provided the following answer:

As you will learn in the README file, columns 18 and 19 in the assembly_summary files will give you such pairing:

Column 18: "gbrs_paired_asm" GenBank/RefSeq paired assembly: the accession.version of the GenBank assembly that is paired to the given RefSeq assembly, or vice-versa. "na" is reported if the assembly is unpaired.

Column 19: "paired_asm_comp" Paired assembly comparison: whether the paired GenBank & RefSeq assemblies are identical or different. Values:

identical - GenBank and RefSeq assemblies are identical
different - GenBank and RefSeq assemblies are not identical
na - not applicable since the assembly is unpaired

And it actually works.
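
For anyone landing here later, a minimal sketch of pulling the identical pairs out of a summary file using those two columns. The function name is made up; it assumes the file is tab-delimited and that column 1 is the assembly accession, as in the header shown earlier in the thread.

```shell
# Sketch: print "assembly<TAB>paired assembly" for every row of an
# assembly_summary file whose paired_asm_comp (column 19) is "identical".
# identical_assembly_pairs is an illustrative name.
identical_assembly_pairs() {
    awk -F '\t' '
        /^#/ { next }                                   # skip comment lines
        $19 == "identical" && $18 != "na" { print $1 "\t" $18 }
    ' "$@"
}
```

Usage would be something like `identical_assembly_pairs assembly_summary_genbank.txt > gca_gcf_identical.tsv` (output file name hypothetical). Note that these are assembly accessions (GCA_/GCF_), not the sequence accessions that appear in a BED file.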