map refseq to identical genbank
4
0
5.2 years ago
cmo ▴ 90

Some databases provide GenBank coordinates. Others provide RefSeq coordinates. I am looking for a table of pairwise associations between identical GenBank and RefSeq records.

The ultimate goal is: if I have e.g. a BED file with tens of thousands of annotations across thousands of GenBank genomes, I would like to replace the GenBank accession ID with the identical RefSeq accession ID on each line of the BED file, provided such a corresponding RefSeq ID exists.

Yes, RefSeq is a curated subset of GenBank whose records have been copied, so the records are technically distinct. However, the RefSeq genome pages on NCBI provide a link to "Identical Genbank Sequence". For example, the RefSeq genome page for E. coli MG1655 provides such a link.

ncbi refseq genbank assembly • 2.9k views
ADD COMMENT
3
5.1 years ago
cmo ▴ 90

This information is in the /ASSEMBLY_REPORTS/ directory on the Genomes FTP site:

ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/

I contacted NCBI, and NLM Support (nlm-support@nlm.nih.gov) provided the following answer:

As you will learn in the README file, columns 18 and 19 in the assembly_summary files will give you such pairing:

Column 18: "gbrs_paired_asm" GenBank/RefSeq paired assembly: the accession.version of the GenBank assembly that is paired to the given RefSeq assembly, or vice-versa. "na" is reported if the assembly is unpaired.

Column 19: "paired_asm_comp" Paired assembly comparison: whether the paired GenBank & RefSeq assemblies are identical or different. Values:

identical - GenBank and RefSeq assemblies are identical
different - GenBank and RefSeq assemblies are not identical
na - not applicable since the assembly is unpaired

And it actually works.
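
For anyone who wants to script this, here is a minimal Python sketch of how columns 18 and 19 could be turned into a GenBank-to-RefSeq assembly lookup. The function name `paired_identical_assemblies` and the `make_row` helper are hypothetical, and the sample rows are placeholder lines that carry only the three fields of interest. It assumes the file is tab-separated with comment lines starting with `#`. Note that this pairs assembly accessions (GCA_/GCF_), not individual sequence accessions.

```python
def paired_identical_assemblies(lines):
    """Yield (genbank_assembly, refseq_assembly) for pairs flagged 'identical'.

    `lines` is any iterable of lines from an assembly_summary file
    (GenBank or RefSeq flavour).
    """
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip README pointer, commented header, blank lines
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 19:
            continue  # malformed or truncated row
        accession = fields[0]    # column 1: assembly_accession
        paired = fields[17]      # column 18: gbrs_paired_asm
        comparison = fields[18]  # column 19: paired_asm_comp
        if paired == "na" or comparison != "identical":
            continue
        if accession.startswith("GCA_"):
            yield accession, paired
        else:
            yield paired, accession


def make_row(accession, paired, comparison):
    """Build a placeholder 22-column row (hypothetical helper for the demo)."""
    fields = ["."] * 22
    fields[0], fields[17], fields[18] = accession, paired, comparison
    return "\t".join(fields)


sample = [
    "# See the README for a description of the columns in this file.",
    make_row("GCA_000001215.4", "GCF_000001215.4", "identical"),
    make_row("GCA_000001405.28", "GCF_000001405.39", "different"),
]
print(dict(paired_identical_assemblies(sample)))
# {'GCA_000001215.4': 'GCF_000001215.4'}
```

With a downloaded file you would pass an open file handle instead of `sample`, e.g. `open("assembly_summary_genbank.txt")`.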

ADD COMMENT
2
5.2 years ago
GenoMax 147k

Using Entrez Direct:

$ esearch -db nuccore -query "NC_000913" | efetch -format docsum | xtract -pattern DocumentSummary -element Caption Title AssemblyAcc
NC_000913       Escherichia coli str. K-12 substr. MG1655, complete genome      U00096
ADD COMMENT
0

Good idea, but this will not scale to thousands of accessions, as indicated in the question.

ADD REPLY
1
5.2 years ago
ctseto ▴ 310

FastANI GenBank vs RefSeq? Though I imagine NCBI has the structured relationships encoded somewhere, which would save a bunch of computer cycles.

ADD COMMENT
0

Yes, I imagined the relationship is encoded somewhere. Interesting idea, though.

ADD REPLY
0

Looks like NCBI already did this?

wget https://ftp.ncbi.nih.gov/genomes/ASSEMBLY_REPORTS/ANI_report_bacteria.txt

head ANI_report_bacteria.txt

genbank-accession     refseq-accession        annot-date      taxid   species-taxid   organism-name   species-name    assembly-name   ANI-species-name        ANI-type-assembly       ANI-type-category       Typestrain-ANI  ANI-QCoverage   ANI-SCoverage   ANI-status      Submitted-species-name  Submitted-type-assembly Submitted-type-category Submitted-ANI   Submitted-QCoverage     Submitted-SCoverage     contig-count    genome-length   contig-N50      contig-L50      species-asm-count       species-avg-cds-count

GCA_000006625.1 GCF_000006625.1 2017/04/06      273119  134821  Ureaplasma parvum serovar 3 str. ATCC 700970    Ureaplasma parvum       ASM662v1        Ureaplasma parvum       GCA_000019345.1 type    99.9918 99.99   99.99   species-match   Ureaplasma parvum serovar 3 str. ATCC 700970    GCA_000019345.1 type    99.9918 99.99   99.99   1.00    751719.00       751719  1       13      590.308

Looks like columns 1 and 2 are the GenBank and RefSeq assembly accessions, which might be what you need?

wget https://ftp.ncbi.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt

head assembly_summary_genbank.txt

See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
assembly_accession    bioproject      biosample       wgs_master      refseq_category taxid   species_taxid   organism_name   infraspecific_name      isolate version_status  assembly_level  release_type    genome_rep   seq_rel_date    asm_name        submitter       gbrs_paired_asm paired_asm_comp ftp_path        excluded_from_refseq    relation_to_type_material
GCA_000001215.4 PRJNA13812      SAMN02803731            reference genome        7227    7227    Drosophila melanogaster                 latest  Chromosome      Major   Full    2014/08/01      Release 6 plus ISO1 MT       The FlyBase Consortium/Berkeley Drosophila Genome Project/Celera Genomics       *GCF_000001215.4* identical       ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/215/GCA_000001215.4_Release_6_plus_ISO1_MT

GCA_000001405.28        PRJNA31257                      reference genome        9606    9606    Homo sapiens                    latest  Chromosome      Patch   Full    2019/02/28      GRCh38.p13      Genome Reference Consortium  *GCF_000001405.39*        different       ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.28_GRCh38.p13
ADD REPLY
1
5.2 years ago
vkkodali_ncbi ★ 3.8k

For the "Identical Genbank Sequence" link, you can use edirect as shown below:

elink -db nucleotide -id NC_000913.3 -target nucleotide -name nuccore_nuccore_rsgb | efetch -format acc
ADD COMMENT
0

Good idea, but this will not scale to thousands of accessions, as indicated in the question.

ADD REPLY
1

This should be fine for a few thousand accessions. What is the scale here? Tens of thousands? Hundreds of thousands? And scope? Bacteria only, higher eukaryotes, etc?

The NCBI Genomes FTP path has an assembly_report.txt file for each RefSeq assembly that contains RefSeq and GenBank mapping. It may make more sense to download all of the assembly_report.txt files first from FTP, concatenate them and make your own mapping database locally.
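
A rough Python sketch of that local mapping approach follows. It assumes the usual tab-separated assembly_report.txt layout, in which columns 5, 6 and 7 are GenBank-Accn, Relationship and RefSeq-Accn and a Relationship of `=` marks an identical pair. It also assumes the BED file's first column holds versioned GenBank sequence accessions. The function names and the sample rows (including the made-up `XX000001.1` record) are illustrative only.

```python
def genbank_to_refseq(report_lines):
    """Map GenBank sequence accessions to identical RefSeq accessions.

    `report_lines` is an iterable of lines from one or more
    concatenated assembly_report.txt files.
    """
    mapping = {}
    for line in report_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip report header/comment lines and blanks
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 7:
            continue
        genbank, relationship, refseq = fields[4], fields[5], fields[6]
        # keep only pairs the report marks as identical
        if relationship == "=" and genbank != "na" and refseq != "na":
            mapping[genbank] = refseq
    return mapping


def rewrite_bed(bed_lines, mapping):
    """Yield BED lines with column 1 swapped to RefSeq where a mapping exists."""
    for line in bed_lines:
        if line.startswith(("#", "track", "browser")) or not line.strip():
            yield line  # pass header and comment lines through untouched
            continue
        chrom, sep, rest = line.partition("\t")
        yield mapping.get(chrom, chrom) + sep + rest


# Illustrative rows in the assembly_report layout (not copied from a real file)
report = [
    "# Sequence-Name\tSequence-Role\tAssigned-Molecule\t"
    "Assigned-Molecule-Location/Type\tGenBank-Accn\tRelationship\t"
    "RefSeq-Accn\tAssembly-Unit\tSequence-Length\tUCSC-style-name\n",
    "chr\tassembled-molecule\tna\tChromosome\tU00096.3\t=\tNC_000913.3\t"
    "Primary Assembly\t4641652\tna\n",
    "pX\tassembled-molecule\tpX\tPlasmid\tXX000001.1\t<>\tna\t"
    "Primary Assembly\t1000\tna\n",
]
bed = ["U00096.3\t100\t200\tgeneA\n", "XX000001.1\t5\t50\tgeneB\n"]

mapping = genbank_to_refseq(report)
for out_line in rewrite_bed(bed, mapping):
    print(out_line, end="")
```

Lines whose accession has no identical RefSeq counterpart are left unchanged, which matches the "provided such a corresponding RefSeq ID exists" condition in the question.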

ADD REPLY
