Some databases provide GenBank coordinates; others provide RefSeq coordinates. I am looking for a table of pairwise associations between identical GenBank and RefSeq records.
The ultimate goal: given, for example, a BED file with tens of thousands of annotations across thousands of GenBank genomes, I would like to replace the GenBank accession ID on each line of the BED file with the identical RefSeq accession ID, provided such a corresponding RefSeq ID exists.
Yes, RefSeq is a curated subset of GenBank whose records have been copied, so the records are technically distinct.
However, the RefSeq genome pages on NCBI provide a link to "Identical Genbank Sequence".
For example, the RefSeq genome page for E. coli MG1655 provides a link to "Identical Genbank Sequence".
I contacted NCBI, and NLM Support (nlm-support@nlm.nih.gov) provided the following answer:
As you will learn in the README file, columns 18 and 19 in the
assembly_summary files will give you such pairing:
Column 18: "gbrs_paired_asm" GenBank/RefSeq paired assembly: the
accession.version of the GenBank assembly that is paired to the
given RefSeq assembly, or vice-versa. "na" is reported if the
assembly is unpaired.
Column 19: "paired_asm_comp" Paired assembly comparison: whether
the paired GenBank & RefSeq assemblies are identical or different.
Values:
identical - GenBank and RefSeq assemblies are identical
different - GenBank and RefSeq assemblies are not identical
na - not applicable since the assembly is unpaired
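A minimal sketch of how the pairing described above could be pulled out of an assembly_summary file. It assumes the file is tab-separated, that comment and header lines start with "#", and that the assembly accession is in column 1; the positions of columns 18 and 19 are taken from the quoted reply. The function name identical_pairs is made up for illustration.

```python
from typing import Dict, Iterable


def identical_pairs(lines: Iterable[str]) -> Dict[str, str]:
    """Map each assembly accession to its paired accession,
    keeping only pairs that are reported as identical."""
    pairs: Dict[str, str] = {}
    for line in lines:
        # Assumption: header and comment lines start with "#".
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 19:
            continue  # row too short to hold columns 18 and 19
        accession = fields[0]    # column 1 (assumed): assembly accession
        paired = fields[17]      # column 18: gbrs_paired_asm
        comparison = fields[18]  # column 19: paired_asm_comp
        if paired != "na" and comparison == "identical":
            pairs[accession] = paired
    return pairs
```

Called on an open file handle, e.g. identical_pairs(open("assembly_summary_refseq.txt")), this returns a dictionary at the assembly level. Note that it pairs assemblies, not the individual sequence records that appear in a BED file.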
This should be fine for a few thousand accessions. What is the scale here? Tens of thousands? Hundreds of thousands? And scope? Bacteria only, higher eukaryotes, etc?
The NCBI Genomes FTP path has an assembly_report.txt file for each RefSeq assembly that contains RefSeq and GenBank mapping. It may make more sense to download all of the assembly_report.txt files first from FTP, concatenate them and make your own mapping database locally.
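A rough sketch of the local mapping idea, assuming the assembly_report.txt files have already been downloaded and concatenated. It assumes each data row is tab-separated with GenBank-Accn, Relationship and RefSeq-Accn in columns 5 to 7, that "=" in the Relationship column marks identical sequences, and that comment lines start with "#"; check these against the header of your own files. The function names genbank_to_refseq and relabel_bed are hypothetical.

```python
from typing import Dict, Iterable, Iterator


def genbank_to_refseq(report_lines: Iterable[str]) -> Dict[str, str]:
    """Build a GenBank -> RefSeq sequence accession map from the
    lines of one or more concatenated assembly_report.txt files."""
    mapping: Dict[str, str] = {}
    for line in report_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 7:
            continue
        # Assumed layout: columns 5, 6, 7 of each data row.
        genbank, relationship, refseq = fields[4], fields[5], fields[6]
        # Keep only pairs flagged as identical ("=").
        if relationship == "=" and genbank != "na" and refseq != "na":
            mapping[genbank] = refseq
    return mapping


def relabel_bed(bed_lines: Iterable[str],
                mapping: Dict[str, str]) -> Iterator[str]:
    """Yield BED lines with the first column swapped to the RefSeq
    accession when one is known; other lines pass through unchanged."""
    for line in bed_lines:
        chrom, sep, rest = line.partition("\t")
        yield mapping.get(chrom, chrom) + sep + rest
```

Because the lookup is a plain dictionary, rewriting a BED file is a single pass over its lines once the mapping has been built.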
Good idea, but this will not scale to the thousands of accessions indicated in the question.