How to get the chromosome numbers from RefSeq accession IDs?
8.2 years ago

I have an array of RefSeq accession IDs, which looks like the following:

NC_000001.11 NC_000002.12 NC_000003.12 NC_000004.12 NC_000005.10 NC_000006.12 NC_000007.14 NC_000008.11 NC_000009.12 NC_000010.11 NC_000011.10 . . .

I would like to know which chromosomes they refer to. Is there a way to retrieve this information automatically?

RefSeq • 7.6k views

Definitely not a good solution, but you can upload the IDs to Batch Entrez (http://www.ncbi.nlm.nih.gov/sites/batchentrez) and get the summary as an output text file, which gives the chromosome name.


See my comment on your previous question, "CGAT: Error while running gtf2gtf".

8.2 years ago

It's not ideal, but you can download the assembly report, which has these IDs and the associated chromosome names (in UCSC and Ensembl nomenclatures). I have to do this sort of thing when I update the chromosome name conversion tables :(
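As a rough illustration of that approach, here is a minimal Python sketch that builds a RefSeq-accession-to-chromosome lookup from the text of an assembly report. It assumes the usual assembly_report.txt layout: comment lines starting with `#`, the last of which names the tab-separated columns (including `Assigned-Molecule`, `RefSeq-Accn` and `UCSC-style-name`). The function name and the two-row sample are made up for the example; check the column names against the report you actually download.

```python
# Two-row excerpt in the assembly_report.txt layout (illustrative only;
# a real report has many more comment lines and rows).
SAMPLE_REPORT = (
    "# (other header lines omitted)\n"
    "# Sequence-Name\tSequence-Role\tAssigned-Molecule\t"
    "Assigned-Molecule-Location/Type\tGenBank-Accn\tRelationship\t"
    "RefSeq-Accn\tAssembly-Unit\tSequence-Length\tUCSC-style-name\n"
    "1\tassembled-molecule\t1\tChromosome\tCM000663.2\t=\t"
    "NC_000001.11\tPrimary Assembly\t248956422\tchr1\n"
    "2\tassembled-molecule\t2\tChromosome\tCM000664.2\t=\t"
    "NC_000002.12\tPrimary Assembly\t242193529\tchr2\n"
)


def refseq_to_chromosome(report_text):
    """Map each RefSeq accession to (assigned molecule, UCSC-style name)."""
    mapping = {}
    header = None
    for line in report_text.splitlines():
        if line.startswith("#"):
            # Assumption: the last comment line before the data holds the column names.
            header = line.lstrip("# ").split("\t")
            continue
        if not line.strip() or header is None:
            continue
        row = dict(zip(header, line.split("\t")))
        mapping[row["RefSeq-Accn"]] = (row["Assigned-Molecule"], row["UCSC-style-name"])
    return mapping


chromosomes = refseq_to_chromosome(SAMPLE_REPORT)
for accession in ["NC_000001.11", "NC_000002.12"]:
    print(accession, *chromosomes[accession])
```

With a real file, something like `refseq_to_chromosome(open(path).read())` should work in the same way, where `path` points to the downloaded report.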

5.4 years ago

You can find the chromosomes for the alternative accession numbers (NT_... / NW_...) in this directory.
Download the following files:
1. alts_accessions_GRCh38.p12
2. chr_NC_gi
3. chr_accessions_GRCh38.p12
4. unplaced_accessions_GRCh38.p12
5. unlocalized_accessions_GRCh38.p12

Once you download them, you might be prompted for a 'Keychain Access' password. The workaround I found is to convert the downloaded file to '.txt' format, after which you can view what's inside it.

An extract from the file is given below:

Chromosome    RefSeq Accession.version
1             NW_012132914.1
1             NW_015495298.1
9             NW_009646201.1
10            NW_011332692.1
11            NW_015148966.1

Reference: This article.
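For anyone who wants to script the lookup, here is a small Python sketch that turns a two-column table like the extract above into an accession-to-chromosome dictionary. It assumes whitespace-separated columns with the chromosome first and the RefSeq accession second, as in the extract; the downloaded files may contain further columns, which this sketch ignores. The function name is hypothetical.

```python
# Rows copied from the extract above.
EXTRACT = """\
Chromosome RefSeq Accession.version
1 NW_012132914.1
1 NW_015495298.1
9 NW_009646201.1
10 NW_011332692.1
11 NW_015148966.1
"""


def load_accession_table(text):
    """Return {RefSeq accession: chromosome} from a chromosome/accession table."""
    table = {}
    for line in text.splitlines():
        # Skip blank lines and the header (assumed to start with '#' or 'Chromosome').
        if not line.strip() or line.startswith(("#", "Chromosome")):
            continue
        fields = line.split()
        chromosome, accession = fields[0], fields[1]
        table[accession] = chromosome
    return table


table = load_accession_table(EXTRACT)
print(table["NW_009646201.1"])
```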

5.1 years ago
vkkodali_ncbi ★ 3.8k

See this post: A: How to obtain the chromosome out of an accession number? In short, NCBI RefSeq provides assembly_report.txt files with each genome assembly that has the mapping information in a tab-delimited table. That would be the most up-to-date source for this sort of information.


I think you mean the sequence_report.jsonl file:

dataformat tsv genome-seq --inputfile sequence_report.jsonl | cut -f4,10 | head -5

Output:

Chromosome name    RefSeq seq accession
1                  NC_010443.5
2                  NC_010444.4
3                  NC_010445.4
4                  NC_010446.5
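If you want the mapping in a script rather than on the terminal, the two-column output above can be loaded into a dictionary. This is only a sketch: it assumes the output is tab-separated with the header names shown above ("Chromosome name" and "RefSeq seq accession"), and the function name is made up for the example.

```python
import csv
import io

# The output shown above, written out as tab-separated text.
TSV_OUTPUT = (
    "Chromosome name\tRefSeq seq accession\n"
    "1\tNC_010443.5\n"
    "2\tNC_010444.4\n"
    "3\tNC_010445.4\n"
    "4\tNC_010446.5\n"
)


def chromosome_lookup(tsv_text):
    """Return {RefSeq accession: chromosome name} from the two-column TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["RefSeq seq accession"]: row["Chromosome name"] for row in reader}


lookup = chromosome_lookup(TSV_OUTPUT)
print(lookup["NC_010445.4"])
```

In practice you would redirect the `cut` output (without `head`) to a file and pass its contents to the function.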
