Question

How to obtain the chromosome from an accession number?

6.6 years ago

eidriangm ▴ 60

Hello Community.

My problem is the following: I have some BED files whose genomic regions are annotated by chromosome (chr__ start end ... ...), and I want to use the NCBI GFF3 to extract the info, but this file is annotated with accession.version numbers. Bedtools requires both files to use the same sequence naming, so I need to convert the accessions to chr-based names.

So far I know that the number in the "NC_"-prefixed accession IDs specifies the chromosome (e.g. NC_000001.11: chr1, NC_000002.12: chr2, ..., NC_000023.11: chrX, NC_000024.10: chrY, NC_012920.1: chrM). Nevertheless, how can I find out the chromosome for accessions prefixed with NW_ or NT_?

Some "NT_"/"NW_" accessions are alternative assemblies of an NC_ sequence and contain "the same" info, placed a few lines below that NC_ record. Others are not, and they contain genes of interest that I could be losing when using bedtools, e.g. https://www.ncbi.nlm.nih.gov/gene/3806. Some do not have a known location, but that gene is known to be on chromosome 19, and I cannot deduce this from its accession number.

Is there a way of getting the chromosome from the accession number? Or shall I extract the info from another annotation file?

Thanks

genome refseq chromosome accession ncbi • 9.0k views

ADD COMMENT • link updated 5.1 years ago by vkkodali_ncbi ★ 3.8k • written 6.6 years ago by eidriangm ▴ 60

Have you tried the ways of linking chromosomes to accession numbers mentioned in this post: How to get the chromosome numbers from RefSeq accession IDs?

ADD REPLY • link 6.6 years ago by Sej Modha 5.3k

I saw it, but none of the links provided there work, and the answer with awk + sed only applies to NC_ accessions (already under control). Thanks anyway.

ADD REPLY • link 6.6 years ago by eidriangm ▴ 60

You may want to give some example data and expected output.

ADD REPLY • link 6.6 years ago by cpad0112 21k

Well, that is already given in the question: Entrez gene ID 3806 is annotated on the accession NT_113949, and I want to obtain its chromosome, which is 19. I could look for more examples, but the idea is basically that: from an accession number prefixed with NT_ or NW_, obtain its chromosome if it is known.

ADD REPLY • link 6.5 years ago by eidriangm ▴ 60

http://gtamazian.blogspot.com/2013/08/converting-chromosome-accession-numbers.html

ADD REPLY • link 5.4 years ago by srijan.verma44 • 0

ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/Assembled_chromosomes/

ADD REPLY • link 5.4 years ago by srijan.verma44 • 0

score 0 · Answer 1 · 2019-07-04

You can find the chromosomes for the alternative accession numbers (NT_... / NW_...) in this directory.
Download the following files:
1. alts_accessions_GRCh38.p12
2. chr_NC_gi
3. chr_accessions_GRCh38.p12
4. unplaced_accessions_GRCh38.p12
5. unlocalized_accessions_GRCh38.p12

Once you download them, you might be prompted to enter a 'Keychain Access' password. The workaround I found is to convert the downloaded file to '.txt' format, after which you can view what's inside the file.

An extract from the file is given below :

Chromosome  RefSeq Accession.version
1           NW_012132914.1
1           NW_015495298.1
9           NW_009646201.1
10          NW_011332692.1
11          NW_015148966.1

Reference: This article.
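
A minimal sketch of how such a table could be turned into a lookup, assuming the files are whitespace-delimited with the chromosome in the first column and the RefSeq accession.version in the second, as in the extract above. The function name `load_accession_table` is made up for illustration, and the sample rows are the ones from the extract. Note that this gives the parent chromosome of each scaffold, not a UCSC-style sequence name.

```python
def load_accession_table(lines, prefix="chr"):
    """Map RefSeq accession.version -> chromosome name.

    Assumes (not verified against every file) that each data row is
    whitespace-delimited: column 1 = chromosome, column 2 = RefSeq acc.ver.
    """
    acc_to_chrom = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or malformed line
        # Skip the header row, whether or not it starts with '#'.
        if fields[0].startswith("#") or fields[0] == "Chromosome":
            continue
        chrom, refseq_acc = fields[0], fields[1]
        acc_to_chrom[refseq_acc] = prefix + chrom
    return acc_to_chrom


# Rows copied from the extract in the answer above.
sample = """Chromosome RefSeq Accession.version
1 NW_012132914.1
1 NW_015495298.1
9 NW_009646201.1
10 NW_011332692.1
11 NW_015148966.1
""".splitlines()

lookup = load_accession_table(sample)
print(lookup["NW_009646201.1"])
```

To use it on a downloaded file, pass an open file handle instead of `sample`, for example `load_accession_table(open(path))`, where `path` points to one of the files listed above.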

score 0 · Answer 2 · 2019-07-04

Perhaps you could do it in R, using the rentrez package. Take a look here.

I'm doing something kinda similar, and it is possible to input those identifiers and ask for a summary (using the entrez_summary function). The chromosome number/name should appear in that summary.

Let me know if you need some more help.

Cheers,

score 0 · Answer 3 · 2019-10-23

An assembly_report.txt file accompanies each NCBI RefSeq genome assembly. You can download it either from the NCBI Assembly portal, by searching for the genome of interest and picking "Assembly structure report" from the big blue downloads button menu, or from the NCBI genomes FTP path for the assembly of interest.

For example, you can find the human assembly_report.txt file here: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_report.txt

This file has the following columns (each shown with its column number and an example value, here for chromosome 1):

                # Sequence-Name [  1]: 1
                  Sequence-Role [  2]: assembled-molecule
              Assigned-Molecule [  3]: 1
Assigned-Molecule-Location/Type [  4]: Chromosome
                   GenBank-Accn [  5]: CM000663.2
                   Relationship [  6]: =
                    RefSeq-Accn [  7]: NC_000001.11
                  Assembly-Unit [  8]: Primary Assembly
                Sequence-Length [  9]: 248956422
                UCSC-style-name [ 10]: chr1

You can use the data in columns 7 and 10 to map acc.ver to UCSC-style chromosome names. If you don't want to bother with coming up with all of the relevant logic and just need to quickly convert the seq-ids in an NCBI RefSeq GFF3 file, you can use my script cthreepo (https://github.com/vkkodali/cthreepo) for this purpose.
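
For anyone who prefers to do the mapping by hand, here is a rough sketch of the column 7 to column 10 approach. It assumes the report is tab-delimited with `#` comment lines, and that a column may hold `na` when no name is available. The function names and the sample rows (other than the chr1 values shown above) are made up for illustration, and this is not how cthreepo is implemented.

```python
def load_refseq_to_ucsc(report_lines):
    """Build {RefSeq acc.ver: UCSC-style name} from assembly_report.txt lines."""
    mapping = {}
    for line in report_lines:
        if line.startswith("#") or not line.strip():
            continue  # header/comment or blank line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue
        refseq_acc, ucsc_name = fields[6], fields[9]  # columns 7 and 10
        # Assumption: 'na' marks a missing accession or name.
        if refseq_acc != "na" and ucsc_name != "na":
            mapping[refseq_acc] = ucsc_name
    return mapping


def rename_gff3_seqids(gff_lines, mapping):
    """Yield GFF3 lines with column 1 renamed; drop features on unmapped sequences."""
    for line in gff_lines:
        if line.startswith("#"):
            yield line  # directives and comments are passed through unchanged
            continue
        seqid, sep, rest = line.partition("\t")
        if seqid in mapping:
            yield mapping[seqid] + sep + rest


# First row uses the chr1 values listed above; second row is a made-up placeholder.
report = [
    "# Sequence-Name\tSequence-Role\t...\n",
    "1\tassembled-molecule\t1\tChromosome\tCM000663.2\t=\tNC_000001.11\tPrimary Assembly\t248956422\tchr1\n",
    "EXAMPLE\tunplaced-scaffold\tna\tna\tXX000000.1\t=\tNW_000000000.1\tPrimary Assembly\t1000\tna\n",
]
# Made-up GFF3 lines, for illustration only.
gff = [
    "##gff-version 3\n",
    "NC_000001.11\tRefSeq\tgene\t100\t200\t.\t+\t.\tID=gene0\n",
    "NW_000000000.1\tRefSeq\tgene\t5\t50\t.\t-\t.\tID=gene1\n",
]

mapping = load_refseq_to_ucsc(report)
converted = list(rename_gff3_seqids(gff, mapping))
print("".join(converted), end="")
```

Dropping unmapped features is a design choice made here so that the output only contains names that match the BED file; you may prefer to keep them under their original accession instead.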