Non-Unique RefSeq IDs?
13.1 years ago
Luqman ▴ 30

Hi,

Has anyone else run into RefSeq IDs that do not seem to identify genes uniquely?

I counted 400+ of them in the current RefSeq release. Some examples are: NM_001080141, NM_001080146, NM_001080137, NM_001080138.

In UCSC, these are the coordinates for NM_001080141:
NM_001080141 at chrX:120077416-120080733
NM_001080141 at chrX:120082277-120085594
NM_001080141 at chrX:120096881-120100198
NM_001080141 at chrX:120092020-120095337
NM_001080141 at chrX:120116321-120119638
NM_001080141 at chrX:120067695-120071012
NM_001080141 at chrX:120072556-120075873
NM_001080141 at chrX:120101741-120105058
NM_001080141 at chrX:120106601-120109918
NM_001080141 at chrX:120111461-120114778
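
For anyone wanting to reproduce this kind of count, here is a minimal sketch in Python. It assumes the alignments have already been reduced to (accession, chrom, start, end) tuples; that layout, the function name `find_multi_locus_ids`, and the placeholder accession `NM_HYPOTHETICAL` are made up for illustration and are not the actual UCSC table schema. The three chrX rows are taken from the list above.

```python
from collections import defaultdict

def find_multi_locus_ids(records):
    """Return accessions that appear at more than one distinct locus.

    `records` is an iterable of (accession, chrom, start, end) tuples.
    This simplified layout is an assumption, not the real UCSC schema.
    """
    loci = defaultdict(set)
    for accession, chrom, start, end in records:
        loci[accession].add((chrom, start, end))
    # Keep only accessions with two or more distinct positions.
    return {acc: sorted(pos) for acc, pos in loci.items() if len(pos) > 1}

records = [
    ("NM_001080141", "chrX", 120077416, 120080733),
    ("NM_001080141", "chrX", 120082277, 120085594),
    ("NM_001080141", "chrX", 120096881, 120100198),
    ("NM_HYPOTHETICAL", "chr1", 1000, 2000),  # placeholder single-locus entry
]

multi = find_multi_locus_ids(records)
for acc, positions in multi.items():
    print(acc, "maps to", len(positions), "loci")
```

Running the same grouping over a full table of alignments would give the total number of accessions with more than one placement.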

Which gene database should I use if I want unique ids for every gene and isoform?

Thanks!

refseq identifiers gene • 3.4k views
13.1 years ago

That all of the above examples map to the X chromosome is not a concern to me. There are several segments of X that are duplicated - this is part of the biology of a single X in males. In fact, there is a segment of Y that matches at nearly 100% sequence identity to a segment of X.

13.1 years ago

Use the UCSC knownGene database, where one identifier = one genomic position.

http://bioinformatics.oxfordjournals.org/content/22/9/1036.full

or Ensembl genes: http://genome.cshlp.org/content/14/5/942.abstract


It is useful that UCSC provides an N:1 mapping to Entrez Gene IDs.


@Marcin yes, it is the table kgXref (see http://bioinformatics.oxfordjournals.org/content/22/9/1036.full )

12.7 years ago

If you grab the sequence for NM_001080141.1 (http://www.ncbi.nlm.nih.gov/nuccore/121949793?report=fasta) and then BLAT it against the human genome, you'll find numerous perfect matches. If you only picked one locus for this sequence, it would be an arbitrary choice. UCSC knownGene picks a single entry from the list (120092019-120095337) but it's not obvious to me why this particular locus was picked. While this is annoying, it's the biology of the sequence.
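
If you would rather keep every placement than pick one arbitrarily, one option is to build your own key from the accession plus the locus. A rough sketch in Python follows; the `locus_key` helper and the `accession@chrom:start-end` format are made up for illustration and are not a RefSeq or UCSC convention. The coordinates are three of the hits listed in the question.

```python
def locus_key(accession, chrom, start, end):
    """Combine an accession with its genomic position into one string.

    The 'accession@chrom:start-end' format is an illustrative assumption.
    """
    return f"{accession}@{chrom}:{start}-{end}"

# Three of the placements reported for NM_001080141 in the question.
hits = [
    ("NM_001080141", "chrX", 120077416, 120080733),
    ("NM_001080141", "chrX", 120082277, 120085594),
    ("NM_001080141", "chrX", 120092020, 120095337),
]

keys = [locus_key(*hit) for hit in hits]
for key in keys:
    print(key)
```

Each key is unique as long as no accession is placed twice at exactly the same coordinates, so downstream tools can treat every copy as a separate record.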
