UniProt to RefSeq missing entries
9.5 years ago

A large proportion of the UniProt database is not linked to a RefSeq nucleotide ID, for instance Q7KZI7-11, Q7KZI7-13, Q496A3-2, Q8NDM7-3, Q8NDM7-2 and Q8NDM7-5. I've counted about 10,000 of these. Why is this, and is there a way to patch it?

Thanks,
-Jeremy

uniprot RNA-Seq refseq • 1.7k views

Looks like they are isoforms that differ from the canonical sequence. If you drop the (-x) part you will get the original sequence ID, which is linked to a RefSeq entry.
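A minimal sketch of dropping the (-x) part, assuming isoform accessions always end in a hyphen followed by digits. The function name `canonical_accession` is hypothetical and not part of any UniProt tool.

```python
import re

# Assumption: an isoform accession is the canonical accession
# plus a trailing "-<digits>" suffix, e.g. Q7KZI7-11.
_ISOFORM_SUFFIX = re.compile(r"-\d+$")

def canonical_accession(accession: str) -> str:
    """Return the accession with any trailing isoform suffix removed."""
    return _ISOFORM_SUFFIX.sub("", accession.strip())

for acc in ["Q7KZI7-11", "Q496A3-2", "Q8NDM7-5", "O71037"]:
    print(acc, "->", canonical_accession(acc))
```

Keep in mind that the RefSeq entry found this way describes the canonical sequence, so it may not match the isoform exactly.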


Thanks genomax. This is definitely the case for most of the missed mappings, but there are still many others that are not isoform accessions, such as O71037, Q9UKH3 and Q6ZUT4. These may account for 500 or so entries (much better than 10,000 anyway).
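To separate the two groups, something like the following could work. It is only a sketch: `split_unmapped` is a hypothetical helper, and the mapping values are placeholders, not real RefSeq IDs. It assumes you already have a dictionary from canonical UniProt accessions to RefSeq IDs.

```python
import re

def split_unmapped(unmapped, canonical_to_refseq):
    """Split unmapped accessions into isoforms rescued via their
    canonical accession and entries that are still unmapped."""
    rescued = {}
    remaining = []
    for acc in unmapped:
        base = re.sub(r"-\d+$", "", acc)  # drop a trailing "-<digits>" suffix
        if base != acc and base in canonical_to_refseq:
            rescued[acc] = canonical_to_refseq[base]
        else:
            remaining.append(acc)
    return rescued, remaining

# Placeholder values only; substitute your own mapping table.
toy_map = {"Q7KZI7": "REFSEQ_PLACEHOLDER_A", "Q8NDM7": "REFSEQ_PLACEHOLDER_B"}
unmapped = ["Q7KZI7-11", "Q8NDM7-3", "O71037", "Q9UKH3", "Q6ZUT4"]

rescued, remaining = split_unmapped(unmapped, toy_map)
print(len(rescued), "rescued;", len(remaining), "still unmapped:", remaining)
```

The `remaining` list is then the set to inspect by hand or by direct sequence matching.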


The first two entries appear to be retroviral proteins of some sort, and the last one is based on a single mRNA sequence. That is not enough evidence for RefSeq curators to act on. It looks like you may have to exclude these entries from whatever analysis you are doing.


You are correct. After taking a closer look, it appears that many of the remaining entries are either contaminants or "putative uncharacterized proteins". There are others that have an ENST identifier but no RefSeq mapping. I'm now at a point where I've been able to map 98% one way or another, and I suspect that more than half of the remaining 1000 (not 500) entries are non-human contaminants. This is good enough that I can attempt direct sequence matching or annotate by hand. Thanks a lot genomax!
