Understanding NCBI identifiers
8.2 years ago
tlorin ▴ 370

This is a fairly general question about NCBI accession numbers.

Suppose I have this sequence

>myseq
MGQ-----NSPNLLR------LSQ
--TLVGSSLLSSPSSPTTLKVKMPHAFPFLTPDQ-KKELSDIAHKIVAKGKGILAADES-
--TGSVAKRFQSINTENTEENRRLYRQLLFTA-DERAGPCIGGVIFFHETLYQKTDAGKT
FPEHVKSRGWVVGIKVDKGVVPLAGTN-GETTTQ---GLDGL--------YERCAQYKKD
GCDFAKWRCVLKITSTTPSRLAIMENCNVLARYASICQM--HGIVPIVEPEILPDGDHDL
KRTQYVTEKV-LAAMYKALSDHHVYLEGTLLKPNMVTAGHSCSHKYTHQDIAMATITALR
RTVPPAVPG--ITFLSGGQSEEEASINLNVMNQCPLHRPWAITFSYGRALQASALKAWGG
KPGNGKAAQEEFIKRAL------ANSLACQGKYVSSGN-S-A-AAGDSLFVANHAY

I want to BLAST it (blastp against nr), restricting the search to salmon (Salmo salar). I get three roughly equivalent hits corresponding to three different IDs:

NP_001133180.1, CBL79147.1 and NP_001133181.1

I doubt that these are three different genes. So which sequence(s) should I treat as the 'good' one(s)? The most recent? The 'NP' ones? I could not find any detailed information on how NCBI assigns sequence identifiers (but see this). Many thanks for your advice!

ncbi id • 1.9k views

In general, you should use the RefSeq or Swiss-Prot databases for protein searches at NCBI, since they are more likely to contain well-curated representatives.
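As a rough illustration, the accession prefix alone already tells you which kind of record you are looking at. The sketch below is a minimal one: the function name `classify_accession` and the label strings are my own, the prefix table only covers a few RefSeq protein prefixes, and the three-letters-plus-digits pattern for submitter (GenBank/ENA/DDBJ) protein IDs is a simplification.

```python
import re

# A few RefSeq protein prefixes (not an exhaustive list).
REFSEQ_PROTEIN_PREFIXES = {
    "NP_": "RefSeq protein linked to an NM_ transcript record",
    "XP_": "RefSeq protein predicted by automated annotation",
    "YP_": "RefSeq protein without a separate transcript record",
    "WP_": "RefSeq non-redundant protein (mostly prokaryotic)",
}

# Submitter protein IDs: three letters followed by five (or seven) digits.
INSDC_PROTEIN_RE = re.compile(r"[A-Z]{3}\d{5}(\d{2})?")


def classify_accession(accession):
    """Return (source, description) for a protein accession such as 'NP_001133180.1'."""
    base = accession.partition(".")[0]  # drop the '.version' suffix
    for prefix, description in REFSEQ_PROTEIN_PREFIXES.items():
        if base.startswith(prefix):
            return "RefSeq", description
    if INSDC_PROTEIN_RE.fullmatch(base):
        return "INSDC", "submitter record from GenBank/ENA/DDBJ"
    return "unknown", "pattern not covered by this sketch"


if __name__ == "__main__":
    for acc in ("NP_001133180.1", "CBL79147.1", "NP_001133181.1"):
        print(acc, classify_accession(acc))
```

For the three hits in the question, this would flag the two NP_ entries as RefSeq and CBL79147.1 as a submitter record, which is why a RefSeq-only search drops the third hit.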

8.2 years ago
Cliff Beall ▴ 480

Those are nearly, but not exactly, the same protein sequence from the same organism. The differences could reflect duplicated genes, alternative splicing, variation between individuals, or even sequencing errors.

nr is non-redundant only in the sense that identical sequences are merged, so it still has a separate entry for every distinct protein sequence that someone has put into the databases. You would need to follow up on the publications listed in the entries to track down exactly what is going on.
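To make the "identical sequences only" point concrete, here is a small sketch that groups hits by exact sequence. The function name `group_identical` and the toy sequences are made up for illustration; they are not the real salmon records.

```python
from collections import defaultdict


def group_identical(records):
    """Group accessions whose sequences are exactly identical.

    `records` maps an accession to its protein sequence. Gap characters
    are stripped and case is ignored before comparing.
    """
    groups = defaultdict(list)
    for accession, sequence in records.items():
        key = sequence.replace("-", "").upper()
        groups[key].append(accession)
    # Sort for a stable, readable result.
    return sorted(sorted(members) for members in groups.values())


if __name__ == "__main__":
    # Toy data: hitA and hitB are identical, hitC differs by one residue.
    toy_hits = {
        "hitA": "MKTLVGSS",
        "hitB": "MKTLVGSS",
        "hitC": "MKTIVGSS",
    }
    print(group_identical(toy_hits))
```

A single substitution is enough to keep two entries apart, so near-identical hits from one organism are expected in nr and say nothing by themselves about how many genes there are.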


I know that nr is supposed to be non-redundant (that's why I use it), but then why are there only two hits left (NP_001133180.1 and NP_001133181.1) when we BLAST the sequence against the RefSeq database? Which one should I trust 'in general'? It seems to me that everyone has a 'feeling' about this, but I cannot find any way of being sure (based on the ID alone) ;-) I agree that we can manually check author statements, contig IDs, etc. It's just that for a large number of genes this is not feasible, and filtering out redundant sequences based on the ID should be possible :)


Those two RefSeq IDs have not yet been subjected to final NCBI review, so it is possible that they will be collapsed into one entry once that review happens.
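The review status is stated in the COMMENT block of the RefSeq flat file, so it can be checked programmatically. The sketch below assumes you already have the GenPept text as a string; the function name `refseq_status` is hypothetical and the sample record is a shortened, illustrative excerpt rather than the real entry.

```python
import re

# RefSeq flat files open the COMMENT block with "<STATUS> REFSEQ:",
# for example PROVISIONAL, VALIDATED or REVIEWED.
STATUS_RE = re.compile(r"^COMMENT\s+([A-Z]+) REFSEQ:", re.MULTILINE)


def refseq_status(genpept_text):
    """Return the RefSeq status keyword from a flat-file record, or None."""
    match = STATUS_RE.search(genpept_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Illustrative excerpt only, not a complete record.
    sample = """LOCUS       EXAMPLE_RECORD
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review.
"""
    print(refseq_status(sample))
```

Records that return `None` are either not RefSeq entries or use a comment layout that this simple pattern does not cover.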
