I downloaded human.1.protein.faa.gz from ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/, which includes protein sequences for mitochondrial proteins such as this one:
>gi|251831107|ref|YP_003024026.1| NADH dehydrogenase subunit 1 (mitochondrion) [Homo sapiens]
MPMANLLLLIVPILIAMAFLMLTERKILGYMQLRKGPNVVGPYGLLQPFADAMKLFTKEPLKPATSTITLYITAPTLALT
IALLLWTPLPMPNPLVNLNLGLLFILATSSLAVYSILWSGWASNSNYALIGALRAVAQTISYEVTLAIILLSTLLMSGSF
NLSTLITTQEHLWLLLPSWPLAMMWFISTLAETNRTPFDLAEGESELVSGFNIEYAAGPFALFFMAEYTNIIMMNTLTTT
IFLGTTYDALSPELYTTYFVTKTLLLTSLFLWIRTAYPRFRYDQLMHLLWKNFLPLTLALLMWYVSMPITISSIPPQT
The difficulty is that I can't get biomaRt to recognize the id YP_003024026. human.1.protein.faa includes refseq ids like NP_12345, which biomaRt recognizes as "refseq_peptide", and XP_12345, which biomaRt recognizes as "refseq_peptide_predicted". I can't figure out how to get it to recognize the YP sequences. I want to find the corresponding entrez id.
In the meantime, there are only 13 YP_ sequences, so I've solved this problem "by hand" by going to http://www.ncbi.nlm.nih.gov/protein/YP_003024026 and scrolling down to find the GeneID. I'd prefer to do it "the right way" in case the ids change in the future.
Any advice?
According to this README, YP seems to be another prefix for curated RefSeq proteins. That would explain why YP ids should be looked up in BioMart as RefSeq Protein IDs, the same way as NP ids, and not as predicted proteins like XP.
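
To avoid sorting the ids by hand, the prefix rule discussed above could be applied directly to the FASTA headers. The following is only a minimal sketch in Python, not a biomaRt call: the names PREFIX_TO_FILTER, parse_header and group_by_filter are hypothetical, the NP and XP mappings come from the question, and the YP mapping to "refseq_peptide" is an assumption based on the suggestion in this answer. The resulting lists of ids could then be passed to biomaRt under the matching filter name.

```python
import re

# Accession prefix -> biomaRt filter name.
# NP and XP are taken from the question; YP -> "refseq_peptide" is an
# assumption based on the answer above and has not been verified here.
PREFIX_TO_FILTER = {
    "NP": "refseq_peptide",
    "XP": "refseq_peptide_predicted",
    "YP": "refseq_peptide",
}

# Matches the "|ref|YP_003024026.1|" part of an NCBI FASTA header and
# captures the accession without its version suffix.
HEADER_RE = re.compile(r"\|ref\|([A-Z]{2}_\d+)(?:\.\d+)?\|")


def parse_header(header):
    """Return (accession, filter_name) for one FASTA header line.

    Returns None if no RefSeq accession is found; filter_name is None
    if the prefix is not in PREFIX_TO_FILTER.
    """
    match = HEADER_RE.search(header)
    if match is None:
        return None
    accession = match.group(1)
    prefix = accession.split("_", 1)[0]
    return accession, PREFIX_TO_FILTER.get(prefix)


def group_by_filter(headers):
    """Group accessions by filter name, skipping unrecognized headers."""
    groups = {}
    for header in headers:
        parsed = parse_header(header)
        if parsed is None or parsed[1] is None:
            continue
        accession, filter_name = parsed
        groups.setdefault(filter_name, []).append(accession)
    return groups


example = (
    ">gi|251831107|ref|YP_003024026.1| NADH dehydrogenase subunit 1 "
    "(mitochondrion) [Homo sapiens]"
)
print(parse_header(example))
```

The version suffix (.1) is dropped because the ids in the question are written without it; whether biomaRt also accepts versioned ids is not something this sketch checks.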