I am trying to use biomaRt in R to retrieve the corresponding refseq_peptide IDs for a list of refseq_dna mRNA transcript IDs. However, for some transcripts no peptide ID is returned, even though other sources clearly indicate an associated peptide for that transcript. For example, "NM_000092" has this problem. I can reproduce the same result with the martview web interface. Here is a link.
[EDIT] - converted URL to tinyurl
You can see that I have queried for refseq_dna equal to NM_000092, and retrieved DNA and protein identifiers in both RefSeq and Ensembl. Only the RefSeq protein ID is empty. If you look at the NCBI record for NM_000092, you'll see that the answer should be NP_000083:
/product="collagen alpha-4(IV) chain precursor"
/protein_id="NP_000083.3"
Also, if I search on bioDBnet's db2db tool, it does find the associated peptide ID.
Furthermore, searching with IDConverter also yields the correct results, and IDConverter explicitly states that its refseq_peptide info comes from Ensembl, which is presumably the same source as biomart.
So why isn't biomart finding some mRNA-peptide associations that other tools are?
Hmm. According to Ensembl, the two transcripts listed differ by one exon, and the protein products are not identical. So it looks like the problem is inconsistencies between RefSeq and Ensembl. These cause trouble because biomart is presumably using Ensembl IDs as an intermediary to do the conversion from RefSeq RNA to RefSeq peptide. I need to handle this programmatically for several hundred problematic transcripts. Is there any package in R/Bioconductor that converts RefSeq RNA to RefSeq peptide without going through Ensembl IDs?
Also, querying for the protein products of a gene is not the same as querying for the protein products of a transcript, and for my application, the difference matters. So querying based on gene name is not really a way for me to fix this.
I tried using DAVIDQuery in place of biomart, but DAVID does gene-centric queries, so it has the same problem I described in my previous comment.
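For the several hundred transcripts mentioned above, a first step could be to list which accessions come back from biomaRt without any peptide, so that only those need a fallback. The following is a minimal sketch, not a tested pipeline. It assumes the Ensembl mart exposes the attributes as `refseq_mrna` and `refseq_peptide` (recent releases use `refseq_mrna`; the `refseq_dna` name used in this thread may only exist in older ones). The helper names `query_biomart` and `transcripts_without_peptide` are made up for illustration, and the `NM_EXAMPLE`/`NP_EXAMPLE` row is a placeholder, not a real accession.

```r
# Hypothetical helper: run the transcript-level query against Ensembl BioMart.
# Needs the biomaRt package and network access; it is not called below.
query_biomart <- function(accessions) {
  mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
  biomaRt::getBM(
    attributes = c("refseq_mrna", "refseq_peptide",
                   "ensembl_transcript_id", "ensembl_peptide_id"),
    filters = "refseq_mrna",  # assumed attribute name, see note above
    values = accessions,
    mart = mart
  )
}

# Return the RefSeq mRNA accessions that have no RefSeq peptide in any row.
transcripts_without_peptide <- function(bm_result) {
  has_peptide <- !is.na(bm_result$refseq_peptide) &
    nzchar(bm_result$refseq_peptide)
  mapped <- unique(bm_result$refseq_mrna[has_peptide])
  setdiff(unique(bm_result$refseq_mrna), mapped)
}

# Offline illustration with a mock result table (second row is a placeholder).
mock_result <- data.frame(
  refseq_mrna = c("NM_000092", "NM_EXAMPLE"),
  refseq_peptide = c("", "NP_EXAMPLE"),
  stringsAsFactors = FALSE
)
unmapped <- transcripts_without_peptide(mock_result)
print(unmapped)
```

With real data, `transcripts_without_peptide(query_biomart(my_ids))` would give the accessions that need a transcript-level lookup elsewhere, where `my_ids` stands for your own vector of accessions.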
The problem is in the mappings, not in the query. This is just a strange case, where RefSeq RNA and RefSeq protein do not map as you would expect - i.e. to the same transcript. If you want a direct mapping, it's probably best to use NCBI resources, as Pierre suggested, rather than BioMart.
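One way to get the direct, transcript-level mapping from NCBI is to read the `/protein_id` qualifier straight out of the GenBank record, as in the NM_000092 excerpt quoted in the question. The sketch below is only an illustration. The parsing part runs offline on that excerpt. The fetching part assumes the `rentrez` package is installed and that NCBI E-utilities accept the accession as the `id` argument, and it is defined but not called here. The function names are hypothetical.

```r
# Pull every /protein_id="..." qualifier out of GenBank flat-file text.
extract_protein_ids <- function(genbank_text) {
  pattern <- '/protein_id="([^"]+)"'
  hits <- unlist(regmatches(genbank_text, gregexpr(pattern, genbank_text)))
  sub(pattern, "\\1", hits)
}

# Drop the trailing ".N" version suffix from an accession.
strip_version <- function(accessions) {
  sub("\\.[0-9]+$", "", accessions)
}

# Hypothetical wrapper: fetch one nucleotide record from NCBI and parse it.
# Requires rentrez and network access; not called in this example.
refseq_rna_to_peptide <- function(accession) {
  gb <- rentrez::entrez_fetch(db = "nuccore", id = accession,
                              rettype = "gb", retmode = "text")
  extract_protein_ids(gb)
}

# Offline check against the record excerpt quoted in the question.
record_excerpt <- c(
  '/product="collagen alpha-4(IV) chain precursor"',
  '/protein_id="NP_000083.3"'
)
protein_ids <- extract_protein_ids(record_excerpt)
print(protein_ids)
print(strip_version(protein_ids))
```

For several hundred accessions it is probably kinder to NCBI to loop with a short pause between requests, or to use a bulk mapping file from NCBI instead of one fetch per transcript.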