Biomart Yields Incomplete Results When Converting Refseq_Dna To Refseq_Peptide
14.2 years ago
Ryan Thompson ★ 3.6k

I am trying to use biomaRt in R to retrieve the corresponding refseq_peptide IDs for a list of refseq_dna mRNA transcript IDs. However, for some transcripts, no peptide ID is returned, even though other sources clearly indicate an associated peptide for that transcript. For example, "NM_000092" has this problem. I can reproduce the same result using the martview web interface. Here is a link.

[EDIT] - converted URL to tinyurl

http://tinyurl.com/37e6s6e

You can see that I have queried for refseq_dna equal to NM_000092 and retrieved DNA and protein identifiers in both RefSeq and Ensembl. Only the RefSeq protein ID is empty. If you look at the NCBI record for NM_000092, you'll see that the answer should be NP_000083:

/product="collagen alpha-4(IV) chain precursor"
/protein_id="NP_000083.3"
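
For checking more than a handful of records, the `/protein_id` qualifier can be pulled out of the flat-file text programmatically. This is only a minimal sketch in Python: it assumes the record text is already available as a string, and `protein_ids` is a made-up helper name, not a function from any library.

```python
import re

# Matches a GenBank feature qualifier such as /protein_id="NP_000083.3"
PROTEIN_ID_RE = re.compile(r'/protein_id="([^"]+)"')


def protein_ids(record_text, strip_version=True):
    """Return the protein accessions listed in GenBank flat-file text.

    Hypothetical helper written for this illustration only.
    """
    ids = PROTEIN_ID_RE.findall(record_text)
    if strip_version:
        # "NP_000083.3" -> "NP_000083"
        ids = [acc.split(".", 1)[0] for acc in ids]
    return ids


# The two qualifier lines quoted above from the NCBI record.
snippet = '''
/product="collagen alpha-4(IV) chain precursor"
/protein_id="NP_000083.3"
'''

print(protein_ids(snippet))                       # ['NP_000083']
print(protein_ids(snippet, strip_version=False))  # ['NP_000083.3']
```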

Also, if I search on bioDBnet's db2db tool, it does find the associated peptide ID.

Furthermore, searching with IDConverter also yields the correct results, and IDConverter explicitly states that its refseq_peptide info comes from Ensembl, which is presumably the same source as biomart.

So why isn't biomart finding some mRNA-peptide associations that other tools do find?

biomart conversion • 3.8k views
14.2 years ago
Neilfws 49k

I think that this occurs due to some subtleties in both the way that BioMart works and the way RefSeq defines reference mRNAs and their products.

Try this query instead: http://tinyurl.com/27d8nk5

It uses the HGNC symbol for the gene (COL4A4), in place of the RefSeq mRNA. You should see a result like this:

[Image: BioMart results for COL4A4]

This shows that there are 2 Ensembl transcripts. One of them maps to the RefSeq mRNA, the other maps to the RefSeq protein.

It is a little difficult to determine what the RefSeq curators had in mind here! The protein product of each Ensembl transcript is the same length. Presumably, someone has decided that the reference mRNA in RefSeq should map to one of the transcripts, but the reference protein should map to the other.

I guess the conclusion is: try different search terms if you don't see what you expected.
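
To make the mapping issue concrete, here is a toy sketch in Python. The transcript IDs `ENST_A` and `ENST_B` are placeholders, not the real Ensembl identifiers, and the table only mimics the situation described above: one transcript carries the RefSeq mRNA cross-reference and the other carries the RefSeq protein cross-reference. It is not how BioMart is implemented, just an illustration of why a per-transcript lookup comes back empty while a per-gene lookup does not.

```python
# Toy cross-reference table: one row per Ensembl transcript of COL4A4.
# ENST_A / ENST_B are placeholder IDs invented for this illustration.
xrefs = [
    {"gene": "COL4A4", "transcript": "ENST_A",
     "refseq_dna": "NM_000092", "refseq_peptide": None},
    {"gene": "COL4A4", "transcript": "ENST_B",
     "refseq_dna": None, "refseq_peptide": "NP_000083"},
]


def peptides_by_mrna(rows, mrna):
    """Peptide IDs found on the same transcript row as the given mRNA."""
    return [r["refseq_peptide"] for r in rows
            if r["refseq_dna"] == mrna and r["refseq_peptide"] is not None]


def peptides_by_gene(rows, gene):
    """Peptide IDs found on any transcript row of the given gene."""
    return [r["refseq_peptide"] for r in rows
            if r["gene"] == gene and r["refseq_peptide"] is not None]


print(peptides_by_mrna(xrefs, "NM_000092"))  # [] : nothing on that transcript
print(peptides_by_gene(xrefs, "COL4A4"))     # ['NP_000083']
```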

Hmm. According to Ensembl, the two transcripts listed differ by one exon, and the protein products are not identical. So it looks like the underlying issue is an inconsistency between RefSeq and Ensembl, which matters because biomart is presumably using Ensembl IDs as an intermediary to do the conversion from RefSeq RNA to RefSeq peptide. I need to handle this programmatically for several hundred problematic transcripts. Is there any package in R/Bioconductor that converts RefSeq RNA to RefSeq peptide without going through Ensembl IDs?

Also, querying for the protein products of a gene is not the same as querying for the protein products of a transcript, and for my application, the difference matters. So querying based on gene name is not really a way for me to fix this.

I tried using DAVIDQuery in place of biomart, but DAVID does gene-centric queries, so it has the same problem as my previous comment.

The problem is in the mappings, not in the query. This is just a strange case, where RefSeq RNA and RefSeq protein do not map as you would expect - i.e. to the same transcript. If you want a direct mapping, it's probably best to use NCBI resources, as Pierre suggested, rather than BioMart.
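
A rough sketch of the direct-mapping route, in Python rather than R: given a tab-delimited table that pairs each RefSeq RNA accession with its protein accession, the conversion never goes through Ensembl transcripts. The code assumes columns named `RNA_nucleotide_accession.version` and `protein_accession.version` (NCBI's gene2refseq file is, as far as I know, laid out this way, but check the header of whatever file you download). The sample rows are illustrative only, including the version number on the RNA accession and the made-up `NR_000000.1` entry, and the helper names are hypothetical.

```python
import csv
import io

RNA_COL = "RNA_nucleotide_accession.version"  # assumed column name
PROT_COL = "protein_accession.version"        # assumed column name


def strip_version(accession):
    """'NP_000083.3' -> 'NP_000083'."""
    return accession.split(".", 1)[0]


def build_rna_to_protein(handle):
    """Map unversioned RNA accessions to sets of unversioned protein accessions."""
    mapping = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        rna, prot = row[RNA_COL], row[PROT_COL]
        if rna in ("", "-") or prot in ("", "-"):
            continue  # "-" is treated as a missing value here (assumption)
        mapping.setdefault(strip_version(rna), set()).add(strip_version(prot))
    return mapping


# Illustrative table; the second row is a made-up non-coding entry.
sample = (
    "RNA_nucleotide_accession.version\tprotein_accession.version\n"
    "NM_000092.4\tNP_000083.3\n"
    "NR_000000.1\t-\n"
)

rna_to_protein = build_rna_to_protein(io.StringIO(sample))
print(rna_to_protein.get("NM_000092", set()))  # {'NP_000083'}
```

Returning a set per RNA accession keeps the lookup honest if a table ever lists more than one protein for the same unversioned accession.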

14.1 years ago
Uma • 0

bioDBnet's dbWalk tool can be used to define the path to be used for conversions. So in this case the bioDBnet path would be 'RefSeq mRNA Accession->RefSeq Protein Accession'.
