Question

Get protein domain information from gene name in R

2

7.6 years ago

asd ▴ 20

I would like to retrieve the protein domain names, with their start and end positions, for a gene given its name, in R. A web API would also be acceptable.

My goal is to plot DNA mutations at the protein domain level, like the cBioPortal MutationMapper, but I would like to do it programmatically in R. I know that this information is available in the Pfam database, but I don't know how to get that data.

I have read previous posts on similar topics, but I did not find a solution. Thank you for your help!

R package protein • 4.7k views

Updated 6.8 years ago by Biostar 20 • written 7.6 years ago by asd ▴ 20

score 1 · Answer 1 · 2017-05-03

1

7.6 years ago

Jean-Karim Heriche 27k

You can do this using Ensembl. Use either the BioMart interface or the Perl API.

EDIT: Forgot the R bit: there's the biomaRt Bioconductor package.
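A minimal sketch of what the biomaRt route might look like, offered as an illustration rather than as the answerer's own code. It assumes the human dataset `hsapiens_gene_ensembl`, the filter `hgnc_symbol`, and the attribute names `pfam`, `pfam_start` and `pfam_end`; these should be confirmed with `biomaRt::listAttributes()` and `biomaRt::listFilters()` for your Ensembl release. The helper name `fetch_pfam_domains` is hypothetical.

```r
# Requires the Bioconductor package biomaRt and network access to Ensembl.
# fetch_pfam_domains() is a hypothetical helper name, not part of biomaRt.
fetch_pfam_domains <- function(symbol, dataset = "hsapiens_gene_ensembl") {
  # Connect to the Ensembl genes mart for the chosen dataset.
  mart <- biomaRt::useEnsembl(biomart = "genes", dataset = dataset)

  # Attribute and filter names are assumptions; verify with listAttributes(mart).
  res <- biomaRt::getBM(
    attributes = c("hgnc_symbol", "ensembl_transcript_id",
                   "ensembl_peptide_id", "pfam", "pfam_start", "pfam_end"),
    filters    = "hgnc_symbol",
    values     = symbol,
    mart       = mart
  )

  # Drop rows without a Pfam hit (returned as NA or as an empty string).
  res[!is.na(res$pfam) & res$pfam != "", , drop = FALSE]
}

# Usage (performs a live query, so it is left commented out):
# domains <- fetch_pfam_domains("TP53")
# head(domains)
```

The result has one row per transcript and domain, so the same Pfam accession can appear several times for one gene.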

Answer • 7.6 years ago by Jean-Karim Heriche 27k

0

Thank you. BioMart returns the required results, but it returns too many rows, not just those annotated as 'Pfam' and 'low_complexity' on the Pfam website.

How can I annotate the results with the source and domain columns?

Reply • 7.6 years ago by asd ▴ 20

0

The HTML output of Ensembl BioMart looks buggy: results are returned per transcript, even if you have not selected the transcript IDs to be returned and even if you request unique results only. However, exporting unique results as a TSV file seems to work as expected.
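If the table is already loaded in R, a similar de-duplication can be done there. The sketch below uses an invented toy table (the column names and all values are made up for illustration, not real BioMart output) and simply drops the transcript column before calling `unique()`.

```r
# Toy table shaped like a per-transcript export; all values are invented.
hits <- data.frame(
  transcript = c("T1", "T1", "T2", "T2", "T3"),
  domain     = c("D_A", "D_B", "D_A", "D_B", "D_A"),
  start      = c(10, 100, 10, 100, 10),
  end        = c(50, 200, 50, 200, 40),
  stringsAsFactors = FALSE
)

# Ignore the transcript column, then keep one row per distinct domain span.
unique_spans <- unique(hits[, c("domain", "start", "end")])
print(unique_spans)
```

Spans that differ between transcripts stay as separate rows, so the row count can still exceed the number of distinct domain names.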

Reply • 7.6 years ago by Jean-Karim Heriche 27k

0

For TP53, the unique TSV from BioMart contains 17 rows, but the Pfam website lists just 13. BioMart has a domain from 1 to 156, whereas Pfam has 1 to 23.

Why is there this difference?

Reply • 7.6 years ago by asd ▴ 20

0

It looks like the unique results in the TSV file still contain results corresponding to different transcripts, and so likely to slightly different proteins. Since you want to locate mutations relative to protein domains, you should in any case consider all proteins produced by a given gene. Note that Pfam has no notion of genes or of the underlying genome: it just annotates proteins from UniProt, usually only the canonical sequence and not the variants, whereas Ensembl annotates all proteins.
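As a rough illustration of considering every protein of a gene, the sketch below looks up which domains contain a given amino-acid position, protein by protein. The table, its column names and the helper `domains_at` are hypothetical, and the coordinates are invented. It also assumes the position is already expressed in each protein's own coordinates; in practice a genomic mutation can map to a different residue number in each isoform.

```r
# Toy domain table: one row per (protein, domain); coordinates are invented.
domains <- data.frame(
  protein = c("P1", "P1", "P2"),
  domain  = c("D_A", "D_B", "D_A"),
  start   = c(10, 100, 10),
  end     = c(50, 200, 40),
  stringsAsFactors = FALSE
)

# Return every (protein, domain) pair whose span contains position `pos`.
# domains_at() is a hypothetical helper, not part of any package.
domains_at <- function(domains, pos) {
  hit <- domains$start <= pos & pos <= domains$end
  domains[hit, c("protein", "domain"), drop = FALSE]
}

hits_45  <- domains_at(domains, 45)   # falls in D_A of P1, but past D_A of P2
hits_150 <- domains_at(domains, 150)  # falls in D_B of P1 only
```

Keeping the protein column in the result makes it explicit when a position lies inside a domain in one isoform but outside it in another.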

Reply • 7.6 years ago by Jean-Karim Heriche 27k