7.6 years ago
asd
I would like to get the protein domain names, with their start and end positions, for a gene given its name, in R. A Web API is also acceptable.
My goal is to plot DNA mutations at the protein domain level, like the cBioPortal MutationMapper, but I would like to do it programmatically in R. I know that this information is available in the Pfam database, but I don't know how to get that data.
I have read the previous posts on similar topics, but I didn't find a solution. Thank you for your help!
Thank you, bioMart returns the required results, but it returns too many rows, not just those annotated as 'Pfam' and 'low_complexity' on the Pfam website.
How can I annotate the results with these source and domain columns?
EnsEMBL bioMart's HTML output looks buggy: results are returned per transcript, even if you haven't selected the transcript IDs to be returned and even if you request unique results only. However, exporting unique results as a TSV file seems to work as expected.
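To illustrate the filtering and de-duplication discussed above, here is a minimal sketch in Python that works on a toy TSV. The column names (`transcript_id`, `source`, `domain`, `start`, `end`) and all values are made up for illustration and are not the actual bioMart export headers; the helper `unique_domains` is hypothetical. The same logic should carry over to an R data frame.

```python
import csv
import io

# Toy stand-in for a bioMart TSV export. Column names and values are
# ASSUMPTIONS for illustration, not real bioMart headers or TP53 data.
TSV = (
    "transcript_id\tsource\tdomain\tstart\tend\n"
    "TX1\tPfam\tDOMAIN_A\t10\t50\n"
    "TX2\tPfam\tDOMAIN_A\t10\t50\n"
    "TX1\tlow_complexity\tseg\t60\t75\n"
    "TX1\tOtherDB\tDOMAIN_B\t5\t120\n"
    "TX2\tPfam\tDOMAIN_C\t80\t110\n"
)

# Only keep the annotation sources that the Pfam website displays.
KEEP_SOURCES = {"Pfam", "low_complexity"}


def unique_domains(tsv_text, keep_sources=KEEP_SOURCES):
    """Return unique (source, domain, start, end) tuples, in input order."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    seen = set()
    result = []
    for row in reader:
        if row["source"] not in keep_sources:
            continue  # drop rows from other annotation sources
        key = (row["source"], row["domain"], int(row["start"]), int(row["end"]))
        if key not in seen:  # the same domain repeated for another transcript
            seen.add(key)
            result.append(key)
    return result


domains = unique_domains(TSV)
for source, domain, start, end in domains:
    print(source, domain, start, end)
```

Note that this collapses identical rows across transcripts. Rows that differ in coordinates between transcripts are kept, which is usually what you want when isoforms differ.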
For TP53, the bioMart unique TSV contains 17 rows, but the Pfam website shows just 13. For example, BioMart has a domain from 1 to 156, whereas Pfam has one from 1 to 23.
Why is there this difference?
It looks like the unique results in the TSV file still contain results corresponding to different transcripts, and so likely to slightly different proteins. Since you want to locate mutations relative to protein domains, you should consider all proteins produced by a given gene anyway. Note that Pfam has no notion of genes or of the underlying genome: it just annotates proteins from UniProt, usually only the canonical sequence and not the variants, whereas EnsEMBL annotates all proteins.
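As a rough sketch of what considering all proteins of a gene could look like, the Python snippet below keeps domain annotations per protein and looks up which domains contain a given residue position on a given protein. The protein IDs, domain names, and coordinates are hypothetical placeholders, and the helpers `domains_by_protein` and `locate_mutation` are illustrative names, not part of any library.

```python
from collections import defaultdict

# Hypothetical per-protein domain annotations: (protein_id, domain, start, end).
# IDs and coordinates are placeholders, not real EnsEMBL or Pfam data.
ANNOTATIONS = [
    ("PROT1", "DOMAIN_A", 10, 50),
    ("PROT1", "DOMAIN_C", 80, 110),
    ("PROT2", "DOMAIN_A", 10, 50),
]


def domains_by_protein(annotations):
    """Group (domain, start, end) tuples under each protein ID."""
    table = defaultdict(list)
    for protein_id, domain, start, end in annotations:
        table[protein_id].append((domain, start, end))
    return dict(table)


def locate_mutation(protein_id, position, table):
    """Return names of domains on `protein_id` that contain `position`.

    Positions are 1-based residue indices and the end coordinate is
    treated as inclusive (an assumption about the coordinate convention).
    """
    return [
        name
        for name, start, end in table.get(protein_id, [])
        if start <= position <= end
    ]


table = domains_by_protein(ANNOTATIONS)
for pid in sorted(table):
    print(pid, locate_mutation(pid, 90, table))
```

The lookup is done per protein because the same genomic mutation can fall at different residue positions, or in no domain at all, depending on the isoform.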