Question

Using biomaRt to convert gene symbols to entrez id in dataframe of gene-sets

6.5 years ago

jack.bakewell ▴ 20

Hi

I have a large data frame of gene sets whose members are given as gene symbols. I'm trying to use biomaRt to convert these symbols into Entrez IDs so that I can run the sets through MAGMA. However, I can't work out how to convert the data frame without destroying the structure of the gene sets. I always get a data frame that just lists the IDs of the genes in my sets, with no indication of which set each gene belongs to, which isn't very useful for me. Can anyone help?

I thought this would work to preserve the headers of the original data frame but I still just get a list:

sczccgenesid <- getBM(attributes = c("hgnc_symbol", "entrezgene"), filters = "hgnc_symbol", values = sczccgenes2, bmHeader = T, mart = ensembl)

Thanks

R biomaRt entrez • 15k views

ADD COMMENT • link updated 5.3 years ago by zx8754 12k • written 6.5 years ago by jack.bakewell ▴ 20

Merge by gene symbol with your data frame. The output of getBM is a data frame with two columns, gene symbol and NCBI gene ID. Merge that data frame with your own by gene symbol. If you do not know how to merge data frames, post some example data from the big data frame here.
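For illustration, here is a minimal sketch of what such a merge could look like in base R. The `map` table is a hand-made stand-in for the getBM result (so the example runs offline), and the column names `set`, `pos` and `symbol` are my own choices, not anything biomaRt produces. The idea is to reshape the wide table to one row per gene, merge, and then regroup by set.

```r
# Toy gene sets in the wide layout from the question (set name, then symbols).
sets <- data.frame(
  V1 = c("SET_A", "SET_B"),
  V2 = c("ABCA13", "ACACA"),
  V3 = c("ANK2", "NOT_A_GENE"),
  stringsAsFactors = FALSE
)

# Hand-made stand-in for the getBM() result: one row per symbol.
map <- data.frame(
  hgnc_symbol  = c("ABCA13", "ANK2", "ACACA"),
  ncbi_gene_id = c("154664", "287", "31"),
  stringsAsFactors = FALSE
)

# 1. Reshape to long form: one row per (set, position, symbol).
#    unlist() walks the columns in order, so set names repeat once per column.
long <- data.frame(
  set    = rep(sets$V1, times = ncol(sets) - 1),
  pos    = rep(seq_len(ncol(sets) - 1), each = nrow(sets)),
  symbol = unlist(sets[-1], use.names = FALSE),
  stringsAsFactors = FALSE
)

# 2. Merge by gene symbol; all.x = TRUE keeps symbols without a match (ID is NA).
merged <- merge(long, map, by.x = "symbol", by.y = "hgnc_symbol", all.x = TRUE)

# 3. Regroup: one vector of IDs per set, in the original gene order.
merged  <- merged[order(merged$set, merged$pos), ]
id_sets <- split(merged$ncbi_gene_id, merged$set)
id_sets
```

Keeping the position column means the gene order within each set survives the merge, which otherwise sorts rows by symbol.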

ADD REPLY • link 6.5 years ago by cpad0112 21k

Thanks! But I don't know how to merge them by gene symbol. Can you point me to any useful webpages or packages that explain this?

My data set looks like this:

ACEVEDO_LIVER_CANCER_WITH_H3K9ME3_UP    ABCA13  ABCB5   AK5 AMOTL1  ANGPTL5 ANK2    ARHGAP18    ARMC3   B3GNT2  BMPER   C1orf88
ACEVEDO_METHYLATED_IN_LIVER_CANCER_DN   A2ML1   ABCA11P ABCC8   ABO ACCN2   ACSL1   ACSS3   ADAD1   ADAM11  ADAM2   ADAM5P
ACEVEDO_NORMAL_TISSUE_ADJACENT_TO_LIVER_TUMOR_DN    AARS    ABCC2   ABHD10  ABHD14B ABHD6   ACADSB  ACAT2   ACTG1   ADH4    ADH6    ADI1
ACTGCAG_MIR173P ACACA   ACTR1A  ANKRD50 ARID2   ARMC8   BCORL1  BRWD3   BTF3    C20orf20    CACNA2D4    CAMK2D
ACTGCCT_MIR34B  ABCC1   ACACA   ACSL1   ACTL6A  ACTR1A  ADCY2   ADORA2A AHCYL2  AKAP1   ALCAM   ANGPTL7
ACTGTAG_MIR139  AEBP2   AKIRIN2 ANK2    AP1S2   AP3M1   APLP2   ARRDC3  ATP2B2  ATRX    ATXN1   AUTS2
ACTGTGA_MIR27A_MIR27B   ABCA1   ABCB9   ACOT11  ACVR1   ACVR2A  ADAM19  ADAMTS10    ADCY3   ADCY6   ADORA2B AFAP1
AFP1_Q6 AARS2   ABLIM1  ACVR1B  ADAM11  AFF3    AGL ANKRD28 ANKRD39 ANKS1B  ARHGDIB ARL3
AGGCACT_MIR5153P    ARHGEF12    ARIH1   ARL4C   BAZ1B   BAZ2A   BCOR    BICD2   BTBD3   C10orf140   C11orf87    C14orf45
AGGTGCA_MIR500  ABCC4   ADAMTSL3    ANKRD13A    ATL1    B4GALNT3    BTBD11  C17orf74    C5orf30 CA10    CACNB1  CAMK4

But formatted by rows, the sets are a lot longer and there's about 1600 of them.
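In case it helps, a row-per-set file like this can also be read into a named list rather than a rectangular data frame, which copes with sets of different lengths. This is only a sketch and assumes the file is tab-separated with the set name in the first field; the temporary file and the name `gene_sets` are just for the example.

```r
# Write a small example file in the same layout (tab-separated, ragged rows).
path <- tempfile(fileext = ".txt")
writeLines(c("SET_A\tABCA13\tABCB5\tAK5", "SET_B\tACACA\tACTR1A"), path)

# Read each line, split on tabs, and use the first field as the set name.
fields    <- strsplit(readLines(path), "\t", fixed = TRUE)
gene_sets <- lapply(fields, function(x) x[-1])
names(gene_sets) <- vapply(fields, function(x) x[1], character(1))

# Number of genes in each set.
lengths(gene_sets)
```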

Thanks

ADD REPLY • link updated 6.5 years ago by GenoMax 147k • written 6.5 years ago by jack.bakewell ▴ 20

Could you please reformat your post so that the data frame is readable? You can look at the merge examples here: https://www.statmethods.net/management/merging.html

ADD REPLY • link 6.5 years ago by cpad0112 21k

Looks like genomax reformatted it for me. Thanks genomax

ADD REPLY • link 6.5 years ago by jack.bakewell ▴ 20

For future posts: you can apply this formatting by selecting the text and clicking the 101010 (code) button, which is in the toolbar whenever you compose or edit a post.

ADD REPLY • link 6.5 years ago by WouterDeCoster 47k

I am assuming every row has different number of genes?

ADD REPLY • link 6.5 years ago by zx8754 12k

Hi

I am new to this R and Bioinformatics game. I tried using the code posted by cpad0112 with my dataset and got the following error.

> test=getBM(attributes = c("hgnc_symbol", "entrezgene"), filters = "hgnc_symbol", values = v4$Gene_ID, bmHeader = T, mart = mart)
>Error in getBM(attributes = c("hgnc_symbol", "entrezgene"), filters = "hgnc_symbol",  : 
>Invalid attribute(s): hgnc_symbol, entrezgene 
>Please use the function 'listAttributes' to get valid attribute names

I think I have found a workaround by using listAttributes to find alternative attributes, but I would appreciate it if somebody could confirm which attribute people use in place of hgnc_symbol, or correct any errors in my code.
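A hedged note on searching the attribute list: as far as I know, recent Ensembl releases name the Entrez attribute entrezgene_id rather than entrezgene, and since hgnc_symbol is also reported as invalid here, it may be worth checking that the mart was created with the hsapiens_gene_ensembl dataset. The sketch below filters a table shaped like the output of listAttributes (columns `name` and `description`). The helper `find_attributes` and the three-row table are made up so the example runs offline; the descriptions may not match the current release.

```r
# Return attribute rows whose name or description matches a pattern.
# `attrs` is expected to look like the data frame from biomaRt::listAttributes(mart).
find_attributes <- function(attrs, pattern) {
  hit <- grepl(pattern, attrs$name, ignore.case = TRUE) |
    grepl(pattern, attrs$description, ignore.case = TRUE)
  attrs[hit, , drop = FALSE]
}

# Hand-made stand-in table (illustrative rows only).
attrs <- data.frame(
  name        = c("ensembl_gene_id", "hgnc_symbol", "entrezgene_id"),
  description = c("Gene stable ID", "HGNC symbol", "NCBI gene (formerly Entrezgene) ID"),
  stringsAsFactors = FALSE
)
res <- find_attributes(attrs, "entrez|hgnc")
res

# With a live connection one would instead run something like:
# mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# find_attributes(biomaRt::listAttributes(mart), "entrez|hgnc")
```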

thanks

Peter

ADD REPLY • link updated 5.3 years ago by GenoMax 147k • written 5.3 years ago by peter.berry5 ▴ 60

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to original question.

ADD REPLY • link 5.3 years ago by GenoMax 147k

zx8754 · Accepted Answer · 2018-06-10

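library(biomaRt)

# Column 1 is the set name; the remaining columns are gene symbols
df <- read.csv("gs.txt", sep = "\t", header = FALSE, stringsAsFactors = FALSE, strip.white = TRUE)
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

# Unique, non-empty symbols across all sets (sort() also drops NAs)
gene_list <- sort(unique(unlist(df[-1], use.names = FALSE)))
gene_list <- gene_list[gene_list != ""]

# Recent Ensembl releases call this attribute "entrezgene_id" (older ones used
# "entrezgene"); confirm the current name with listAttributes(mart)
test <- getBM(attributes = c("hgnc_symbol", "entrezgene_id"), filters = "hgnc_symbol", values = gene_list, mart = mart)
colnames(test) <- c("hgnc_symbol", "ncbi_gene_id")
test$ncbi_gene_id <- as.character(test$ncbi_gene_id)
# Keep the first ID per symbol so that each symbol maps to exactly one value
test <- test[!duplicated(test$hgnc_symbol), ]

# Replace each symbol in place via match(); symbols with no hit become NA
new_df <- df
new_df[-1] <- lapply(new_df[-1], function(x) test$ncbi_gene_id[match(x, test$hgnc_symbol)])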

input:

> df
                                                 V1       V2       V3       V4      V5       V6     V7       V8      V9       V10
1              ACEVEDO_LIVER_CANCER_WITH_H3K9ME3_UP   ABCA13    ABCB5      AK5  AMOTL1  ANGPTL5   ANK2 ARHGAP18   ARMC3    B3GNT2
2             ACEVEDO_METHYLATED_IN_LIVER_CANCER_DN    A2ML1  ABCA11P    ABCC8     ABO    ACCN2  ACSL1    ACSS3   ADAD1    ADAM11
3  ACEVEDO_NORMAL_TISSUE_ADJACENT_TO_LIVER_TUMOR_DN     AARS    ABCC2   ABHD10 ABHD14B    ABHD6 ACADSB    ACAT2   ACTG1      ADH4
4                                   ACTGCAG_MIR173P    ACACA   ACTR1A  ANKRD50   ARID2    ARMC8 BCORL1    BRWD3    BTF3  C20orf20
5                                    ACTGCCT_MIR34B    ABCC1    ACACA    ACSL1  ACTL6A   ACTR1A  ADCY2  ADORA2A  AHCYL2     AKAP1
6                                    ACTGTAG_MIR139    AEBP2  AKIRIN2     ANK2   AP1S2    AP3M1  APLP2   ARRDC3  ATP2B2      ATRX
7                             ACTGTGA_MIR27A_MIR27B    ABCA1    ABCB9   ACOT11   ACVR1   ACVR2A ADAM19 ADAMTS10   ADCY3     ADCY6
8                                           AFP1_Q6    AARS2   ABLIM1   ACVR1B  ADAM11     AFF3    AGL  ANKRD28 ANKRD39    ANKS1B
9                                  AGGCACT_MIR5153P ARHGEF12    ARIH1    ARL4C   BAZ1B    BAZ2A   BCOR    BICD2   BTBD3 C10orf140
10                                   AGGTGCA_MIR500    ABCC4 ADAMTSL3 ANKRD13A    ATL1 B4GALNT3 BTBD11 C17orf74 C5orf30      CA10
        V11      V12
1     BMPER  C1orf88
2     ADAM2   ADAM5P
3      ADH6     ADI1
4  CACNA2D4   CAMK2D
5     ALCAM  ANGPTL7
6     ATXN1    AUTS2
7   ADORA2B    AFAP1
8   ARHGDIB     ARL3
9  C11orf87 C14orf45
10   CACNB1    CAMK4

output (please cross-check whether the NAs are genuine, i.e. symbols with no NCBI gene ID):

> new_df
                                                 V1     V2     V3    V4     V5     V6     V7     V8     V9   V10    V11   V12
1              ACEVEDO_LIVER_CANCER_WITH_H3K9ME3_UP 154664 340273 26289 154810 253935    287  93663 219681 10678 168667  <NA>
2             ACEVEDO_METHYLATED_IN_LIVER_CANCER_DN 144568   <NA>  6833     28   <NA>   2180  79611 132612  4185   2515  <NA>
3  ACEVEDO_NORMAL_TISSUE_ADJACENT_TO_LIVER_TUMOR_DN     16   1244 55347  84836  57406     36     39     71   127    130 55256
4                                   ACTGCAG_MIR173P     31  10121 57182 196528  25852  63035 254065    689  <NA>  93589   817
5                                    ACTGCCT_MIR34B   4363     31  2180     86  10121    108    135  23382  8165    214 10218
6                                    ACTGTAG_MIR139 121536  55122   287   8905  26985    334  57561    491   546   6310 26053
7                             ACTGTGA_MIR27A_MIR27B     19  23457 26027     90     92   8728  81794    109   112    136 60312
8                                           AFP1_Q6  57505   3983    91   4185   3899    178  23243  51239 56899    397   403
9                                  AGGCACT_MIR5153P  23365  25820 10123   9031  11176  54880  23299  22903  <NA> 399947  <NA>
10                                   AGGTGCA_MIR500  10257  57188 88455  51062 283358 121551   <NA>  90355 56934    782   814
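To make that cross-check concrete, here is a small sketch that lists the symbols that ended up as NA and then builds one line per set with the NAs dropped. It uses toy versions of `df` and `new_df` with the same shape as above; the names `unmapped` and `lines` are mine. Some NAs may simply be outdated symbols (for example, I believe ACCN2 is now ASIC2), so the list is worth checking by hand. I believe MAGMA reads a row-based set file (set name followed by gene IDs), but please confirm the exact format in the MAGMA manual.

```r
# Toy versions of df (symbols) and new_df (IDs) with the same shape.
df <- data.frame(V1 = c("SET_A", "SET_B"),
                 V2 = c("ABCA13", "ACACA"),
                 V3 = c("C1orf88", "C20orf20"),
                 stringsAsFactors = FALSE)
new_df <- data.frame(V1 = c("SET_A", "SET_B"),
                     V2 = c("154664", "31"),
                     V3 = c(NA_character_, NA_character_),
                     stringsAsFactors = FALSE)

# Symbols that were present in the input but have no ID in the output.
sym <- as.matrix(df[-1])
ids <- as.matrix(new_df[-1])
unmapped <- sort(unique(sym[is.na(ids) & !is.na(sym) & sym != ""]))
unmapped

# One line per set with NAs dropped: set name followed by its IDs.
lines <- vapply(seq_len(nrow(new_df)), function(i) {
  row_ids <- ids[i, ]
  paste(c(new_df$V1[i], row_ids[!is.na(row_ids)]), collapse = "\t")
}, character(1))
lines
# writeLines(lines, "gene_sets_entrez.txt")  # hypothetical output file name
```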