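library(biomaRt)
library(tidyr)
library(forcats)
# Read the gene sets: column 1 is the set name, the remaining columns hold HGNC symbols.
df <- read.csv("gs.txt", sep = "\t", header = FALSE, stringsAsFactors = FALSE, strip.white = TRUE)
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# Collect the distinct symbols across all gene columns.
gene_list <- unique(sort(gather(df[-1], "", "genes")$genes))
# Look up the NCBI gene ID for each symbol.
# Note: newer Ensembl releases may name this attribute "entrezgene_id"; check listAttributes(mart).
test <- getBM(attributes = c("hgnc_symbol", "entrezgene"), filters = "hgnc_symbol", values = gene_list, bmHeader = TRUE, mart = mart)
colnames(test) <- c("hgnc_symbol", "ncbi_gene_id")
test$ncbi_gene_id <- as.character(test$ncbi_gene_id)
# Recode every gene column from symbol to NCBI gene ID; symbols without a match become NA.
new_df <- df
new_df[-1] <- lapply(new_df[-1], function(x) lvls_revalue(factor(x, levels = test$hgnc_symbol), test$ncbi_gene_id))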
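The recoding step on the last line can also be written as a plain lookup with match(), which avoids factors and still works when a symbol appears more than once in the lookup table (the first match wins). This is only a sketch: `toy_df`, `toy_map`, `SET_A` and `SET_B` are made-up stand-ins so that it runs without querying Ensembl, and the symbol/ID pairs are copied from the example output in this post.

```r
# Toy stand-ins for df and the getBM() result (hypothetical names).
toy_df <- data.frame(
  V1 = c("SET_A", "SET_B"),
  V2 = c("ABCA13", "A2ML1"),
  V3 = c("ABCB5", "ABCA11P"),
  stringsAsFactors = FALSE
)
toy_map <- data.frame(
  hgnc_symbol  = c("ABCA13", "ABCB5", "A2ML1"),
  ncbi_gene_id = c("154664", "340273", "144568"),
  stringsAsFactors = FALSE
)

# Replace each symbol by its ID; symbols absent from toy_map become NA.
recoded <- toy_df
recoded[-1] <- lapply(toy_df[-1], function(x) {
  toy_map$ncbi_gene_id[match(x, toy_map$hgnc_symbol)]
})
print(recoded)
```

The first column (the gene set name) is left untouched because only `toy_df[-1]` is recoded.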
input:
> df
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 ACEVEDO_LIVER_CANCER_WITH_H3K9ME3_UP ABCA13 ABCB5 AK5 AMOTL1 ANGPTL5 ANK2 ARHGAP18 ARMC3 B3GNT2
2 ACEVEDO_METHYLATED_IN_LIVER_CANCER_DN A2ML1 ABCA11P ABCC8 ABO ACCN2 ACSL1 ACSS3 ADAD1 ADAM11
3 ACEVEDO_NORMAL_TISSUE_ADJACENT_TO_LIVER_TUMOR_DN AARS ABCC2 ABHD10 ABHD14B ABHD6 ACADSB ACAT2 ACTG1 ADH4
4 ACTGCAG_MIR173P ACACA ACTR1A ANKRD50 ARID2 ARMC8 BCORL1 BRWD3 BTF3 C20orf20
5 ACTGCCT_MIR34B ABCC1 ACACA ACSL1 ACTL6A ACTR1A ADCY2 ADORA2A AHCYL2 AKAP1
6 ACTGTAG_MIR139 AEBP2 AKIRIN2 ANK2 AP1S2 AP3M1 APLP2 ARRDC3 ATP2B2 ATRX
7 ACTGTGA_MIR27A_MIR27B ABCA1 ABCB9 ACOT11 ACVR1 ACVR2A ADAM19 ADAMTS10 ADCY3 ADCY6
8 AFP1_Q6 AARS2 ABLIM1 ACVR1B ADAM11 AFF3 AGL ANKRD28 ANKRD39 ANKS1B
9 AGGCACT_MIR5153P ARHGEF12 ARIH1 ARL4C BAZ1B BAZ2A BCOR BICD2 BTBD3 C10orf140
10 AGGTGCA_MIR500 ABCC4 ADAMTSL3 ANKRD13A ATL1 B4GALNT3 BTBD11 C17orf74 C5orf30 CA10
V11 V12
1 BMPER C1orf88
2 ADAM2 ADAM5P
3 ADH6 ADI1
4 CACNA2D4 CAMK2D
5 ALCAM ANGPTL7
6 ATXN1 AUTS2
7 ADORA2B AFAP1
8 ARHGDIB ARL3
9 C11orf87 C14orf45
10 CACNB1 CAMK4
output (Please cross check if NAs are real NAs):
> new_df
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 ACEVEDO_LIVER_CANCER_WITH_H3K9ME3_UP 154664 340273 26289 154810 253935 287 93663 219681 10678 168667 <NA>
2 ACEVEDO_METHYLATED_IN_LIVER_CANCER_DN 144568 <NA> 6833 28 <NA> 2180 79611 132612 4185 2515 <NA>
3 ACEVEDO_NORMAL_TISSUE_ADJACENT_TO_LIVER_TUMOR_DN 16 1244 55347 84836 57406 36 39 71 127 130 55256
4 ACTGCAG_MIR173P 31 10121 57182 196528 25852 63035 254065 689 <NA> 93589 817
5 ACTGCCT_MIR34B 4363 31 2180 86 10121 108 135 23382 8165 214 10218
6 ACTGTAG_MIR139 121536 55122 287 8905 26985 334 57561 491 546 6310 26053
7 ACTGTGA_MIR27A_MIR27B 19 23457 26027 90 92 8728 81794 109 112 136 60312
8 AFP1_Q6 57505 3983 91 4185 3899 178 23243 51239 56899 397 403
9 AGGCACT_MIR5153P 23365 25820 10123 9031 11176 54880 23299 22903 <NA> 399947 <NA>
10 AGGTGCA_MIR500 10257 57188 88455 51062 283358 121551 <NA> 90355 56934 782 814
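One way to cross-check the NAs is to list every symbol in the input that has no row in the lookup table returned by getBM. A minimal base R sketch follows; `toy_df` and `toy_map` are hypothetical stand-ins (with symbol/ID pairs copied from the output above) so that it runs without an Ensembl connection.

```r
# Toy stand-ins for df and the getBM() result (hypothetical names).
toy_df <- data.frame(
  V1 = c("SET_A", "SET_B"),
  V2 = c("ABCA13", "A2ML1"),
  V3 = c("ABCB5", "ABCA11P"),
  stringsAsFactors = FALSE
)
toy_map <- data.frame(
  hgnc_symbol  = c("ABCA13", "ABCB5", "A2ML1"),
  ncbi_gene_id = c("154664", "340273", "144568"),
  stringsAsFactors = FALSE
)

# All distinct symbols in the gene columns, dropping NA and empty cells.
symbols <- unique(unlist(toy_df[-1], use.names = FALSE))
symbols <- symbols[!is.na(symbols) & symbols != ""]

# Symbols with no row in the lookup table: these are the ones that turn into NA.
unmapped <- setdiff(symbols, toy_map$hgnc_symbol)
print(unmapped)
```

Symbols that show up here may be outdated or alias symbols rather than genes that truly lack an NCBI gene ID, so they are worth checking by hand.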
Merge by gene symbol with your data frame. The getBM function output is a data frame with gene symbol and NCBI gene ID. Now merge both data frames by gene symbol. If you do not know how to merge data frames, post example data here from the big data frame.
Thanks! But yeah I don't know how to merge them by gene symbol. Can you direct me to any useful webpages or packages that teach this?
My data set looks like this:
But formatted by rows, the sets are a lot longer and there's about 1600 of them.
Thanks
Could you please reformat your post so that the data frame is clear? You can look at the merge examples here: https://www.statmethods.net/management/merging.html
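For reference, the merge step can be sketched with base R's merge(). This assumes the gene sets have first been reshaped to one row per (gene set, symbol) pair; `sets_long` and `id_map` are hypothetical toy tables, not the real data.

```r
# Toy long-format table: one row per (gene set, symbol) pair (hypothetical names).
sets_long <- data.frame(
  gene_set    = c("SET_A", "SET_A", "SET_B"),
  hgnc_symbol = c("ABCA13", "ABCB5", "ABCA11P"),
  stringsAsFactors = FALSE
)
# Toy lookup table shaped like the renamed getBM() result.
id_map <- data.frame(
  hgnc_symbol  = c("ABCA13", "ABCB5"),
  ncbi_gene_id = c("154664", "340273"),
  stringsAsFactors = FALSE
)

# Left join on the symbol: all.x = TRUE keeps symbols without an ID (they get NA).
merged <- merge(sets_long, id_map, by = "hgnc_symbol", all.x = TRUE)
print(merged)
```

Without `all.x = TRUE`, merge() performs an inner join and rows whose symbol has no ID would be dropped.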
Looks like genomax reformatted it for me. Thanks genomax
For future posts: you can do this formatting by selecting the text and clicking the 101010 button. When you compose or edit a post, that button is in your toolbar.
I am assuming every row has a different number of genes?
Hi
I am new to this R and Bioinformatics game. I tried using the code posted by cpad0112 with my dataset and got the following error.
I think I have found a workaround using "listAttributes" to find alternative attributes, but I would appreciate it if somebody could confirm which attribute people use in place of hgnc_symbol, or correct any errors in my code.
Thanks
Peter
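A sketch of the attribute search described above. It assumes the data frame returned by listAttributes has a `name` column; `find_attributes` and `toy_attrs` are hypothetical, and the toy rows are illustrative rather than a real attribute listing. As far as I know, recent Ensembl releases call the NCBI ID attribute `entrezgene_id` instead of `entrezgene`, but please verify that against your own mart.

```r
# Return the rows of an attribute table whose name matches a pattern.
# `attrs` is assumed to have a `name` column, like the output of listAttributes().
find_attributes <- function(attrs, pattern) {
  attrs[grepl(pattern, attrs$name, ignore.case = TRUE), , drop = FALSE]
}

# Toy stand-in for listAttributes(mart); rows are illustrative only.
toy_attrs <- data.frame(
  name        = c("ensembl_gene_id", "hgnc_symbol", "entrezgene_id"),
  description = c("Gene stable ID", "HGNC symbol", "NCBI gene ID"),
  stringsAsFactors = FALSE
)

hits <- find_attributes(toy_attrs, "entrez")
print(hits$name)
```

With a live connection the same helper would be called as find_attributes(listAttributes(mart), "entrez"), and the name it reports would go into the attributes argument of getBM.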
Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to the original question.