Question

Gage duplicate identifiers as row names

7.4 years ago

bsp017 ▴ 50

I have a dataset with Entrez gene annotations and log fold change values under different conditions. I would like to run a gene set enrichment analysis with GAGE v2.28.0 in RStudio. However, I'm not sure how to handle duplicate row names coming from column 1 (the Entrez IDs).

I followed the 'Gene set and data preparation' vignette to make sure my data was in the correct format:

cuff.res<-read.csv("swissport_for_gage2.csv", row.names=1, check.names = F)
Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
  duplicate 'row.names' are not allowed
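
For context, the error comes from base R rather than from GAGE: a data frame rejects repeated row names, while a matrix accepts them. A minimal sketch with made-up IDs and values (the object names `ids`, `vals`, `df_try` and `m` are illustrative only):

```r
# Hypothetical Entrez IDs with one repeat, plus made-up fold changes.
ids  <- c("9126923", "9126923", "1234980")
vals <- c(0.66, -0.23, -3.81)

# A data frame refuses duplicate row names; capture the error message.
df_try <- tryCatch(
  data.frame(lfc = vals, row.names = ids),
  error = function(e) conditionMessage(e)
)

# A matrix accepts the same duplicated names without complaint.
m <- matrix(vals, ncol = 1, dimnames = list(ids, "lfc"))
```

Here `df_try` ends up holding the error text instead of a data frame, and `m` keeps both rows labelled `9126923`.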

If I take out "row.names=1" and run gage, I get the following result:

cuff.res<-read.table("swissport_for_gage2.csv", header = T, sep=",")
df1<-na.omit(cuff.res)
new_file1<-as.matrix(df1)
ref.idx=2:3
samp.idx=4:5
keggres = gage(new_file1, gsets=kg.eco.eg$kg.sets, ref = ref.idx, samp = samp.idx)
lapply(keggres, head)
    $greater
                                                      p.geomean stat.mean
    eco00010 Glycolysis / Gluconeogenesis                    NA       NaN
    eco00020 Citrate cycle (TCA cycle)                       NA       NaN
    eco00030 Pentose phosphate pathway                       NA       NaN
    eco00040 Pentose and glucuronate interconversions        NA       NaN
    eco00051 Fructose and mannose metabolism                 NA       NaN
    eco00052 Galactose metabolism                            NA       NaN

Is there a workaround for this? My input data looks like this:

entrezid    Bg_NB_NS_2  Bg_NB_NP_2  BgGq_NB_NB_2    BgGq_NB_NS_2    BgGq_NB_NP_2    BgGq_NS_NS_2
NA  1.33639 0.735912    -1.87482    -2.36335    -1.9769 -3.69974
NA  -0.2294 -0.333797   -0.574163   -1.68241    -0.873274   -1.45301
NA  -0.2294 -0.333797   -0.574163   -1.68241    -0.873274   -1.45301
NA  -0.2294 -0.333797   -0.574163   -1.68241    -0.873274   -1.45301
9126923 -0.2294 -0.333797   -0.574163   -1.68241    -0.873274   -1.45301
NA  1.46519 0.568023    -3.50016    -3.34538    -2.1212 -4.81057
NA  -0.2294 -0.333797   -0.574163   -1.68241    -0.873274   -1.45301
9126923 0.655123    0.202802    -2.62253    -2.04046    -2.21114    -2.69559
1234980 -3.81436    -3.91876    0.541314    -0.0579239  0.399745    3.75644
NA  -0.2294 -0.333797   -0.574163   -1.68241    -0.873274   -1.45301
1234980 -0.2294 -0.333797   -0.574163   -1.68241    -0.873274   -1.45301
9126923 -0.2294 -0.333797   -0.574163   -1.68241    -0.873274   -1.45301
NA  -0.2294 -0.333797   -0.574163   -1.68241    -0.873274   -1.45301
1175404 -0.2294 -0.333797   -0.574163   -1.68241    -0.873274   -1.45301
877311  -0.2294 -0.333797   -0.574163   -1.68241    -0.873274   -1.45301
NA  -3.03675    -3.14115    2.14204 2.69014 1.9918  5.72689

Thanks, James

RNA-Seq gage bacteria kegg entrezid • 1.8k views

updated 7.4 years ago by h.mon 35k • written 7.4 years ago by bsp017 ▴ 50

score 3 · Accepted Answer · 2018-02-16

7.4 years ago

h.mon 35k

The Entrez IDs should be the row names of the matrix, so that GAGE knows which gene each row corresponds to. A matrix, unlike a data frame, accepts duplicate row names:

new_file1<-as.matrix(df1)
rownames(new_file1) <- df1$entrezid
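
A slightly fuller sketch of the same idea, using only base R and a toy data frame shaped like the input in the question. The column names `cond_a` and `cond_b`, the values, and the object names are made up for illustration. Dropping the ID column and averaging duplicate IDs are assumptions added here, not something GAGE requires:

```r
# Toy data shaped like the question's input (hypothetical values).
df <- data.frame(
  entrezid = c(NA, 9126923L, 9126923L, 1234980L, 1234980L, 877311L),
  cond_a   = c(1.34, -0.23, 0.66, -3.81, -0.23, -0.23),
  cond_b   = c(0.74, -0.33, 0.20, -3.92, -0.33, -0.33)
)

# Drop rows without an Entrez ID; they cannot be matched to any gene set.
df <- df[!is.na(df$entrezid), ]

# Build the numeric matrix from the value columns only, then label the rows.
expr <- as.matrix(df[, -1])
rownames(expr) <- as.character(df$entrezid)

# Optional (an assumption, not part of the answer above): collapse
# duplicate IDs to one row per gene by averaging their values.
sums   <- rowsum(expr, group = rownames(expr))
counts <- as.vector(table(rownames(expr))[rownames(sums)])
expr_unique <- sums / counts
```

Either `expr` or `expr_unique` could then be passed to `gage()`. Because the ID column is no longer part of the matrix, the `ref` and `samp` column indices shift down by one compared with the question's `2:3` and `4:5`.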
