Custom matrix building using R from a data frame
4.6 years ago
clementpch • 0

Hi Everyone,

I want to produce a matrix with the gene IDs as row names and the transcription factor (TF) IDs as column names, where each value is the number of occurrences of binding sites for that TF assigned to the nearest gene.

So I have produced a data frame with the three required pieces of information using these commands:

# create a data frame with TF binding site occurrences
geneTF.mtx.tmp = geneTF.df %>% group_by(geneID, TFID) %>% summarise(occurency = n())

# create a data table from the previously created data frame
geneTF.mtx.tmp2 = data.table(geneID = geneTF.mtx.tmp$geneID, TfID = geneTF.mtx.tmp$TFID, occurancy = geneTF.mtx.tmp$occurency)

I obtained this kind of data table:

> head(geneTF.mtx.tmp2)
                  geneID
1: CMiso1.1chr01g0058191
2: CMiso1.1chr01g0058191
3: CMiso1.1chr01g0058191
4: CMiso1.1chr01g0058191
5: CMiso1.1chr01g0058191
6: CMiso1.1chr01g0058191
                                                       TfID occurancy
1:     ABF1(bZIP)/Arabidopsis-ABF1-ChIP-Seq(GSE80564)/Homer        24
2:    At3g60580(C2H2)/col-At3g60580-DAP-Seq(GSE60143)/Homer        17
3: At5g04390(C2H2)/col200-At5g04390-DAP-Seq(GSE60143)/Homer        19
4: AT5G60130(ABI3VP1)/col-AT5G60130-DAP-Seq(GSE60143)/Homer        24
5:             ATAF1(NAC)/col-ATAF1-DAP-Seq(GSE60143)/Homer        32
6:           AtGRF6(GRF)/col-AtGRF6-DAP-Seq(GSE60143)/Homer         3

To summarise, from this data I want to obtain a matrix with the "geneID" column of the data frame as row names, the "TfID" column as column names, and the "occurancy" column as values.

Thanks in advance for your response.

R • 2.0k views

Hi bruce.moran,

Thanks a lot for your help; your last suggestion finally worked well. I will take your advice into account, thanks.


Try this:

library(tidyr)
library(tibble)
spread(geneTF.mtx.tmp2,TfID,occurancy) %>% column_to_rownames("geneID")

But setting geneID as row names without spreading first would fail. Row names must be unique, and the geneID column has several repetitions. @clementpch

4.6 years ago
bruce.moran ▴ 970

You can use the column_to_rownames() function from the tibble package.

geneTF.df.rn <- geneTF.df %>% 
                group_by(geneID, TfID) %>% 
                summarise(occurrence = n()) %>%
                ungroup() %>%
                as.data.frame() %>%
                column_to_rownames("geneID")

Hi bruce.moran, thanks for your response. I cannot install the tibble package because I have some issues with the dependencies of the already-installed readr and dplyr, so I used the column_to_rownames() function from the textshape package (v1.7.1). I ran your code but hit an issue; see the console error message below:

> geneTF.df.rn <- geneTF.df %>% 
+                 group_by(geneTF.df$geneID, geneTF.df$TFID) %>% 
+                 summarise(occurrence = n()) %>%
+                 ungroup() %>%
+                 as.data.frame() %>%
+                 column_to_rownames("geneID")
Error in x[[i]] <- value : 
  attempt to select more than one element in integerOneIndex
>

Do you have any idea whether this is due to the function (which is not the one from the tibble package) or to my data?

Thanks


Sorry, group_by(geneTF.df$geneID, geneTF.df$TFID) should be group_by(geneID, TfID) (as per my edited answer). NB the spelling mistake on TFID. Also, you don't use the dollar notation in the tidyverse, just the unquoted column name.
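
To illustrate the point about unquoted column names, here is a minimal sketch. The `toy` data frame and its values are made up for illustration and only stand in for geneTF.df; the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data standing in for geneTF.df
toy <- data.frame(
  geneID = c("gene_0", "gene_0", "gene_1"),
  TFID   = c("TF_a", "TF_a", "TF_b")
)

# Bare column names: the grouping columns keep their original names,
# so later steps can refer to "geneID" directly
counts <- toy %>%
  group_by(geneID, TFID) %>%
  summarise(occurrence = n(), .groups = "drop")

names(counts)

# Dollar notation: the grouping columns are named after the whole
# expression instead, so a later lookup of "geneID" has nothing to find
dollar_counts <- toy %>%
  group_by(toy$geneID, toy$TFID) %>%
  summarise(occurrence = n(), .groups = "drop")

names(dollar_counts)
```

Comparing the two `names()` outputs should show why a later column_to_rownames("geneID") cannot locate the column after grouping with the dollar form.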


Hi, thanks a lot for responding. I have updated the command as you suggested. I still have an issue when I run it, I think because a single gene has several different motifs related to it.

Here is the terminal error message:

> head(geneTF.df)
                                                                        peakID
1   trimmed_ATACseq_mal4_rep4_S3_001_nucleosome_shift_37_extSize_73_peak_17581
2                                                      peak_diffBind_598125070
3                                                    peak_diffBind_598125070-2
4                                                   peak_diffBind_1237925070-2
5                                                     peak_diffBind_1237925070
6 trimmed_ATACseq_herma3_rep4_S1_001_nucleosome_shift_37_extSize_73_peak_17204
                                                TFID                geneID
1 bZIP48(bZIP)/colamp-bZIP48-DAP-Seq(GSE60143)/Homer CMiso1.1chr12g0200231
2 bZIP48(bZIP)/colamp-bZIP48-DAP-Seq(GSE60143)/Homer CMiso1.1chr12g0200231
3 bZIP48(bZIP)/colamp-bZIP48-DAP-Seq(GSE60143)/Homer CMiso1.1chr12g0200231
4 bZIP48(bZIP)/colamp-bZIP48-DAP-Seq(GSE60143)/Homer CMiso1.1chr12g0200231
5 bZIP48(bZIP)/colamp-bZIP48-DAP-Seq(GSE60143)/Homer CMiso1.1chr12g0200231
6 bZIP48(bZIP)/colamp-bZIP48-DAP-Seq(GSE60143)/Homer CMiso1.1chr12g0200231
> geneTF.df.rn <- geneTF.df %>% 
+                 group_by(geneID, TFID) %>% 
+                 summarise(occurency = n()) %>%
+                 ungroup() %>%
+                 as.data.frame() %>%
+                 column_to_rownames("geneID")
Error in `.rowNamesDF<-`(x, value = value) : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘CMiso1.1chr01g0058191’, ‘CMiso1.1chr01g0058211’, ‘CMiso1.1chr01g0058301’, ‘CMiso1.1chr01g0058341’, ‘CMiso1.1chr01g0058361’, ‘CMiso1.1chr01g0058391’, ‘CMiso1.1chr01g0058471’, ‘CMiso1.1chr01g0058531’, ‘CMiso1.1chr01g0058541’, ‘CMiso1.1chr01g0058841’, ‘CMiso1.1chr01g0058881’, ‘CMiso1.1chr01g0058891’, ‘CMiso1.1chr01g0058931’, ‘CMiso1.1chr01g0058991’, ‘CMiso1.1chr01g0059021’, ‘CMiso1.1chr01g0059071’, ‘CMiso1.1chr01g0059091’, ‘CMiso1.1chr01g0059131’, ‘CMiso1.1chr01g0059261’, ‘CMiso1.1chr01g0059281’, ‘CMiso1.1chr01g0059301’, ‘CMiso1.1chr01g0059321’, ‘CMiso1.1chr01g0059401’, ‘CMiso1.1chr01g0059591’, ‘CMiso1.1chr01g0059761’, ‘CMiso1.1chr01g0060141’, ‘CMiso1.1chr01g0060291’, ‘CMiso1.1chr01g0060341’, ‘CMiso1.1chr01g0060381’, ‘CMiso1.1chr01g0060391’, ‘CMiso1.1chr01g0060501’, ‘CMiso1.1chr01g0060771’, ‘CMiso1.1chr01g0060791’, [... truncated] 
>

As you can see here, I have roughly 2 billion gene-TF relations. I then deduplicated them and, as you can see, the number of rows differs because of what I described above.

Moreover, when I leave geneID out of the row names to get around the redundancy issue, I obtain the data frame (as expected), but the "TFID" values are not used as column names as in this example:

             TFID_X          TFID_Y
geneID_X     occurency_x     occurency_y
geneID_Y     occurency_x2    occurency_y2

Instead, I get the following output:

> geneTF.df.rn <- geneTF.df %>% 
+                 group_by(geneID, TFID) %>% 
+                 summarise(occurency = n()) %>%
+                 ungroup() %>%
+                 as.data.frame()
> head(geneTF.df.rn)
                 geneID
1 CMiso1.1chr01g0058191
2 CMiso1.1chr01g0058191
3 CMiso1.1chr01g0058191
4 CMiso1.1chr01g0058191
5 CMiso1.1chr01g0058191
6 CMiso1.1chr01g0058191
                                                      TFID occurency
1     ABF1(bZIP)/Arabidopsis-ABF1-ChIP-Seq(GSE80564)/Homer        24
2    At3g60580(C2H2)/col-At3g60580-DAP-Seq(GSE60143)/Homer        17
3 At5g04390(C2H2)/col200-At5g04390-DAP-Seq(GSE60143)/Homer        19
4 AT5G60130(ABI3VP1)/col-AT5G60130-DAP-Seq(GSE60143)/Homer        24
5             ATAF1(NAC)/col-ATAF1-DAP-Seq(GSE60143)/Homer        32
6           AtGRF6(GRF)/col-AtGRF6-DAP-Seq(GSE60143)/Homer         3

I can't find any solution to get around this redundancy; can you help me again?

Thanks in advance


Why do you need geneID as rownames? I presume this is for input to some package for analysis?

You can make potentially unique row names by concatenating geneID and TfID using:

geneTF.df %>%
group_by(geneID, TfID) %>%
summarise(occurrence = n()) %>%
ungroup() %>%
dplyr::mutate(geneID_TfID = paste0(geneID, "_", TfID)) %>%
as.data.frame() %>%
column_to_rownames("geneID_TfID")

Hi,

Because I have to produce a matrix holding the occurrence of each TF type per gene, laid out like a gene expression matrix, for example. Yes, I have to produce the two matrices required by a package.

The issue with this command is that it does not give me the geneID (without redundancy) as row names and my TFID column as column names, with the occurrence as the data value; do you see what I mean? It's necessary to have the geneID as the same as my gene expression matrix.


It's necessary to have the geneID as the same as my gene expression matrix

I don't see how you can have that and also have a gene-TfID matrix with the same row names, given that there are multiple TfIDs per gene?

If you name the package you are using, that would be helpful; usually the inputs are clearly defined, with an example to copy so you can construct them correctly.


Sorry, but I cannot give you the package because it is a custom package in development in my lab, and the file formats of the non-custom packages are not the same, so it would not be useful to you.

I think I was not sufficiently clear about exactly what I want to produce. I want to create a matrix with the gene IDs as row names and the TFIDs as column names, like this example (which I produced by hand):

            TF_name1                      TF_name2   TF_name3   ...   TF_nameX
geneName1   occurency geneName1_TFname1   ...        ...        ...   ...
geneName2   occurency geneName2_TFname1   ...        ...        ...   ...

When a gene has no occurrence of a given TF because it is not targeted by it, I want to put 0. This will produce a matrix with all my genes as row names and all my transcription factors as columns, with an occurrence value equal to zero when none is found (for example, if the geneName_TFid combination does not exist, the occurrence for this TF targeting this gene is zero).

Thanks for taking the time to look at my issue


Reproducible example dataset to start with:

library(dplyr)
library(tidyr)
library(tibble)

geneID_tb <- tibble(geneID = paste0("gene_", c(0,0,0,0,1,1,1,1,2,2,2,2)), 
                    TfID = paste0("TfID_", c(1,1,2,2,1,1,1,2,1,1,1,1)))

Then, for each gene, count the occurrences of each TfID with group_by() and summarise(), use pivot_wider() to get the format you want (a single row per gene), and bind the per-gene rows together.

# bind_rows() rather than rbind(): it pads TfID columns that a gene lacks with NA
bind_rows(lapply(unique(geneID_tb$geneID), function(f){ 
    geneID_tb %>% 
    dplyr::filter(geneID %in% f) %>% 
    group_by(geneID, TfID) %>% 
    summarise(occurrence=n()) %>% 
    pivot_wider(names_from="TfID", values_from="occurrence")
})) %>% 
    ungroup() %>%
    replace(is.na(.), 0)   # genes without a given TfID get 0
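
The per-gene loop above can probably be collapsed into a single pipeline. The following is a minimal sketch, not a tested drop-in for the real data: it assumes tidyr 1.0.0 or later (for pivot_wider() and its values_fill argument) and that the tibble package can be installed; `occ_mat` is a name made up here.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Same reproducible example as above
geneID_tb <- tibble(
  geneID = paste0("gene_", c(0,0,0,0,1,1,1,1,2,2,2,2)),
  TfID   = paste0("TfID_", c(1,1,2,2,1,1,1,2,1,1,1,1))
)

occ_mat <- geneID_tb %>%
  # one row per (geneID, TfID) pair with its count
  count(geneID, TfID, name = "occurrence") %>%
  # one row per gene, one column per TfID; missing pairs become 0
  pivot_wider(names_from = TfID, values_from = occurrence,
              values_fill = list(occurrence = 0L)) %>%
  # geneID is unique at this point, so it can safely become the row names
  column_to_rownames("geneID") %>%
  as.matrix()

occ_mat
```

Because the reshaping happens before the row names are set, each gene appears only once and the duplicate row.names error seen earlier in the thread should not arise.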

I should say that if you are developing methods, there should be someone you can ask in your group who has experience with this.

Also, when you ask questions it is important to include a reproducible example. This does two things: 1) it shows you have made the effort to build an example, and 2) while making the example you sometimes find a way to do it yourself.

> with (geneID_tb, table(geneID, TfID))
        TfID
geneID   TfID_1 TfID_2
  gene_0      2      2
  gene_1      3      1
  gene_2      4      0
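
The table() output already has the requested layout, with zeros for missing gene-TF pairs. A minimal base R sketch of turning it into a plain matrix or data frame follows; it needs no extra packages, and `geneID_df`, `occ_mat` and `occ_df` are names made up here for illustration.

```r
# Toy data mirroring the reproducible example, as a plain data frame
geneID_df <- data.frame(
  geneID = paste0("gene_", c(0,0,0,0,1,1,1,1,2,2,2,2)),
  TfID   = paste0("TfID_", c(1,1,2,2,1,1,1,2,1,1,1,1))
)

# Cross-tabulate: rows are genes, columns are TFs, absent pairs count as 0
tab <- with(geneID_df, table(geneID, TfID))

# unclass() drops the "table" class and leaves an integer matrix
occ_mat <- unclass(tab)

# If a data frame with geneID row names is needed instead
occ_df <- as.data.frame.matrix(tab)
```

This route may be convenient when tidyverse packages cannot be installed, although for a very large number of genes and TFs the dense table could use a lot of memory.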