Making such a data frame in R
6.3 years ago
Za ▴ 140

Hi,

I have a list of TFs and marker genes for each cluster of my cells.

I want a data frame like the one below. For example, if a gene is in my TF list, a 1 is placed in the TF column next to that gene, meaning it is a TF. Likewise, if a gene is in one of my marker lists, a 1 is placed next to it in that marker column; if not, a 0 is added. But this seems too difficult for me to do.

Suppose I just have 14000 genes and 3 vectors: how can I make a data frame from scratch like the one below? It should be a three-column data frame with the 14000 genes in rows and a TF column, a marker_1 column and a marker_2 column, each filled with 1 or 0 depending on whether each of the 14000 genes exists in the corresponding vector or not.

> head(genes)
                   TF use_as_marker_1 use_as_marker_2
ENSMUSG00000044719  0               1               0
ENSMUSG00000044591  0               0               0
ENSMUSG00000044712  0               0               0
ENSMUSG00000044734  0               0               1
ENSMUSG00000044726  0               0               0
ENSMUSG00000044724  1               0               0

Could you please help me to do that?

R sc-cell RNA-seq • 2.4k views

Please, could you provide a complete example of what you want to achieve? What do you have in your lists? Create a small example by hand to describe your problem and your aim.

It seems you will have to put your 2 lists in 2 vectors, then iterate over your data frame, compare the current TF to your TF vector and the current gene to your gene vector, and change the value if needed.


"I have a list of TFs and marker genes for each cluster of my cells."

It is not completely clear to me what your data looks like, and an example often helps.


Thanks a lot. Let's say I have a vector of 150 TFs, a vector of 500 marker genes for cluster 1 and a vector of 350 marker genes for cluster 2. My genome has 14000 validated gene IDs. I need a data frame like the one above, in which the 14000 genes are compared with the vector of TFs for the TF column, with the marker genes of cluster 1 for the marker_1 column, and with the marker genes of cluster 2 for the marker_2 column, so that a gene gets a 1 if it is present in the corresponding vector and a 0 if not. For example, in the data frame above ENSMUSG00000044724 is a TF but not a marker of either cluster, ENSMUSG00000044719 is a marker of cluster 1 and ENSMUSG00000044734 is a marker of cluster 2. Unfortunately I am not able to write an iteration or anything complex in R without your help.


OK, so in the end you have 3 lists (TF, marker1 and marker2), and for each row of your data frame:

  • If TF_SOURCE exists in the TF list, set the TF column to 1

  • If the index (gene name?) exists in the marker1 list, set the use_as_marker_1 column to 1

  • Same as the previous item for marker2

Correct?


if my TF column exists in TF list


If your TF does not exist in the TF list, do you still want to test marker1 and marker2, or skip them?


Actually, there is no TF defined in the data frame; please ignore the TF_SOURCE column.

6.3 years ago
###Recreate your dataframe
df <- data.frame(row.names = c("ENSMUSG00000044719","ENSMUSG00000044591","ENSMUSG00000044712","ENSMUSG00000044734","ENSMUSG00000044726","ENSMUSG00000044724"), "SYMBOL" = c("E230025N22Rik","AC112933.1","Slc38a6","Serpinb1a","BC030476","Gpr152"), "TF" = c(0,0,0,0,0,0), "TF_SOURCE" = c("genomatix_20140512","genomatix_20140512","genomatix_20140512","genomatix_20140513","genomatix_20140513","genomatix_20140513"), "use_as_marker_1" = c(0,0,0,0,0,0), "use_as_marker_2" = c(0,0,0,0,0,0))

TF_list=c("genomatix_20140512")

marker1_list=c("ENSMUSG00000044591","ENSMUSG00000044734","ENSMUSG00000044726")

marker2_list=c("ENSMUSG00000044591","ENSMUSG00000044712","ENSMUSG00000044724")

### Iterate over the rows of your data frame (from 1 to nrow(df), which is 6 here)
### In R you can address a cell by its column name ($column) and row index ([i])
### %in% tests whether the value on the left exists in the vector on the right
for (i in 1:nrow(df)){
    if (df$TF_SOURCE[i] %in% TF_list){
        df$TF[i]=1
    }
    if (row.names(df)[i] %in% marker1_list){
        df$use_as_marker_1[i]=1
    }
    if (row.names(df)[i] %in% marker2_list){
        df$use_as_marker_2[i]=1
    }
}
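
The same update can also be written without a loop, because %in% works on whole vectors at once. The block below is only a minimal sketch on a cut-down toy data frame (three genes, made-up membership), not a replacement for checking the result on your real data:

```r
# Toy data frame mirroring the example above (membership is illustrative only).
df <- data.frame(
  row.names = c("ENSMUSG00000044719", "ENSMUSG00000044591", "ENSMUSG00000044712"),
  TF_SOURCE = c("genomatix_20140512", "genomatix_20140512", "genomatix_20140513"),
  stringsAsFactors = FALSE
)
TF_list      <- c("genomatix_20140512")
marker1_list <- c("ENSMUSG00000044591")
marker2_list <- c("ENSMUSG00000044712")

# %in% returns one TRUE/FALSE per element on its left-hand side;
# as.integer() converts TRUE/FALSE to 1/0.
df$TF              <- as.integer(df$TF_SOURCE %in% TF_list)
df$use_as_marker_1 <- as.integer(rownames(df) %in% marker1_list)
df$use_as_marker_2 <- as.integer(rownames(df) %in% marker2_list)
```

Each column is filled in a single assignment, so there is no row index to get wrong.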

EDIT: Without an existing data frame

genes = c("ENSMUSG00000044719","ENSMUSG00000044591","ENSMUSG00000044712","ENSMUSG00000044734","ENSMUSG00000044726","ENSMUSG00000044724")

df <- data.frame(matrix(ncol = 3, nrow = length(genes)))
row.names(df) <- genes
colnames(df) <- c("TF", "use_as_marker_1", "use_as_marker_2")

TF_list=c("ENSMUSG00000044719","ENSMUSG00000044591","ENSMUSG00000044712")

marker1_list=c("ENSMUSG00000044591","ENSMUSG00000044734","ENSMUSG00000044726")

marker2_list=c("ENSMUSG00000044591","ENSMUSG00000044712","ENSMUSG00000044724")

for (i in 1:nrow(df)){
    if (row.names(df)[i] %in% TF_list){
        df$TF[i]=1
    }else{
        df$TF[i]=0
    }
    if (row.names(df)[i] %in% marker1_list){
        df$use_as_marker_1[i]=1
    }else{
        df$use_as_marker_1[i]=0
    }
    if (row.names(df)[i] %in% marker2_list){
        df$use_as_marker_2[i]=1
    }else{
        df$use_as_marker_2[i]=0
    }
}
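
As an alternative to the loop, the whole data frame can probably be built in one call, since %in% is vectorized. This is a sketch using the same six genes and the same three example vectors as above; the name gene_flags is just a placeholder chosen for illustration:

```r
genes <- c("ENSMUSG00000044719", "ENSMUSG00000044591", "ENSMUSG00000044712",
           "ENSMUSG00000044734", "ENSMUSG00000044726", "ENSMUSG00000044724")

TF_list      <- c("ENSMUSG00000044719", "ENSMUSG00000044591", "ENSMUSG00000044712")
marker1_list <- c("ENSMUSG00000044591", "ENSMUSG00000044734", "ENSMUSG00000044726")
marker2_list <- c("ENSMUSG00000044591", "ENSMUSG00000044712", "ENSMUSG00000044724")

# One row per gene; each column is 1 if the gene is in the vector, else 0.
gene_flags <- data.frame(
  TF              = as.integer(genes %in% TF_list),
  use_as_marker_1 = as.integer(genes %in% marker1_list),
  use_as_marker_2 = as.integer(genes %in% marker2_list),
  row.names       = genes
)
```

With 14000 gene IDs in genes and the real TF and marker vectors, the same call should give the 14000-row table directly; head(gene_flags) shows the first rows.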

Thank you, I did that, but it returns nothing; nothing changes in the data frame.

Sorry, suppose there is no pre-existing data frame (df) and I just have 14000 genes and 3 vectors: how can I make such a data frame from scratch? It should be a three-column data frame with the 14000 genes in rows and a TF column, a marker_1 column and a marker_2 column, each filled with 1 or 0 depending on whether each of the 14000 genes exists in the corresponding vector or not.


Try this example first. If it works well, try with your own dataframe


I edited my answer so that it no longer needs a pre-existing data frame and starts from a list of genes.


Sorry,

I have this data

> head(df)
     Cell SAMPLE CLUSTER
s1.1 s1.1      1      NA
s1.2 s1.2      1      NA
s1.3 s1.3      1      NA
s1.4 s1.4      1      NA
s1.5 s1.5      1      NA
s1.6 s1.6      1      NA
> tail(df)
         Cell SAMPLE CLUSTER
s2.148 s2.148      2      NA
s2.149 s2.149      2      NA
s2.150 s2.150      2      NA
s2.151 s2.151      2      NA
s2.152 s2.152      2      NA
s2.153 s2.153      2      NA

Sample 1 has three clusters of cells, for example

> head(t(ident0_16))
     [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]    [,9]    [,10]   [,11]   [,12]   [,13]   [,14]   [,15]   [,16]   [,17]   [,18]   [,19]   [,20]   [,21]   [,22]   [,23]  
[1,] "s1.1" "s1.2" "s1.3" "s1.6" "s1.7" "s1.8" "s1.9" "s1.10" "s1.11" "s1.16" "s1.17" "s1.18" "s1.20" "s1.21" "s1.23" "s1.24" "s1.25" "s1.26" "s1.28" "s1.29" "s1.34" "s1.37" "s1.38"
     [,24]   [,25]   [,26]   [,27]   [,28]   [,29]   [,30]   [,31]   [,32]   [,33]   [,34]   [,35]   [,36]   [,37]   [,38]   [,39]   [,40]   [,41]   [,42]   [,43]   [,44]   [,45]  
[1,] "s1.39" "s1.41" "s1.42" "s1.43" "s1.44" "s1.48" "s1.51" "s1.56" "s1.57" "s1.58" "s1.59" "s1.61" "s1.65" "s1.73" "s1.77" "s1.81" "s1.82" "s1.84" "s1.86" "s1.93" "s1.94" "s1.95"
     [,46]   [,47]   [,48]    [,49]    [,50]    [,51]    [,52]    [,53]    [,54]    [,55]    [,56]    [,57]    [,58]    [,59]    [,60]    [,61]    [,62]    [,63]    [,64]    [,65]   
[1,] "s1.97" "s1.99" "s1.100" "s1.102" "s1.107" "s1.111" "s1.112" "s1.120" "s1.121" "s1.122" "s1.123" "s1.126" "s1.127" "s1.128" "s1.131" "s1.134" "s1.135" "s1.137" "s1.138" "s1.139"
     [,66]    [,67]    [,68]    [,69]    [,70]    [,71]    [,72]    [,73]    [,74]    [,75]    [,76]    [,77]    [,78]    [,79]    [,80]    [,81]    [,82]    [,83]    [,84]   
[1,] "s1.145" "s1.150" "s1.151" "s1.154" "s1.158" "s1.159" "s1.160" "s1.161" "s1.163" "s1.164" "s1.168" "s1.173" "s1.175" "s1.180" "s1.181" "s1.186" "s1.187" "s1.190" "s1.192"
     [,85]    [,86]    [,87]    [,88]    [,89]    [,90]    [,91]    [,92]    [,93]    [,94]    [,95]   
[1,] "s1.193" "s1.194" "s1.195" "s1.197" "s1.198" "s1.199" "s1.201" "s1.202" "s1.206" "s1.208" "s1.209"
>

and, sample 2 has 2 clusters of cells

I need something in thirty column to put the cluster of each cell in front of that like

> head(cells)
       CELL SAMPLE CLUSTER
S1.1   S1.1      1       7
S1.10 S1.10      1       5
S1.11 S1.11      1       9
S1.15 S1.15      1       1
S1.16 S1.16      1       5
S1.17 S1.17      1       5

I tried to modify your function to do that but I failed

for (i in 1:nrow(df)){
  if (row.names(df)[i] %in% ident0_16){
    df$CLUSTER[i]=0
  }elseif{
  if (row.names(df)[i] %in% ident1_16){
    df$CLUSTER[i]=1
  }elseif{

  if (row.names(df)[i] %in% ident2_16){
    df$ident2_16[i]=2

  }
  }
  }
  }

How can I do that please?


Too many questions here; I can't figure out what you are trying to do.

"Sample 1 has three clusters of cells"

What are their names? (ident0_16, ident1_16, ident2_16)?

"I need something in thirty column"

Thirty, as in 30 columns? Maybe you mean the third column.

"in front of that"

What is "that", a row?

Why do you have a row with cluster = 9 if you only have a total of 5 clusters (sample 1 has 3 and sample 2 has 2)?

Why do you show me head(cells) when your script modifies df?

df$ident2_16[i]=2

ident2_16 is not a column in df.

Please indent your code (use 4 spaces).


Sorry, head(cells) and the row with cluster = 9 come from the original publication, whose format I am trying to reproduce with my own data.

Sample 1 has 3 clusters, named 0, 1 and 2, and sample 2 has 2 clusters, named 0 and 1. I extracted from the Seurat object which cells are assigned to each cluster, and stored the cell names in ident0_16, ident1_16 and ident2_16 for sample 1 and in ident0_14 and ident1_14 for sample 2. In the third column of df I want to put 0 if a cell (for example s1.1) is in cluster 0, 1 if it is in cluster 1, and 2 if it is in cluster 2. 0, 1 and 2 are the cluster names, so the third column tells R which cluster each cell belongs to.


You have to be sure that a cell cannot be in multiple clusters.

for (i in 1:nrow(df)){
    if ((row.names(df)[i] %in% ident0_16) | (row.names(df)[i] %in% ident0_14)){
        df$CLUSTER[i]=0
    }else if((row.names(df)[i] %in% ident1_16) | (row.names(df)[i] %in% ident1_14)){
        df$CLUSTER[i]=1
    }else if(row.names(df)[i] %in% ident2_16){
        df$CLUSTER[i]=2
    }
}
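
A loop-free variant of the same idea, which also checks that no cell appears in more than one cluster, might look like the sketch below. The cell names and the objects members and cluster_of are hypothetical stand-ins; with real data, ident0_16, ident1_16 and ident2_16 would come from the Seurat object:

```r
# Hypothetical stand-ins for the cell-name vectors extracted from Seurat.
ident0_16 <- c("s1.1", "s1.2", "s1.3")
ident1_16 <- c("s1.4", "s1.5")
ident2_16 <- c("s1.6")

# Toy data frame; s1.7 is deliberately not assigned to any cluster.
cells <- paste0("s1.", 1:7)
df <- data.frame(CELL = cells, CLUSTER = NA_integer_, row.names = cells,
                 stringsAsFactors = FALSE)

# One long vector of cell names, plus the cluster label for each entry.
members    <- c(ident0_16, ident1_16, ident2_16)
cluster_of <- rep(c(0L, 1L, 2L),
                  times = c(length(ident0_16), length(ident1_16), length(ident2_16)))

# Guard: stop if any cell is listed in more than one cluster.
stopifnot(anyDuplicated(members) == 0)

# match() gives the position of each row name in `members` (NA if absent),
# so indexing `cluster_of` with it yields the cluster label or NA.
df$CLUSTER <- cluster_of[match(rownames(df), members)]
```

Cells that belong to no cluster keep NA. To cover sample 2 as well, ident0_14 and ident1_14 would be appended to members with labels 0 and 1 in the same way.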

Sorry, I first tried with just one sample, but it returned an error:

   > for (i in 1:nrow(df)){
+   if (row.names(df)[i] %in% ident0_16{
Error: unexpected '{' in:
"for (i in 1:nrow(df)){
  if (row.names(df)[i] %in% ident0_16{"
>     df$CLUSTER[i]=0
Error in df$CLUSTER[i] = 0 : object 'i' not found
>   }else if(row.names(df)[i] %in% ident1_16{
Error: unexpected '}' in "  }"
>     df$CLUSTER[i]=1
Error in df$CLUSTER[i] = 1 : object 'i' not found
>   }else if(row.names(df)[i] %in% ident2_16){
Error: unexpected '}' in "  }"
>     df$CLUSTER[i]=2
Error in df$CLUSTER[i] = 2 : object 'i' not found
>   }
Error: unexpected '}' in "  }"
> }
Error: unexpected '}' in "}"
> 
> head(df)
     CELL CLUSTER
s1.1 s1.1      NA
s1.2 s1.2      NA
s1.3 s1.3      NA
s1.4 s1.4      NA
s1.5 s1.5      NA
s1.6 s1.6      NA
>

R raised a syntax error because of a missing parenthesis:

for (i in 1:nrow(df)){

if (row.names(df)[i] %in% ident0_16{

Error: unexpected '{' in:

A ) is missing after ident0_16.

I updated my script.


Sorry, I tried with just ident0_16, ident1_16 and ident2_16, but I get the same error:

> head(df)
     CELL CLUSTER
s1.1 s1.1      NA
s1.2 s1.2      NA
s1.3 s1.3      NA
s1.4 s1.4      NA
s1.5 s1.5      NA
s1.6 s1.6      NA
> tail(df)
         CELL CLUSTER
s1.204 s1.204      NA
s1.205 s1.205      NA
s1.206 s1.206      NA
s1.207 s1.207      NA
s1.208 s1.208      NA
s1.209 s1.209      NA
> dim(df)
[1] 209   2
> length(ident0_16)
[1] 95
> length(ident1_16)
[1] 75
> length(ident2_16)
[1] 39
> for (i in 1:nrow(df)){
+     if ((row.names(df)[i] %in% ident0_16){
Error: unexpected '{' in:
"for (i in 1:nrow(df)){
    if ((row.names(df)[i] %in% ident0_16){"
>         df$CLUSTER[i]=0
Error in df$CLUSTER[i] = 0 : object 'i' not found
>     }else if((row.names(df)[i] %in% ident1_16){
Error: unexpected '}' in "    }"
>         df$CLUSTER[i]=1
Error in df$CLUSTER[i] = 1 : object 'i' not found
>     }else if(row.names(df)[i] %in% ident2_16){
Error: unexpected '}' in "    }"
>         df$CLUSTER[i]=2
Error in df$CLUSTER[i] = 2 : object 'i' not found
>     }
Error: unexpected '}' in "    }"
> }
Error: unexpected '}' in "}"
>

When you "just try" with ident0_16, ident1_16 and ident2_16, you remove the sample 2 checks (ident0_14, ident1_14) from the if statements.

But you also removed a closing parenthesis.

if ((row.names(df)[i] %in% ident0_16){

A closing parenthesis is missing at the end of the if condition. Either add one there, or delete one of the two opening parentheses at the beginning (the extra pair is no longer needed once the sample 2 check is removed).
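
For reference, a single-sample version of the loop with balanced parentheses could look like this. It is a sketch on made-up cell names, so the three ident vectors and the four-row df are placeholders for the real objects:

```r
# Placeholder cell-name vectors and data frame for one sample.
ident0_16 <- c("s1.1", "s1.2")
ident1_16 <- c("s1.3")
ident2_16 <- c("s1.4")
cells <- c("s1.1", "s1.2", "s1.3", "s1.4")
df <- data.frame(CELL = cells, CLUSTER = NA, row.names = cells,
                 stringsAsFactors = FALSE)

# Every condition has exactly one opening and one closing parenthesis.
for (i in seq_len(nrow(df))) {
    if (row.names(df)[i] %in% ident0_16) {
        df$CLUSTER[i] <- 0
    } else if (row.names(df)[i] %in% ident1_16) {
        df$CLUSTER[i] <- 1
    } else if (row.names(df)[i] %in% ident2_16) {
        df$CLUSTER[i] <- 2
    }
}
```

If R still reports "unexpected '{'", counting the ( and ) on the line named in the error message is usually the quickest check.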
