Question

Subsetting a wide dataframe in R with respect to common "trends" in values

5.8 years ago

rekren ▴ 40

Hello,

I am not good with loops in R and have a challenging dataset to subset.
The dataframe has dimensions 17 x 18000.
The first 200 columns hold binary categorical values, and the remaining columns hold positive numerical values.

A representative dataframe is below:

View(df)

           Drug_1  Drug_2    . . .  Drug_200    Gene_1      Gene_2    . . .  Gene_17800 
Cell_1          1       1    . . .     1      3.410109     2.698543   . . .   2.991730 
Cell_2          0       1    . . .     1      6.190569     2.785505   . . .   2.893962 
Cell_3          1       1    . . .     0      5.503953     2.614325   . . .   2.787185 
Cell_4          1       1    . . .     1      3.314800     2.685167   . . .   3.746460 
Cell_5          0       1    . . .     1      3.702378     2.663557   . . .   5.541395 
Cell_6          1       1    . . .     1      6.623338     2.623761   . . .   2.892601 
Cell_7          0       0    . . .     1      3.855267     2.685530   . . .   2.879253 
Cell_8          1       1    . . .     1      3.813186     2.741521   . . .   7.204914 
Cell_9          1       1    . . .     0      4.010305     2.619892   . . .   2.930020 
Cell_10         0       1    . . .     1      3.769854     2.831024   . . .   4.495060 
Cell_11         0       1    . . .     0      4.325175     2.795230   . . .   3.181098 
Cell_12         1       1    . . .     1      5.502184     2.691975   . . .   2.928878 
Cell_13         1       0    . . .     1      5.711048     2.649376   . . .   2.897740 
Cell_14         1       1    . . .     1      3.990681     2.719580   . . .   2.934628 
Cell_15         1       0    . . .     1      5.650302     2.843495   . . .   3.025947 
Cell_16         1       1    . . .     1      3.250378     2.498467   . . .   6.397197 
Cell_17         1       1    . . .     1      5.366431     2.853150   . . .   5.033118

I want to explain the drug responses of cells (1 or 0) to each drug by their respective gene expression levels (high or low) via logistic regression models. However, as a first step I have to select features (genes, in my case). The structure of my data is too complex for common feature selection approaches.
To manually pick the features that induce contrasting responses for each of the 200 drugs, I planned to write a nested loop over the drugs and, for each drug, subset the genes that are differentially expressed between the cells giving opposite responses.

To illustrate: I want to subset the genes that have different values (higher or lower) in the cells giving a 0 response compared to the cells giving a 1 response, and I aim to do this for all 200 drugs in a loop.

I hope I have explained my problem clearly. Can you help me set up a working loop, please?
Thanks in advance.

R nested-loop subset • 1.2k views

ADD COMMENT • link updated 5.8 years ago by zx8754 12k • written 5.8 years ago by rekren ▴ 40

score 2 · Accepted Answer · 2019-07-25

5.8 years ago

zx8754 12k

Here is a start, a double-loop example:

# example data
df <- mtcars[1:8]
colnames(df)[1:4] <- paste0("drug_", 1:4)
colnames(df)[5:8] <- paste0("gene_", 1:4)

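# double loop: outer over drugs (response), inner over genes (predictor)
sapply(colnames(df)[1:4], function(drug){
  sapply(colnames(df)[5:8], function(gene){
    # fit drug ~ gene and keep the second coefficient, i.e. the slope for the gene
    coef(
      lm(formula(paste(drug, gene, sep = "~")),
         data = df[, c(drug, gene)])
      )[ 2 ]
    })
  })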

# output
#                  drug_1     drug_2     drug_3    drug_4
# gene_1.gene_1  7.678233 -2.3379172 -164.62780 -57.54523
# gene_2.gene_2 -5.344472  1.4282442  112.47814  46.16005
# gene_3.gene_3  1.412125 -0.5909041  -30.08039 -27.17368
# gene_4.gene_4  7.940476 -2.8730159 -174.69286 -98.36508
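
Since the question asks for logistic regression on a 0/1 response, the same double loop can be adapted by swapping lm() for glm() with family = binomial. This is only a minimal sketch under assumptions: the example data below is rebuilt from mtcars so that the two "drug" columns (vs and am) are actually binary, and the drug_/gene_ column names and the grep() patterns are illustrative, not taken from the real dataset.

```r
# example data: two binary "drug" responses and four numeric "gene" columns
# (assumption: mtcars stands in for the real cell x drug/gene table)
df <- mtcars[, c("vs", "am", "mpg", "hp", "wt", "drat")]
colnames(df) <- c(paste0("drug_", 1:2), paste0("gene_", 1:4))

# pick columns by name prefix instead of by position
drugs <- grep("^drug_", colnames(df), value = TRUE)
genes <- grep("^gene_", colnames(df), value = TRUE)

# one logistic regression per drug/gene pair;
# keep the slope, i.e. the change in log-odds of response per unit of expression
res <- sapply(drugs, function(drug) {
  sapply(genes, function(gene) {
    fit <- glm(reformulate(gene, response = drug),
               data = df, family = binomial)
    unname(coef(fit)[2])
  })
})

# rows are genes, columns are drugs
res
```

With only 17 cells per model, glm() may warn about fitted probabilities of 0 or 1 when a gene separates the two response groups perfectly, so the slopes from such fits should be read with caution.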

ADD COMMENT • link 5.8 years ago by zx8754 12k

Thanks a lot for your suggestion.

I followed your guidance and ran the modified loop, which took around 8 hours of computation on an i7 processor with 16 GB of RAM.

The output matrix has values in the range [-1, 1]. I interpret a value closer to 1 as a positive effect on the response (I presume these are beta values) and a value closer to -1 as a negative effect. Or should I treat these values like p-values?

ADD REPLY • link 5.8 years ago by rekren ▴ 40

Read about lm() output here:

Interpretation of R's lm() output

As far as I can see, this post answers your original question ("how to do it"), and the link above should be enough to answer "how to interpret the results".
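
As a small illustration of the difference between the two quantities: coef(fit)[2] returns the slope (the beta estimate), while the p-value is a separate column of the coefficient table returned by summary(). The sketch below reuses the mtcars-based example data from the answer; the single drug_1 ~ gene_2 pair is picked only for illustration.

```r
# same example data as in the answer (built from mtcars)
df <- mtcars[1:8]
colnames(df)[1:4] <- paste0("drug_", 1:4)
colnames(df)[5:8] <- paste0("gene_", 1:4)

fit <- lm(drug_1 ~ gene_2, data = df)

# coefficient table: one row per term, columns
# "Estimate", "Std. Error", "t value", "Pr(>|t|)"
coefs <- summary(fit)$coefficients

slope   <- coefs["gene_2", "Estimate"]   # beta: effect size and direction
p_value <- coefs["gene_2", "Pr(>|t|)"]   # p-value for that slope

slope
p_value
```

So the values collected by the double loop are slopes, not p-values; to screen genes on significance, the "Pr(>|t|)" entry would have to be collected instead of, or alongside, the estimate.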

ADD REPLY • link 5.8 years ago by zx8754 12k