Question

How to create matrix from row data

1

Entering edit mode

3.6 years ago

Info.shi ▴ 30

I want to create a matrix file for thousands of files I have data in the row in each file.

file-

                      2-1            3-1            3-2              4-1             4-2            4-3
     GeneA           0.514          0.535          0.436             0.530          0.388          0.418

output should be-

[1     2     3     4]

[1]        
[2]  0.514       
[3]  0.535 0.436       
[4]  0.530 0.388 0.418

Thank you so much!

sed perl R awk • 1.3k views

ADD COMMENT • link updated 3.6 years ago by cpad0112 21k • written 3.6 years ago by Info.shi ▴ 30

1

Entering edit mode

Explain a little bit more. You have thousands of files just with 1 line and you want to put them together? If so, a simple cat command will do the trick.

ADD REPLY • link 3.6 years ago by jordi.planells ▴ 480

0

Entering edit mode

Thank you so much for your reply I am sorry for confusing question. I already have file in gene by gene I need to create matrix for each row data. I have thousand of gene data. so I need to create matrix file for each. I split my file so that I can create matrix file for each.

Thank you so much!

ADD REPLY • link 3.6 years ago by Info.shi ▴ 30

score 2 · Answer 1 · 2021-04-26

2

Entering edit mode

3.6 years ago

cpad0112 21k

If all the files have similar, simple format as explained above, you can try following:

 $ cat test.txt                                                                                                                                                        
    2-1 3-1 3-2 4-1 4-2 4-3 7-3
GeneA   0.514   0.535   0.436   0.530   0.388   0.418   10.2

$ sed 's/-[0-9]*\t*/\t/g' test.txt | datamash transpose -s --no | datamash -s --narm --header-in groupby 1 collapse 2  | sed '/^\s*$/d;s/,/\t/g' | awk '{while(++i < $1)print i}1'

1
2   0.514
3   0.535   0.436
4   0.530   0.388   0.418
5
6
7   10.2

I modified input a little bit to check if the code fills missing series.

datamash is available in most of the *buntu/debian repos.

ADD COMMENT • link 3.6 years ago by cpad0112 21k

1

Entering edit mode

input (reused from zx8754) and libraries:

d <- read.table(text ="
 2-1 3-1 3-2 4-1 4-2 4-3
GeneA 0.514 0.535 0.436 0.530 0.388 0.418
GeneB 0.111 0.222 0.333 0.444 0.555 0.666", 
                check.names = FALSE)

library(dplyr)
library(tidyr)
library(tibble)
library(stringr)
library (splitstackshape)

R solution with out list names:

> d %>%
+     rownames_to_column("genes") %>% 
+     pivot_longer(-genes,names_to = "k", values_to = "v") %>%
+     mutate(k=str_split_fixed(k,"-",2)[,1],
+            k=as.integer(k)) %>% 
+     group_by(genes) %>%
+     group_map(~{
+         tibble(complete(k = 1:max(k), fill = list(vol = 0),data=.x) %>%
+             group_by(k) %>% 
+             mutate(v=paste(v, collapse = ",")) %>% 
+             distinct() %>% 
+             ungroup() %>% 
+             cSplit (.,"v")
+             )
+     })

[[1]]
# A tibble: 4 x 4
      k    v_1    v_2    v_3
  <int>  <dbl>  <dbl>  <dbl>
1     1 NA     NA     NA    
2     2  0.514 NA     NA    
3     3  0.535  0.436 NA    
4     4  0.53   0.388  0.418

[[2]]
# A tibble: 4 x 4
      k    v_1    v_2    v_3
  <int>  <dbl>  <dbl>  <dbl>
1     1 NA     NA     NA    
2     2  0.111 NA     NA    
3     3  0.222  0.333 NA    
4     4  0.444  0.555  0.666

None of the group ops in dplyr (probably tidyverse) retain names of the lists, which is a pain. Here is another solution to retain names:

> d %>%
+     rownames_to_column("genes") %>%
+     pivot_longer(-genes, names_to = "k", values_to = "v") %>%
+     mutate(k = str_split_fixed(k, "-", 2)[, 1],
+            k = as.integer(k)) %>%
+     group_by(genes) %>%
+     complete(k = 1:max(k), fill = list(vol = 0)) %>%
+     group_by(genes, k) %>%
+     mutate(v = paste(v, collapse = ",")) %>%
+     distinct() %>% 
+     ungroup() %>% 
+     cSplit(.,"v") %>% 
+     split(.$genes) %>% 
+     map(., ~ (data=.x %>% select(-genes)))
$GeneA
   k   v_1   v_2   v_3
1: 1    NA    NA    NA
2: 2 0.514    NA    NA
3: 3 0.535 0.436    NA
4: 4 0.530 0.388 0.418

$GeneB
   k   v_1   v_2   v_3
1: 1    NA    NA    NA
2: 2 0.111    NA    NA
3: 3 0.222 0.333    NA
4: 4 0.444 0.555 0.666

Your expected solution has a column problem. Since maximum columns in OP data is 3 columns, 4 columns in ouput was okay. If there are more than 3 columns (i.e for eg. 3.7, 3.8, 4.10), expected column number would be (4 in this case) would be incorrect. Number of columns should be equivalent to number of values, IMO.

ADD REPLY • link 3.6 years ago by cpad0112 21k

score 1 · Answer 2 · 2021-04-26

Using R, data.table reshaping twice:

# example input for two genes, assuming every row has the same number of columns
d <- read.table(text ="
 2-1 3-1 3-2 4-1 4-2 4-3
GeneA 0.514 0.535 0.436 0.530 0.388 0.418
GeneB 0.111 0.222 0.333 0.444 0.555 0.666", 
check.names = FALSE)

library(data.table)

# keep the gene names
d$Gene <- rownames(d)
setDT(d)

# reshape wide-to-long 
d <- melt(d, id.vars = "Gene")

# split on "-", apply factor levels for long-to-wide reshape with "fill"
d[, c("c1", "c2") := tstrsplit(variable, split = "-", fixed = TRUE)]
d[, c("c1", "c2") := lapply(.SD, factor, levels = 1:max(c(c1, c2))), .SDcols = c("c1", "c2") ]
d <- dcast(d, Gene + c1 ~ c2, drop = FALSE)

# split on gene names and convert to matrix
lapply(split(d[, -(1:2)], d$Gene), as.matrix)

# $GeneA
#          1     2     3  4
# [1,]    NA    NA    NA NA
# [2,] 0.514    NA    NA NA
# [3,] 0.535 0.436    NA NA
# [4,] 0.530 0.388 0.418 NA
# 
# $GeneB
#          1     2     3  4
# [1,]    NA    NA    NA NA
# [2,] 0.111    NA    NA NA
# [3,] 0.222 0.333    NA NA
# [4,] 0.444 0.555 0.666 NA