How to create matrix from row data
2
1
Entering edit mode
3.7 years ago
Info.shi ▴ 30

I want to create a matrix file for thousands of files I have data in the row in each file.

file-

                      2-1            3-1            3-2              4-1             4-2            4-3
     GeneA           0.514          0.535          0.436             0.530          0.388          0.418

output should be-

[1     2     3     4]

[1]        
[2]  0.514       
[3]  0.535 0.436       
[4]  0.530 0.388 0.418    

Thank you so much!

sed perl R awk • 1.3k views
ADD COMMENT
1
Entering edit mode

Explain a little bit more. You have thousands of files just with 1 line and you want to put them together? If so, a simple cat command will do the trick.

ADD REPLY
0
Entering edit mode

Thank you so much for your reply I am sorry for confusing question. I already have file in gene by gene I need to create matrix for each row data. I have thousand of gene data. so I need to create matrix file for each. I split my file so that I can create matrix file for each.

Thank you so much!

ADD REPLY
2
Entering edit mode
3.7 years ago

If all the files have similar, simple format as explained above, you can try following:

 $ cat test.txt                                                                                                                                                        
    2-1 3-1 3-2 4-1 4-2 4-3 7-3
GeneA   0.514   0.535   0.436   0.530   0.388   0.418   10.2

$ sed 's/-[0-9]*\t*/\t/g' test.txt | datamash transpose -s --no | datamash -s --narm --header-in groupby 1 collapse 2  | sed '/^\s*$/d;s/,/\t/g' | awk '{while(++i < $1)print i}1'

1
2   0.514
3   0.535   0.436
4   0.530   0.388   0.418
5
6
7   10.2

I modified input a little bit to check if the code fills missing series.

datamash is available in most of the *buntu/debian repos.

ADD COMMENT
1
Entering edit mode

input (reused from zx8754) and libraries:

d <- read.table(text ="
 2-1 3-1 3-2 4-1 4-2 4-3
GeneA 0.514 0.535 0.436 0.530 0.388 0.418
GeneB 0.111 0.222 0.333 0.444 0.555 0.666", 
                check.names = FALSE)

library(dplyr)
library(tidyr)
library(tibble)
library(stringr)
library (splitstackshape)

R solution with out list names:

> d %>%
+     rownames_to_column("genes") %>% 
+     pivot_longer(-genes,names_to = "k", values_to = "v") %>%
+     mutate(k=str_split_fixed(k,"-",2)[,1],
+            k=as.integer(k)) %>% 
+     group_by(genes) %>%
+     group_map(~{
+         tibble(complete(k = 1:max(k), fill = list(vol = 0),data=.x) %>%
+             group_by(k) %>% 
+             mutate(v=paste(v, collapse = ",")) %>% 
+             distinct() %>% 
+             ungroup() %>% 
+             cSplit (.,"v")
+             )
+     })

[[1]]
# A tibble: 4 x 4
      k    v_1    v_2    v_3
  <int>  <dbl>  <dbl>  <dbl>
1     1 NA     NA     NA    
2     2  0.514 NA     NA    
3     3  0.535  0.436 NA    
4     4  0.53   0.388  0.418

[[2]]
# A tibble: 4 x 4
      k    v_1    v_2    v_3
  <int>  <dbl>  <dbl>  <dbl>
1     1 NA     NA     NA    
2     2  0.111 NA     NA    
3     3  0.222  0.333 NA    
4     4  0.444  0.555  0.666

None of the group ops in dplyr (probably tidyverse) retain names of the lists, which is a pain. Here is another solution to retain names:

> d %>%
+     rownames_to_column("genes") %>%
+     pivot_longer(-genes, names_to = "k", values_to = "v") %>%
+     mutate(k = str_split_fixed(k, "-", 2)[, 1],
+            k = as.integer(k)) %>%
+     group_by(genes) %>%
+     complete(k = 1:max(k), fill = list(vol = 0)) %>%
+     group_by(genes, k) %>%
+     mutate(v = paste(v, collapse = ",")) %>%
+     distinct() %>% 
+     ungroup() %>% 
+     cSplit(.,"v") %>% 
+     split(.$genes) %>% 
+     map(., ~ (data=.x %>% select(-genes)))
$GeneA
   k   v_1   v_2   v_3
1: 1    NA    NA    NA
2: 2 0.514    NA    NA
3: 3 0.535 0.436    NA
4: 4 0.530 0.388 0.418

$GeneB
   k   v_1   v_2   v_3
1: 1    NA    NA    NA
2: 2 0.111    NA    NA
3: 3 0.222 0.333    NA
4: 4 0.444 0.555 0.666

Your expected solution has a column problem. Since maximum columns in OP data is 3 columns, 4 columns in ouput was okay. If there are more than 3 columns (i.e for eg. 3.7, 3.8, 4.10), expected column number would be (4 in this case) would be incorrect. Number of columns should be equivalent to number of values, IMO.

ADD REPLY
1
Entering edit mode
3.7 years ago
zx8754 12k

Using R, data.table reshaping twice:

# example input for two genes, assuming every row has the same number of columns
d <- read.table(text ="
 2-1 3-1 3-2 4-1 4-2 4-3
GeneA 0.514 0.535 0.436 0.530 0.388 0.418
GeneB 0.111 0.222 0.333 0.444 0.555 0.666", 
check.names = FALSE)

library(data.table)

# keep the gene names
d$Gene <- rownames(d)
setDT(d)

# reshape wide-to-long 
d <- melt(d, id.vars = "Gene")

# split on "-", apply factor levels for long-to-wide reshape with "fill"
d[, c("c1", "c2") := tstrsplit(variable, split = "-", fixed = TRUE)]
d[, c("c1", "c2") := lapply(.SD, factor, levels = 1:max(c(c1, c2))), .SDcols = c("c1", "c2") ]
d <- dcast(d, Gene + c1 ~ c2, drop = FALSE)

# split on gene names and convert to matrix
lapply(split(d[, -(1:2)], d$Gene), as.matrix)

# $GeneA
#          1     2     3  4
# [1,]    NA    NA    NA NA
# [2,] 0.514    NA    NA NA
# [3,] 0.535 0.436    NA NA
# [4,] 0.530 0.388 0.418 NA
# 
# $GeneB
#          1     2     3  4
# [1,]    NA    NA    NA NA
# [2,] 0.111    NA    NA NA
# [3,] 0.222 0.333    NA NA
# [4,] 0.444 0.555 0.666 NA
ADD COMMENT

Login before adding your answer.

Traffic: 3377 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6