4.6 years ago
r.tor
I want to split a matrix called 'matrix' into chunks based on the values in the first column, 'GENE', and save each chunk as a separate .gz file. Each chunk should contain the header line plus all the lines for 3 GENEs, except the last chunk, which holds whatever GENEs remain, as shown in the example below. The script should be written in Bash.
Input:
> matrix
GENE Individual Expr1 Expr2 Expr3
ENSG1 indv1 0.1 0.2 0.3
ENSG1 indv2 0.1 0.2 0.3
ENSG2 indv1 0.1 0.2 0.3
ENSG2 indv2 0.1 0.2 0.3
ENSG3 indv1 0.1 0.2 0.3
ENSG3 indv2 0.1 0.2 0.3
ENSG4 indv1 0.1 0.2 0.3
ENSG4 indv2 0.1 0.2 0.3
ENSG5 indv1 0.1 0.2 0.3
ENSG5 indv2 0.1 0.2 0.3
ENSG6 indv1 0.1 0.2 0.3
ENSG6 indv2 0.1 0.2 0.3
ENSG7 indv1 0.1 0.2 0.3
ENSG7 indv2 0.1 0.2 0.3
ENSG8 indv1 0.1 0.2 0.3
ENSG8 indv2 0.1 0.2 0.3
ENSG9 indv1 0.1 0.2 0.3
ENSG9 indv2 0.1 0.2 0.3
ENSG10 indv1 0.1 0.2 0.3
ENSG10 indv2 0.1 0.2 0.3
Outputs:
> matrix.chunk1
GENE Individual Expr1 Expr2 Expr3
ENSG1 indv1 0.1 0.2 0.3
ENSG1 indv2 0.1 0.2 0.3
ENSG2 indv1 0.1 0.2 0.3
ENSG2 indv2 0.1 0.2 0.3
ENSG3 indv1 0.1 0.2 0.3
ENSG3 indv2 0.1 0.2 0.3
> matrix.chunk2
GENE Individual Expr1 Expr2 Expr3
ENSG4 indv1 0.1 0.2 0.3
ENSG4 indv2 0.1 0.2 0.3
ENSG5 indv1 0.1 0.2 0.3
ENSG5 indv2 0.1 0.2 0.3
ENSG6 indv1 0.1 0.2 0.3
ENSG6 indv2 0.1 0.2 0.3
> matrix.chunk3
GENE Individual Expr1 Expr2 Expr3
ENSG7 indv1 0.1 0.2 0.3
ENSG7 indv2 0.1 0.2 0.3
ENSG8 indv1 0.1 0.2 0.3
ENSG8 indv2 0.1 0.2 0.3
ENSG9 indv1 0.1 0.2 0.3
ENSG9 indv2 0.1 0.2 0.3
> matrix.chunk4
GENE Individual Expr1 Expr2 Expr3
ENSG10 indv1 0.1 0.2 0.3
ENSG10 indv2 0.1 0.2 0.3
I would appreciate any suggestion.
I'm not providing the code, but here is what you can do:
Prepare a list object where each element of the list contains the gene names for one chunk, e.g.
Loop over this list object and collect your matrix chunks by matching the contents of each element (i.e. ENSG1, ENSG2, ...) against the original matrix, then save each chunk to a file.
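For what it's worth, here is a minimal sketch of one way this could look in Bash. It does not build the explicit list of gene groups described above; instead it reads the file once with awk and starts a new chunk every time it has seen 3 new values in the 'GENE' column. The function name `split_by_gene` and the output names (`matrix.chunk1.gz`, `matrix.chunk2.gz`, ...) are my own choices, not something from the question. It assumes whitespace-separated columns, a single header line, all rows of a gene sitting next to each other (as in the example), and a file name without double quotes in it.

```shell
#!/usr/bin/env bash
# split_by_gene FILE [GENES_PER_CHUNK]
# Hypothetical helper: writes FILE.chunk1.gz, FILE.chunk2.gz, ...
# Each chunk gets the header line plus all rows for up to
# GENES_PER_CHUNK genes (default 3). Assumes the rows of each gene
# are contiguous in FILE.
split_by_gene() {
    local infile=$1 per_chunk=${2:-3}
    awk -v n="$per_chunk" -v prefix="$infile" '
        NR == 1 { header = $0; next }        # remember the header line
        NF == 0 { next }                     # skip blank lines
        $1 != prev {                         # first row of a new gene
            prev = $1
            if (genes % n == 0) {            # n genes written: start a new chunk
                if (cmd != "") close(cmd)    # let the previous gzip finish
                cmd = "gzip -c > \"" prefix ".chunk" (++chunk) ".gz\""
                print header | cmd           # every chunk starts with the header
            }
            genes++
        }
        { print | cmd }                      # append the row to the current chunk
        END { if (cmd != "") close(cmd) }
    ' "$infile"
}
```

With the example input saved as `matrix`, running `split_by_gene matrix 3` should leave four .gz files next to it, and `gzip -dc matrix.chunk1.gz` prints the header followed by the ENSG1 to ENSG3 rows. If the rows of a gene are not contiguous, sort the body of the file on the first column before splitting.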