Question

Grouping columns in a BED-like file

10.3 years ago

ruchiksy ▴ 50

I have a BED-like file with 19 columns, which I have reduced to the following format:

gencode.v19  A_Heart_AG  A_Heart_BC  A_Kidney_AG  A_Kidney_BC  A_Liver_AG  A_Liver_BC  A_Lung_AG  A_Lung_BC  A_Stomach_BC  A_Stomach_OG_0288  A_Stomach_OG_0393  A_Stomach_OG_1840
1            0           0           1            1            1           1           1          1          1             1                  1                  0
1            1           0           1            1            1           0           0          1          1             1                  1                  0
0            0           0           0            1            0           0           0          0          0             0                  0                  0
1            0           1           1            1            1           1           1          1          1             1                  1                  0
1            0           0           0            0            0           0           0          0          0             0                  0                  0
1            1           1           1            1            1           0           1          1          1             0                  1                  0
1            0           0           1            1            1           0           0          1          0             0                  1                  0
1            1           1           1            0            1           0           1          1          0             1                  1                  0
0            0           0           1            0            0           0           0          0          0             0                  0                  0

I am looking to group all the tissues together, so that Heart_AG and Heart_BC would become just "Heart", and so on for the other tissues. Then I want to take the resulting file and count how many times each library has an intron present. This is being done to create a 6-way Venn diagram.

I thought of using an awk command, but I would like to automate the process rather than massage the file unnecessarily.

How should I go about doing this?

Further Details

The 0s and 1s represent the presence of introns in various libraries. By grouping I mean taking, for example, "Heart", which has two vendors: Agilent and Biochain. Look for the intron in either library, and if it is present in at least one of them, count it as "1", like so:

A_Heart_AG     A_Heart_BC   Count
1              0            1
0              0            0
1              1            1

I would have to do this for all libraries and then make a six-way Venn diagram (six because the Gencode annotations count as one set). The Venn diagram would be made by hand, or I would pass the counts through an R library that could do it for me.

python introns • 2.2k views

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by ruchiksy ▴ 50

It's not quite clear to me exactly what you wish to do.

Do 0 and 1 represent intron yes/no at several different locations (rows)?

When you say "group", does that mean "sum values from all samples, by tissue"? Per row position? What does the final result look like (if done by hand)?

You say six-way Venn diagram, but there are only five different tissues in that text file.

On a side note, I would just describe the file as whitespace-separated; it hasn't got much to do with a BED file.

ADD REPLY • link 10.3 years ago by David Fredman ★ 1.1k

Amended the question.

ADD REPLY • link 10.3 years ago by ruchiksy ▴ 50

There is nothing BED-like about that file. It looks like a matrix of features (columns) for libraries? (rows?)

If you want to do a disjunction operation, you could read each row of 0/1 values into an array. Then apply the boolean OR (|) operation on subsets of columns: for example, apply it to the values in columns 2 and 3 of each row, which gives you a "heart" value for that row, and repeat for the pairs or triplets of columns belonging to the other tissue types.

When you finish processing a row of values into a smaller set of condensed values, print out a new row to standard output.
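
The row-wise OR described above can be sketched in Python as follows. This is only a minimal sketch, assuming whitespace-separated input with a header row and column names shaped like `A_<Tissue>_<vendor>`; the helper names `tissue_of` and `collapse_by_tissue` and the name-parsing rule are hypothetical, not something taken from the original file.

```python
from collections import OrderedDict


def tissue_of(column_name):
    """Map 'A_Heart_AG' -> 'Heart'; names without that shape are kept as-is."""
    parts = column_name.split("_")
    # Assumption: a one-letter prefix, then the tissue, then vendor details.
    return parts[1] if len(parts) >= 3 else column_name


def collapse_by_tissue(lines):
    """OR together the 0/1 columns that belong to the same tissue, row by row."""
    header = lines[0].split()
    groups = OrderedDict()  # tissue name -> list of column indices
    for idx, name in enumerate(header):
        groups.setdefault(tissue_of(name), []).append(idx)

    rows = []
    for line in lines[1:]:
        if not line.strip():
            continue  # skip blank lines
        values = [int(v) for v in line.split()]
        # any() over a group of columns is the boolean OR of those columns.
        rows.append([int(any(values[i] for i in idxs)) for idxs in groups.values()])
    return list(groups), rows


# Small made-up excerpt in the same layout as the table in the question.
example = [
    "gencode.v19 A_Heart_AG A_Heart_BC A_Kidney_AG A_Kidney_BC",
    "1 0 0 1 1",
    "1 1 0 1 1",
    "0 0 0 0 1",
]
names, collapsed = collapse_by_tissue(example)
print("\t".join(names))
for row in collapsed:
    print("\t".join(map(str, row)))
```

Each condensed row is printed to standard output, so the result can be redirected to a new file or piped into the next step.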

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by Alex Reynolds 36k

score 0 · Answer 1 · 2014-08-06

Here's a solution using R.

Step 1: Rename the columns to be just gencode or the tissue name (done here in two steps that remove letters using regular expression matches; this could be compacted into a single step).

Step 2: Sum counts across columns grouped by tissue name. This is done here using the rowsum function from base R. Because rowsum only groups rows and you wish to sum across columns, the matrix is transposed for the calculation and then transposed back for presentation via a nested function call.

# read.csv() expects comma-separated input; for a whitespace-separated
# file use read.table("your_table.txt", header = TRUE) instead
dat = read.csv("your_table.csv")
labels = gsub("^\\w_", "", names(dat))   # remove the one-letter prefix, e.g. "A_"
names(dat) = gsub("_\\w+$", "", labels)  # remove the _suffix, e.g. "_AG" or "_OG_0288"

# rowsum() is part of base R, so no library() call is needed
t(rowsum(t(dat), names(dat)))
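
If the per-tissue values are sums (so that 2 means both vendors detected the intron), they can be reduced to presence/absence and tallied per combination of sets to get the numbers for the Venn regions. The following Python sketch shows one way to do that; the function name `venn_counts` and the three-column example data are made up for illustration, and the real table would have six sets.

```python
from collections import Counter


def venn_counts(names, rows):
    """Count rows per presence/absence pattern across the named sets."""
    region_counts = Counter()  # exact combination of sets -> number of rows
    set_totals = Counter()     # set name -> number of rows where it is present
    for row in rows:
        # Any value above zero counts as "present", whether it is 1 or a sum.
        present = tuple(name for name, value in zip(names, row) if value > 0)
        if not present:
            continue  # a row with no hits belongs to no Venn region
        region_counts[present] += 1
        for name in present:
            set_totals[name] += 1
    return region_counts, set_totals


# Hypothetical per-tissue sums, one row per intron.
names = ["gencode.v19", "Heart", "Kidney"]
summed = [
    [1, 0, 2],
    [1, 1, 2],
    [0, 0, 1],
    [1, 0, 1],
]
regions, totals = venn_counts(names, summed)
for members, n in sorted(regions.items()):
    print(" & ".join(members), n)
print(dict(totals))
```

Each key in `regions` is one exact combination of sets, which corresponds to a single region of the diagram, while `totals` gives the overall size of each set.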