I have a BED-like file with 19 columns, which I have reduced to the following format:
gencode.v19 A_Heart_AG A_Heart_BC A_Kidney_AG A_Kidney_BC A_Liver_AG A_Liver_BC A_Lung_AG A_Lung_BC A_Stomach_BC A_Stomach_OG_0288 A_Stomach_OG_0393 A_Stomach_OG_1840
1 0 0 1 1 1 1 1 1 1 1 1 0
1 1 0 1 1 1 0 0 1 1 1 1 0
0 0 0 0 1 0 0 0 0 0 0 0 0
1 0 1 1 1 1 1 1 1 1 1 1 0
1 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 0 1 1 1 0 1 0
1 0 0 1 1 1 0 0 1 0 0 1 0
1 1 1 1 0 1 0 1 1 0 1 1 0
0 0 0 1 0 0 0 0 0 0 0 0 0
I am looking to group all the tissues together, so that Heart_AG and Heart_BC would become just "Heart", and so on for the other tissues. Then I want to take the resulting file and count how many times each library has an intron present. This is being done to create a six-way Venn diagram.
I thought of using an awk command, but I would like to automate the process rather than massage the file unnecessarily.
How should I go about doing this?
Further Details
The 0s and 1s represent the absence or presence of introns in various libraries. By grouping I mean, for example, taking "Heart", which has two vendors, Agilent and Biochain: look for the intron in either library, and if it is present in at least one of them, count it as "1", like so:
A_Heart_AG A_Heart_BC Count
1 0 1
0 0 0
1 1 1
I would have to do this for all libraries and then make a six-way Venn diagram (six including the Gencode annotations). The Venn diagram would be made by hand, or by passing the counts to an R library that could draw it for me.
It's not quite clear to me exactly what you wish to do.
Do 0 and 1 represent intron yes/no at several different locations (rows)?
When you say "group", does that mean "sum values from all samples, by tissue"? Per row position? What does the final result look like (if done by hand)?
You say six-way Venn diagram, but there are only five different tissues in that text file.
On a side note, I would just describe the file as whitespace-separated; it hasn't got much to do with a BED file.
Amended the question.
There is nothing BED-like about that file. It looks like a matrix of features (columns) for libraries (rows)?
If you want to do a disjunction operation, you could read each row of 0/1 values into an array. Then apply the OR (or |) boolean operation on subsets of columns (e.g. apply the operation to the values in columns 2 and 3 in each row, which gives you a "heart" value for that row, and repeat for the other pairs, triplets, etc. of columns for the other tissue types). When you finish processing a row of values into a smaller set of condensed values, print out a new row to standard output.
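As a rough illustration of that approach, here is a minimal Python sketch. It rests on some assumptions that are not stated in the question: that the file is whitespace-separated with a single header row, that library columns are named "A_<Tissue>_<Vendor>" (optionally followed by an ID), and that any other column, such as gencode.v19, forms a group of its own. The function names tissue_of, collapse and venn_counts are made up for this example.

```python
from collections import Counter


def tissue_of(column_name):
    """Map a column name to its group, e.g. 'A_Heart_AG' -> 'Heart'.

    Assumption: library columns look like 'A_<Tissue>_<Vendor>[_<id>]';
    anything else (e.g. 'gencode.v19') is kept as its own group.
    """
    parts = column_name.split("_")
    return parts[1] if len(parts) >= 3 else column_name


def collapse(lines):
    """OR together the 0/1 columns that belong to the same group.

    `lines` is any iterable of text lines: a header row followed by
    rows of 0/1 values. Returns (group_names, collapsed_rows).
    """
    rows = [line.split() for line in lines if line.strip()]
    header, data = rows[0], rows[1:]

    # Remember which column indices belong to each group, in file order.
    groups = {}
    for idx, name in enumerate(header):
        groups.setdefault(tissue_of(name), []).append(idx)

    # A group gets 1 for a row if any of its columns holds a 1.
    collapsed = [
        [int(any(row[i] == "1" for i in idxs)) for idxs in groups.values()]
        for row in data
    ]
    return list(groups), collapsed


def venn_counts(names, collapsed):
    """Count rows per exact combination of groups (one Venn region each)."""
    return Counter(
        tuple(name for name, value in zip(names, row) if value)
        for row in collapsed
    )
```

With a file handle passed in, for instance `names, collapsed = collapse(open("introns.txt"))` (the filename is hypothetical), the collapsed rows can be printed one per line as the new table, and `venn_counts(names, collapsed)` gives the number of introns falling into each combination of groups, which is the kind of tally a six-way Venn diagram needs. Rows with no 1 in any group end up under the empty tuple.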