Question

Finding the largest possible combination of mutations

1

Entering edit mode

9.7 years ago

mkulecka ▴ 380

I have a large dataset, with about 81 tumors and 10 000 mutations. I would like to know largest possible number of shared mutations between tumors

        tum1 tum2 tum3 tum4
mut1    1    1    0    1
mut2    1    1    0    1
mut3    0    0    1    1
mut4    1    1    0    1

For instance in the example above the largest possible number of mutations is three between tumors 1, 2 and 4. The problem is I would like to know how to do this in computationally effective way - the question is non-trivial for larger datasets. I tried using powerset in python (https://docs.python.org/2/library/itertools.html#recipes) but it's not computationally effective.

next-gen programming algorithms • 2.4k views

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 9.7 years ago by mkulecka ▴ 380

1

Entering edit mode

Hi,

I'm wondering if the possible values in your matrix are only 0 or 1 ? If so, have your tried to consider your matrix as a vector of binary numbers that can be bitwise compared among themselves ?

Can we have a bit more details on the way you perform this actually ?

ADD REPLY • link 9.7 years ago by Thibault D. ▴ 700

0

Entering edit mode

I have created a powerset from mutations which appeared more than once and checked if a given combination is present in more than one sample - as you can imagine this is not computationally effective,

In this case, the possible values in matrix are only 0 and 1, that is correct. However, I have no idea how to implement what you are suggesting in python or R.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.7 years ago by mkulecka ▴ 380

Ram · Answer 1 · 2015-12-02

In R:

# produce a matrix of mutations (logical matrix),
# samples in columns, mutations in rows
mutations = matrix(rnorm(100)>1,ncol=10)

If we convert our matrix to a data frame, we can do things like this:

mutationsdf = data.frame(mutations)
Reduce('&',mutationsdf[,c(1,2,3)])

This last line will give us a logical vector of the mutations present in all three of the first three samples. Change the sample numbers as you see fit.

score 0 · Answer 2 · 2015-12-02

In Python (2 or 3), given the following table:

      tum1 tum2 tum3 tum4
mut1    1    1    0    1
mut2    1    1    0    1
mut3    0    0    1    1
mut4    1    1    0    1

We can write each sample as binary numbers: tum1 = 1101 (binary) = 13 (decimal). Therefore, we can use Bitwise Operators to compare those numbers:

# Original vectors
tum1 = int('1101', 2)
tum2 = int('1101', 2)
tum3 = int('0010', 2)
tum4 = int('1111', 2)
# Let's print tum1
tum1
> 13
#Bitwise comparison
bin(tum1 & tum2)[2:] 
> '1101'
bin(tum1 & tum3)[2:]
> '0001'
bin(tum1 & tum4)[2:]
> '1101'

A '0' at the position 'n' means the two conditions does not share the mutation 'n'. On the contrary, a '1' at the 'n'th position means the two conditions shares the mutation 'n'.

We have now the common mutations among the tumors. However, we would like to know which one has the highest number of common mutation. So let's consider the following Python command line. It sums our previous vectors:

sum(map(int, bin(tum1 & tum2)[2:]))
> 3
sum(map(int, bin(tum1 & tum3)[2:]))
> 0
sum(map(int, bin(tum1 & tum4)[2:]))
> 3

We can see that the couples (tum1, tum2) and (tum1, tum4) have 3 mutations in common, and that the couple (tum1, tum3) have none.

In terms of time, I have tried with two 100,000-values-long vectors, composed by random '0's and '1's, the comparison seemed instant to me. Reversing and joining the table was the most time-consuming step.