
Question

Make large matrix data interpretable. How?

2.7 years ago

Ankit ▴ 520

Hi,

I have a big matrix of quantitation values (range 0 to 1). The x-axis and y-axis are different proteins. I want to categorise the data into different clusters/groups and represent them. What is the best way to do that?

Some possibilities I thought of:

Correlation matrix plot (too big)

Heatmap (too big)

PCA (will be messy)

Dendrogram (not so useful)

Any strategies?

Thank you

R matrix • 2.2k views

ADD COMMENT • link updated 2.7 years ago by Mensur Dlakic ★ 29k • written 2.7 years ago by Ankit ▴ 520

Can you explain your data in more detail, and the general question(s) you're trying to answer? It would be difficult to give an answer without knowing more.

ADD REPLY • link 2.7 years ago by rpolicastro 13k

Unless you provide more information on the data itself (how many dimensions does one feature have?) and your hypothesis, it is hard to answer your question. Have you already considered applying either t-SNE or UMAP dimension reduction techniques, which may be more suitable than PCA?

ADD REPLY • link 2.7 years ago by Matthias Zepper 5.1k

Thank you guys!

I will look into t-SNE and UMAP.

Basically I have a dataframe of values which I calculated as follows:

Step 1.

(Number of genes bound by both protein1 and protein2) divided by (total number of genes bound by protein1)

and similarly for protein2.

In this way I created a matrix of ratios from 0 to 1 for all the different combinations.
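The Step 1 ratio can be sketched in a few lines. This is only an illustration of the calculation as described: the function name `overlap_ratios`, the toy gene sets, and the choice to return 0.0 for a protein that binds no genes are all assumptions, not part of the original workflow.

```python
from typing import Dict, List, Set, Tuple


def overlap_ratios(bound: Dict[str, Set[str]]) -> Tuple[List[str], List[List[float]]]:
    """Return protein names and a matrix where entry [i][j] is
    |genes(i) & genes(j)| / |genes(i)|."""
    names = sorted(bound)
    matrix = []
    for a in names:
        total_a = len(bound[a])
        row = []
        for b in names:
            shared = len(bound[a] & bound[b])
            # Assumption: a protein with no bound genes gets ratio 0.0
            row.append(shared / total_a if total_a else 0.0)
        matrix.append(row)
    return names, matrix


# Hypothetical toy input: protein name -> set of bound genes
example = {
    "protein1": {"g1", "g2", "g3", "g4"},
    "protein2": {"g3", "g4"},
    "protein3": {"g5"},
}
names, matrix = overlap_ratios(example)
```

Note that a matrix built this way is generally not symmetric (here protein1 vs protein2 gives 0.5, while protein2 vs protein1 gives 1.0), which matters for any method that expects a symmetric distance matrix.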

Step 2.

Next, I want to categorise the N proteins by whether they share more or fewer genes with protein1 than the others do, and so on for each of the other proteins.

Step 3.

Then I want to predict a network of proteins which could regulate the same genes. Basically, I want to use the Step 2 data to get a list of sets of proteins that could co-regulate Gene 1, Gene 2 and so on (hypothetical in silico support).

Step 4.

We will test some genes of interest by co-IP or immunoprecipitation in WT and KO.

More info:

Dimensions are 560 x 560 (all against all)

Values range from 0 to 1, where 0 means no genes are shared and 1 means the genes bound are exactly the same

Let me know if more information is required.

ADD REPLY • link 2.7 years ago by Ankit ▴ 520

score 2 · Answer 1 · 2023-01-03

The more information you provide, the more likely it is that we can offer good suggestions. In particular, what is a "large matrix", and what exactly does "interpretable" mean to you? The answer is different for a matrix of 500-1000 proteins (I will assume this) versus a matrix of 100,000 proteins. The answer is also different if by interpretable you mean finding a global number of protein groups (I will assume this) versus understanding individual protein relationships at a more granular level.

All the strategies you proposed are viable. Dimensionality reduction with PCA may work, but I will give you two examples of doing it with t-SNE. The principal difference is that PCA is a linear method: it preserves global distances between data points but doesn't always separate them clearly. t-SNE is nonlinear and often separates points better, but it preserves only their local relationships. If you want to try t-SNE, I recommend openTSNE.

Both embeddings below are for 699 proteins. In the first case we start from a symmetric distance matrix, which partially looks like this:

vector_0001,vector_0002,vector_0003,vector_0004,vector_0005  ....
0.000000,0.652870,0.648000,0.675000,0.639257,0.640854,0.685039,0.636119  ....
0.652870,0.000000,0.373832,0.730000,0.512684,0.383178,0.457944,0.485175 ....

And here is the embedding:

t-SNE plot 1

To me that is plenty interpretable, but I don't know exactly what you are trying to achieve.
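For readers who want to try the distance-matrix route, here is a minimal sketch. It uses scikit-learn's `TSNE` with `metric="precomputed"` rather than openTSNE, purely as a swapped-in implementation, and it is not the code that produced the plot above. The helper names, the toy random matrix, and the conversion of a 0-to-1 overlap matrix into distances (symmetrise by averaging, then take 1 minus the value) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE


def overlap_to_distance(overlap: np.ndarray) -> np.ndarray:
    """Assumed conversion: average the matrix with its transpose,
    then turn similarity (1 = identical) into distance (0 = identical)."""
    sym = (overlap + overlap.T) / 2.0
    dist = 1.0 - sym
    np.fill_diagonal(dist, 0.0)  # each protein is at distance 0 from itself
    return np.clip(dist, 0.0, 1.0)


def embed_distances(dist: np.ndarray, perplexity: float = 30.0, seed: int = 0) -> np.ndarray:
    """Embed a square, non-negative distance matrix in 2D with t-SNE.
    PCA initialisation is not available for precomputed distances,
    so random initialisation is used."""
    model = TSNE(
        n_components=2,
        metric="precomputed",
        init="random",
        perplexity=perplexity,
        random_state=seed,
    )
    return model.fit_transform(dist)


# Toy stand-in for a real all-against-all matrix (random values, not real data)
rng = np.random.default_rng(0)
toy_overlap = rng.random((50, 50))
np.fill_diagonal(toy_overlap, 1.0)

toy_dist = overlap_to_distance(toy_overlap)
toy_xy = embed_distances(toy_dist, perplexity=10.0)  # one (x, y) row per protein
```

The perplexity has to be smaller than the number of points, and different values can change the picture noticeably, so it is worth trying a few before reading too much into any one plot.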

For the same group of proteins, we can instead start with a protein language model, specifically ProtT5-XL-U50, and create a 1024-dimensional vector for each protein of interest. The resulting matrix looks in part like this:

vector_0001,vector_0002,vector_0003,vector_0004,vector_0005  ....
0.053524988,0.034982558,0.056690590,0.032287619,0.045326617 ....
0.029987951,-0.003866601,0.032291183,0.026446213,0.029312980 ....

And here is the embedding from the matrix above:

t-SNE plot 2

Again, this is plenty interpretable for my needs, but your mileage may vary.
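The feature-matrix case (one row of numbers per protein) can be sketched the same way. Again this swaps in scikit-learn's `TSNE` instead of openTSNE, and the function name, the synthetic 1024-column input, and the default Euclidean metric are assumptions; it does not reproduce the plot above.

```python
import numpy as np
from sklearn.manifold import TSNE


def embed_features(features: np.ndarray, perplexity: float = 30.0,
                   metric: str = "euclidean", seed: int = 0) -> np.ndarray:
    """Embed a (n_proteins, n_features) matrix in 2D with t-SNE,
    using PCA initialisation."""
    model = TSNE(
        n_components=2,
        metric=metric,
        init="pca",
        perplexity=perplexity,
        random_state=seed,
    )
    return model.fit_transform(features)


# Synthetic stand-in for real embeddings: 3 groups of 20 vectors, 1024 values each
rng = np.random.default_rng(1)
centers = rng.normal(size=(3, 1024))
toy_features = np.vstack([c + 0.1 * rng.normal(size=(20, 1024)) for c in centers])

toy_embedding = embed_features(toy_features, perplexity=10.0)  # shape (60, 2)
```

With real data, the rows would come from the per-protein vectors instead of the synthetic groups, and the 2D coordinates can then be plotted or passed to a clustering method of your choice.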
