Distance for gene expression samples
5.6 years ago
chipolino ▴ 150

Hi,

Imagine I have 200 samples, and for each one I have expression values for 500 genes (so it's a matrix with 500 rows and 200 columns). The expression values are normalized as TPM. I need to compute distances between the samples. What is the best way to do this? Should I take log(TPM+1), scale with a z-transformation, and then use Euclidean distance? Or is there a better way? And should I scale rows (calculate z-scores within genes but across samples) or columns?

Thanks

RNA-Seq distance • 3.3k views

Hi, maybe try a tutorial first to understand the basics. A quick Google search turns up many RNA-seq tutorials, such as this one. Good luck.

Well, I do know the basics. And I know that people usually calculate the distance with log CPM. However, my question is about TPM. What did I miss in the basics in your opinion?

I mean the basics needed to understand the difference between row and column z-scores.

The row z-score is there to make sure that highly expressed genes don't dominate which samples look similar to each other. I just wanted to check that this makes sense and that I didn't miss anything.
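
To make that concrete, here is a small sketch with two made-up genes (the names `geneA`, `geneB` and all values are invented for illustration). It compares sample-to-sample Euclidean distances before and after row z-scoring:

```r
# Two hypothetical genes across four samples: geneA is highly expressed
# with small fluctuations, geneB is lowly expressed but clearly separates
# samples s1/s2 from samples s3/s4.
x <- rbind(geneA = c(1000, 1010, 990, 1005),
           geneB = c(1, 1.2, 5, 5.5))
colnames(x) <- paste0("s", 1:4)

# dist() compares rows, so transpose to get distances between samples.
# Unscaled: geneA's absolute differences outweigh geneB's.
d_raw <- as.matrix(dist(t(x)))

# Row z-scores: each gene is centred and scaled across samples,
# so both genes contribute on a comparable scale.
z <- t(scale(t(x)))
d_z <- as.matrix(dist(t(z)))

d_raw["s1", c("s2", "s3", "s4")]
d_z["s1", c("s2", "s3", "s4")]
```

With the raw values, the nearest neighbour of `s1` is decided mostly by the noise in `geneA`; after row scaling, `geneB` gets a comparable say.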

Yes, this is what pheatmap and heatmap.2 do when scale = "row" is set, i.e., they scale by row. It is also how I usually convert my data to Z-scores. Assuming you have samples as columns and genes as rows, this can be done via:

t(scale(t(x)))

I show that each approach produces the same output here: A: cannot replicate the pheatmap scale function
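
As a quick sanity check of the one-liner above, the following sketch uses a toy matrix (made-up values, genes as rows and samples as columns) and compares `t(scale(t(x)))` against the row z-score written out by hand:

```r
# Toy matrix: 5 genes (rows) x 4 samples (columns); values are simulated.
set.seed(1)
x <- matrix(rexp(5 * 4, rate = 0.1), nrow = 5,
            dimnames = list(paste0("gene", 1:5), paste0("sample", 1:4)))

# scale() centres and scales columns, so transpose, scale, transpose back.
z <- t(scale(t(x)))

# The same thing by hand: (value - row mean) / row standard deviation.
z_manual <- (x - rowMeans(x)) / apply(x, 1, sd)

# Compare values only; scale() attaches extra attributes to its result.
all.equal(z, z_manual, check.attributes = FALSE)

rowMeans(z)       # numerically zero for every gene
apply(z, 1, sd)   # one for every gene
```

Note that a gene with zero variance across samples would produce NaN here, so it may be worth filtering such genes out beforehand.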

5.6 years ago

Hey, there is actually no standard way. For example, one of pheatmap() and heatmap.2() (gplots) (I cannot remember which) will first scale your data to Z-scores, then perform clustering on those Z-scores and draw the heatmap from them, whereas the other will perform clustering on the unscaled data and only show the Z-scaled data in the heatmap itself.

You can have complete control over your own scaling by switching off the scaling feature in both functions (or whatever other function you're using).

Your logged TPM+1 data should already be on a 'presentable' distribution. I see no issue with further scaling this to Z-scores and then performing both the clustering and the heatmap on that Z-scaled data. Euclidean distance would be fine, as would 1 minus Pearson correlation. Using Ward's linkage (ward.D2) as the linkage method will then tend to give an evenly distributed tree layout.
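
A minimal sketch of that workflow in base R, assuming a genes-by-samples TPM matrix. The data here are simulated, and the object names (`tpm`, `logtpm`) and the choice of `k = 4` are only for illustration:

```r
set.seed(42)
# Simulated stand-in for a TPM matrix: 500 genes (rows) x 200 samples (columns).
tpm <- matrix(rlnorm(500 * 200, meanlog = 2, sdlog = 1.5), nrow = 500,
              dimnames = list(paste0("gene", 1:500), paste0("s", 1:200)))

# Log-transform with a pseudocount of 1.
logtpm <- log2(tpm + 1)

# Drop genes with zero variance; they would give NaN after scaling.
logtpm <- logtpm[apply(logtpm, 1, sd) > 0, , drop = FALSE]

# Row (per-gene) Z-scores.
z <- t(scale(t(logtpm)))

# dist() works on rows, so transpose to get sample-to-sample distances.
d_euc <- dist(t(z), method = "euclidean")

# Alternative: 1 minus Pearson correlation; cor() already compares columns.
d_cor <- as.dist(1 - cor(z, method = "pearson"))

# Hierarchical clustering of samples with Ward's linkage.
hc <- hclust(d_euc, method = "ward.D2")
clusters <- cutree(hc, k = 4)   # k = 4 is an arbitrary choice
table(clusters)
```

If the heatmap function is then given `z` directly, its own scaling option should be switched off (e.g. `scale = "none"`) so the data are not scaled twice.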

Kevin

Thank you for your awesome answer!
