Question

how to get correlation between the counts over each gene at the same timepoint (two replicates)

1

5.9 years ago

Lila M ★ 1.3k

Hi everybody, I have the counts (obtained by HTSeq) for a lot of genes (~58,000) at different time points, with replicates.

gene                           t1_S1    t1_S2
ENSG00000000003.14              0        0
ENSG00000000005.5               0        0
ENSG00000000419.12              1        3
 [...]

I would like to calculate the correlation between the counts for each gene at the same timepoint, to understand how reproducible the replication timing and progression are between repeats. Any suggestions?

RNA-Seq HTSeq replication correlation • 3.2k views

ADD COMMENT • link updated 5.9 years ago by Nicolas Rosewick 11k • written 5.9 years ago by Lila M ★ 1.3k

1

Check out the cor function in R. Different kinds of correlation measures are available, including Spearman and Pearson.
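A minimal sketch of that suggestion, assuming the counts sit in a numeric matrix with one column per replicate. The object name `counts` and the toy values are made up for illustration:

```r
# Toy count matrix: rows are genes, columns are the two replicates of t1.
# The values are invented for illustration only.
counts <- matrix(
  c(  0,   0,
      0,   0,
      1,   3,
    120, 135,
     48,  40),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("gene", 1:5), c("t1_S1", "t1_S2"))
)

# One coefficient per pair of replicates, computed across all genes.
pearson  <- cor(counts[, "t1_S1"], counts[, "t1_S2"], method = "pearson")
spearman <- cor(counts[, "t1_S1"], counts[, "t1_S2"], method = "spearman")
```

Pearson measures linear agreement and is sensitive to a few highly expressed genes, while Spearman only compares ranks, so the two can differ noticeably on raw counts.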

ADD REPLY • link 5.9 years ago by ATpoint 85k

1

This is what I am doing, but as I have a huge number of genes, R gets stuck. This is what I'm trying:

xx <- read.table(file="matrix_count", sep="\t", header = TRUE)
cor(t(xx), method="pearson")

Any other suggestions?

ADD REPLY • link 5.9 years ago by Lila M ★ 1.3k

1

Do I understand correctly that you aim to calculate 58000 correlation coefficients?

ADD REPLY • link 5.9 years ago by ATpoint 85k

1

Read count correlation between samples

ADD REPLY • link 5.9 years ago by h.mon 35k

score 5 · Accepted Answer · 2019-01-14

5

5.9 years ago

Nicolas Rosewick 11k

Do you want to test the correlation between the different timepoints or between the different genes?

Let's say you have 10 timepoints and 58,000 genes.

To test the different timepoints :

cor(xx, method="pearson")

will give you a 10 x 10 matrix, i.e. 100 correlations (the matrix is symmetric, so only 45 of them are distinct pairs of columns; I guess the cor function is smart enough not to compute both col A vs col B and col B vs col A).

To test the different genes (in a pairwise manner) :

cor(t(xx), method="pearson")

Here you get a 58,000 x 58,000 matrix, i.e. 3.364e+09 correlations (or 1,681,971,000 distinct pairs if the cor function is smart). That's why R crashes: it takes too long to compute so many correlations.
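The numbers above can be checked with a few lines of R. This is only back-of-the-envelope arithmetic; the memory figure assumes a dense matrix of 8-byte doubles and says nothing about how `cor` works internally:

```r
n_samples <- 10
n_genes   <- 58000

# Full matrix sizes (every ordered pair, including the diagonal).
sample_cells <- n_samples^2            # 100
gene_cells   <- n_genes^2              # 3.364e+09

# Distinct unordered pairs, i.e. the upper triangle without the diagonal.
sample_pairs <- choose(n_samples, 2)   # 45
gene_pairs   <- choose(n_genes, 2)     # 1681971000

# Approximate memory for a dense gene-by-gene matrix of doubles, in GB.
gene_matrix_gb <- gene_cells * 8 / 1e9 # about 26.9
```

So even before any computation time is considered, the gene-by-gene result would not fit in the memory of a typical workstation.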

Edit based on OP comments

Use the coefficient of variation (https://en.wikipedia.org/wiki/Coefficient_of_variation), computed per gene across the replicates:

dat.coeff.var <- apply(dat,1,function(x){sd(x)/mean(x)})
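A slightly more defensive sketch of the same idea, assuming `dat` is a numeric matrix with genes in rows and replicates in columns. The toy values and the helper name `coeff_var` are made up for illustration. Genes with zero counts in every replicate have a mean of 0, so the plain `sd(x)/mean(x)` returns `NaN` for them; here they are set to `NA` explicitly:

```r
# Toy data: rows are genes, columns are replicates (invented values).
dat <- matrix(
  c(  0,   0,
      1,   3,
    120, 135,
     48,  40),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), c("t1_S1", "t1_S2"))
)

# Coefficient of variation for one gene; undefined when the mean is 0.
coeff_var <- function(x) {
  m <- mean(x)
  if (m == 0) return(NA_real_)
  sd(x) / m
}

# One value per gene (row).
dat.coeff.var <- apply(dat, 1, coeff_var)
```

With only two replicates the value is a rough guide at best, and low-count genes (such as 1 vs 3) will tend to show a high coefficient of variation simply because of counting noise.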

ADD COMMENT • link 5.9 years ago by Nicolas Rosewick 11k

1

Maybe I explained badly what I want. I want to know the correlation for, let's say, gene ENSG00000000003.14 between the two replicates, to see whether each gene differs between replicates. I'm not interested in the correlation between ENSG00000000003.14 and ENSG00000000005.5. Does that make more sense?

ADD REPLY • link 5.9 years ago by Lila M ★ 1.3k

1

OK, so you want to check the correlation between replicates: then cor(xx, method="pearson")

ADD REPLY • link 5.9 years ago by Nicolas Rosewick 11k

0

Not exactly, because that gives me the correlation between replicates, and what I want to know is whether the counts for gene ENSG00000000003.14 differ between t1_S1 and t1_S2 (and likewise for the other genes).

ADD REPLY • link 5.9 years ago by Lila M ★ 1.3k

2

Maybe use the coefficient of variation: https://en.wikipedia.org/wiki/Coefficient_of_variation : dat.coeff.var <- apply(dat,1,function(x){sd(x)/mean(x)})

ADD REPLY • link 5.9 years ago by Nicolas Rosewick 11k

1

That's exactly what I want! Thanks!

ADD REPLY • link 5.9 years ago by Lila M ★ 1.3k

0

OK, great. I modified my answer to record the right solution. If the answer suits you, you can accept it.

ADD REPLY • link 5.9 years ago by Nicolas Rosewick 11k

1

There is no correlation for a single pair of measures. The correlation between samples will give you a general view of how similar samples are, and you can plot the values to check outliers. However, you have to take into account sample sequencing depth.
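One possible way to account for sequencing depth before comparing replicates is to scale each sample to counts per million (CPM) and log-transform. This is a sketch under assumptions, not a recommendation from the thread: the toy values, the object names, and the pseudocount of 1 are all illustrative choices:

```r
# Toy counts: the second replicate is sequenced roughly twice as deep.
counts <- matrix(
  c(  0,   0,
      0,   0,
      1,   3,
    120, 250,
     48,  90),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("gene", 1:5), c("t1_S1", "t1_S2"))
)

# Library size = total counts per sample (column).
lib_size <- colSums(counts)

# Counts per million: divide each column by its library size, scale by 1e6.
cpm <- sweep(counts, 2, lib_size, "/") * 1e6

# Log-transform with a pseudocount of 1 (an assumption) to tame large values.
log_cpm <- log2(cpm + 1)

# Sample-by-sample correlation matrix on the depth-normalised values.
replicate_cor <- cor(log_cpm, method = "pearson")

# Optional visual check for outlier genes:
# plot(log_cpm[, "t1_S1"], log_cpm[, "t1_S2"])
```

The scatter plot is where per-gene disagreement between the two replicates becomes visible, since a single coefficient only summarises the samples as a whole.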