Question

Compute correlation for depth of coverage between 2 datasets with different length in R

0

Entering edit mode

6.6 years ago

Ana ▴ 200

Hello everyone, I am trying to compute correlation between depth of coverage for individuals with homozygote genotypes vs. individuals with heterozygote genotypes.

here is the fist few lines of my datasets (note, I have only one position across all individuals):

head(homozygotes)
   chrom       pos dp  ind_id genotype_id
1      1 115258827 12 HG00099         0|0
3      1 115258827  8 HG00101         0|0
4      1 115258827  6 HG00103         0|0
8      1 115258827  2 HG00114         0|0
9      1 115258827  8 HG00115         0|0
12     1 115258827  8 HG00128         0|0

head(heterozygotes)
   chrom       pos dp  ind_id genotype_id
2      1 115258827  5 HG00100         0|1
14     1 115258827  5 HG00133         0|1
16     1 115258827  5 HG00138         0|1
19     1 115258827  2 HG00160         1|0
27     1 115258827  4 HG00232         1|0
33     1 115258827  9 HG00251         1|0

these 2 datasets differ in length. Therefore, when I simply try cor.test(homozygotes$dp,heterozygotes$dp), I get an error message:

"Error in cor : incompatible dimensions.

I have searched to find a solution, but have not been able to find a solid solution. I am now quite stuck, does anyone have any idea how can I proceed and figure this out? I would sincerely appreciate it.

Thanks a lot

depth of coverage correlation R • 4.7k views

ADD COMMENT • link updated 6.6 years ago by zx8754 12k • written 6.6 years ago by Ana ▴ 200

0

Entering edit mode

you need to match the columns. Down sample larger dataset.

ADD REPLY • link 6.6 years ago by cpad0112 21k

score 1 · Answer 1 · 2018-05-08

1

Entering edit mode

6.6 years ago

Nicolas Rosewick 11k

As you have multiple value per each genomic position you might try to use the median or the mean per position. And then perfom the correlation. Here I will use the mean but you could use median or max depending on the question you want to answer.

library(tidyverse)

homozygotes.mean <- homozygotes %>%
  group_by(chrom,pos) %>%
  summarise(dp=mean(dp))

heterozygotes.mean <- heterozygotes %>%
  group_by(chrom,pos) %>%
  summarise(dp=mean(dp))

Then only take genomic positions found in both dataframes. For this you can use GenomicRanges :

library(GenomicRanges)

homozygotes.mean.gr <- GRanges(homozygotes.mean$chrom,IRanges(homozygotes.mean$pos,homozygotes.mean$pos))
heterozygotes.mean.gr <- GRanges(heterozygotes.mean$chrom,IRanges(heterozygotes.mean$pos,heterozygotes.mean$pos))
ov <- findOverlaps( homozygotes.mean.gr, heterozygotes.mean.gr)
homozygotes.mean.common <- homozygotes.mean[queryHits(ov),]
heterozygotes.mean.common <- heterozygotes.mean[subjectHits(ov),]

Then you can perform the correlation

cor.test(homozygotes.mean.common$dp,heterozygotes.mean.common$dp)

ADD COMMENT • link 6.6 years ago by Nicolas Rosewick 11k

0

Entering edit mode

i guess this is good only if genomic positions are equal in number in both the data (hetero and homozygotes).

ADD REPLY • link 6.6 years ago by cpad0112 21k

0

Entering edit mode

No it works for different number of homo and heterozygote genotyopes. findOverlaps only reports positions that are in both dataframes and filter out the others.

ADD REPLY • link 6.6 years ago by Nicolas Rosewick 11k

0

Entering edit mode

okay. got it. code is taking the mean of multiple DP values (if there are any) for any given coordinates and then considers only the shared genomic coordinates between the two frames, leaving out unmatched genomic coordinates. correct?

ADD REPLY • link 6.6 years ago by cpad0112 21k

0

Entering edit mode

Yes that's it. In this example I used the mean but I could also use max or median

ADD REPLY • link 6.6 years ago by Nicolas Rosewick 11k

0

Entering edit mode

I guess there is a typo in these lines:

IRanges(homozygotes.mean$pos,homozygotes$pos) vs IRanges(heterozygotes.mean$pos,heterozygotes.mean$pos)

ADD REPLY • link 6.6 years ago by cpad0112 21k

0

Entering edit mode

Yes thanks I edited now.

ADD REPLY • link 6.6 years ago by Nicolas Rosewick 11k

0

Entering edit mode

I actually have only one potion in each data-set. Eventually I need to fix the script and run to for 1800 files. Each file only has one position. So I am guessing I do not need the second part of your code. However, when I tried cor.test(homozygotes.mean$dp,heterozygotes.mean$dp), I get an error message :

Error in cor.test.default(homozygotes.mean$dp, heterozygotes.mean$dp) : not enough finite observations

Any idea how to solve it? thanks

ADD REPLY • link 6.6 years ago by Ana ▴ 200

0

Entering edit mode

for detailed explanation: https://stackoverflow.com/questions/24659183/r-cor-test-not-enough-finite-observations copy/pasted:

If your vectors do not contain enough non-NA values (less than 3), the function will return the error.

up vote the OP in SO. Also check for NA values in your dataset.

ADD REPLY • link 6.6 years ago by cpad0112 21k

0

Entering edit mode

Yes, already found this. But, my data-set does not have any NA. I am not quite how this going to work. By calculating the mean I get one mean value for 0|0 genotype and one vale for heterozygote genotype. So, I will have only 2 values for the co.test!

ADD REPLY • link 6.6 years ago by Ana ▴ 200

0

Entering edit mode

Please look into failing data frame dimensions or the length of dp vector post calculating mean/median/max/min.

ADD REPLY • link 6.6 years ago by cpad0112 21k

0

Entering edit mode

Not quite sure if I get your point. The error says I need to have at least 3 non-NA values for the correlation test. cor.test(homozygotes.mean.common$dp,heterozygotes.mean.common$dp) this computes correlation between only 2 values, which I guess is the reason of error.

ADD REPLY • link 6.6 years ago by Ana ▴ 200

0

Entering edit mode

minimum of 3 non-NA values in data (each), as I understand.

ADD REPLY • link 6.6 years ago by cpad0112 21k

0

Entering edit mode

No, I do not think so. I get this error from running cor.test and I run the cor.test between homozygotes.mean.common$dp (=7.61) and heterozygotes.mean.common$dp (=6.5). So only 2 values. I am not sure of this is the best approach to proceed! I am trying to find if there is any way to compute correlation for non-equal data frames, not taking the mean of each data frame and estimate correlation from the means.

ADD REPLY • link 6.6 years ago by Ana ▴ 200

0

Entering edit mode

> hom
  chrom       pos dp  ind_id genotype_id
1     1 115258827 12 HG00099         0|0
2     1 115258827  8 HG00101         0|0
3     1 115258827  6 HG00103         0|0
4     1 115258827  2 HG00114         0|0
5     1 115258827  8 HG00115         0|0
6     1 115258827  8 HG00128         0|0
> het
  chrom       pos dp  ind_id genotype_id
1     1 115258827  5 HG00100         0|1
2     1 115258827  5 HG00133         0|1
3     1 115258827  5 HG00138         0|1
4     1 115258827  2 HG00160         1|0
5     1 115258827  4 HG00232         1|0
6     1 115258827  9 HG00251         1|0

> cor.test(hom$dp[1],het$dp[1])
Error in cor.test.default(hom$dp[1], het$dp[1]) : 
  not enough finite observations

> cor.test(hom$dp[1:2],het$dp[1:2])
Error in cor.test.default(hom$dp[1:2], het$dp[1:2]) : 
  not enough finite observations

> cor.test(hom$dp[1:3],het$dp[1:3])

    Pearson's product-moment correlation

data:  hom$dp[1:3] and het$dp[1:3]
t = NA, df = 1, p-value = NA
alternative hypothesis: true correlation is not equal to 0
sample estimates:
cor 
 NA 

Warning message:
In cor(x, y) : the standard deviation is zero

ADD REPLY • link 6.6 years ago by cpad0112 21k

0

Entering edit mode

Yes, I see you point. When to increase the dimension to 4 you will not get the error. But first, please note that the solution that Nicolas suggested computes correlation between the mean of DP for hom and het. So, it end up having 2 values and cor.test does not work for it. Second, the length of hom and het is not equal, so I will get an error message if I want to do cor.test(hom$dp,het$dp)

ADD REPLY • link 6.6 years ago by Ana ▴ 200

0

Entering edit mode

I compute the mean value per genomic position. So you will end with two dataframes with the genomic position common to both original dataframes. Each position containing the mean of the DP values for this position. Thus not one value...

ADD REPLY • link 6.6 years ago by Nicolas Rosewick 11k