Question

Unbalanced nested ANOVA in R: how to construct data frame and formulae properly?

2.0 years ago

vvs.hazia ▴ 10

I have a cohort of patients (n=50). From each of them I have tumor samples (n=1-5), and I received the variants that were found in each tumor, about 1000 per sample on average. I want to compare the data with a nested ANOVA in R. So far, I think the input data frame should look like this:

Patient Sample SNP Variant_Freq
P1 S1 snp1 vf1
P1 S1 snp2 vf2
...
P1 S1 snp1000 vf1000
... 
P50 S4' snp1000 vf1000'

Question 1: Should I include the SNP names as a factor?

Question 2: Should Variant_Freq cover every SNP observed across all patients (with vf=0 for a SNP that was not detected in a given sample)? Or is it better to include only the SNPs that were found in each sample (no vf=0 rows)?

While trying to understand how to do nested ANOVA in R, I came up with the following code:

mod1.lm <- lm(VF ~ Patient + Morphotype + SNP + Patient:Morphotype + Patient:Morphotype:SNP, data = data)
mod2.lm <- lm(VF ~ Patient + SNP + Patient/Morphotype + Patient:SNP + Patient/Morphotype:SNP, data = data)
mod12.anova <- anova(mod1.lm, mod2.lm)

Question 3: Which of these formulae is the proper one, or is neither?
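To see what the two formulae actually ask for, it can help to let R expand them into individual terms. The sketch below uses only base R (`terms()`), assumes the patient column is spelled `Patient`, and introduces a hypothetical helper `expand_terms`. It does not say which model is correct, it only shows how they differ.

```r
# Hypothetical helper: list the individual terms a formula expands to.
expand_terms <- function(f) attr(terms(f), "term.labels")

# In R's formula syntax, a/b is shorthand for a + a:b (b nested within a).
nested <- expand_terms(VF ~ Patient/Morphotype)
print(nested)

# The two formulae from the post (assuming the column is named Patient).
f1 <- VF ~ Patient + Morphotype + SNP + Patient:Morphotype + Patient:Morphotype:SNP
f2 <- VF ~ Patient + SNP + Patient/Morphotype + Patient:SNP + Patient/Morphotype:SNP
t1 <- expand_terms(f1)
t2 <- expand_terms(f2)

# Terms that appear in one model but not in the other.
print(setdiff(t1, t2))
print(setdiff(t2, t1))
```

Note that `anova(mod1.lm, mod2.lm)` is meant for nested models, so it is worth checking from these lists whether one set of terms is contained in the other.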

Thanks for any suggestions!

EDITED

I tried to run model 1 on the real dataset, with and without vf=0, and R gives a memory error.

Question 4: Does it make sense to split the analysis into batches? If so, how can I compare the results across batches?

anova DNAseq SNP R • 1.2k views

ADD COMMENT • link updated 2.0 years ago by Ram 45k • written 2.0 years ago by vvs.hazia ▴ 10

What is variant frequency measuring? Is there variability of the DNA in each sample? A SNP could have several states, either the reference or a variant (one, two or three options), so vf=0 is important.
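If you do keep the vf=0 rows, one way to build them is to cross every sequenced sample with every SNP seen in the cohort and fill the gaps with 0. This is only a sketch on made-up values, with column names (`Patient`, `Sample`, `SNP`, `VF`) assumed from the question, and it uses base R only.

```r
# Toy table holding only the detected variants (made-up values).
detected <- data.frame(
  Patient = c("P1", "P1", "P1", "P2"),
  Sample  = c("S1", "S1", "S2", "S1"),
  SNP     = c("snp1", "snp2", "snp1", "snp2"),
  VF      = c(0.40, 0.10, 0.35, 0.25),
  stringsAsFactors = FALSE
)

# Every patient/sample pair that appears in the table.
samples <- unique(detected[, c("Patient", "Sample")])

# Cross each sample with every SNP seen anywhere in the cohort.
snps <- data.frame(SNP = unique(detected$SNP), stringsAsFactors = FALSE)
full_grid <- merge(samples, snps, by = NULL)  # by = NULL gives a cross join

# Left-join the detected values; combinations without a call become VF = 0.
# Assumption: the position was covered in that sample. An uncovered
# position is missing data, not a frequency of 0.
full <- merge(full_grid, detected,
              by = c("Patient", "Sample", "SNP"), all.x = TRUE)
full$VF[is.na(full$VF)] <- 0
print(full)
```

A sample with no detected variant at all would not show up in `samples` here, so in practice the sample list should come from the sample sheet rather than from the variant table.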

Also, you should test each SNP individually if you want to find SNPs that differ between morphotypes (which is what I assume you are trying to figure out; you didn't specify that).

Lastly, I would personally treat patientID as a random effect and use lmer from lme4.

ADD REPLY • link 2.0 years ago by Asaf 10k

Thank you for your reply, Asaf!

What is variant frequency measuring?

I got the variant frequency from the VCF file after variant calling. It shows what fraction of the reads covering this position carry the variant.
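In other words, the value is a simple ratio of read counts. A tiny worked example with made-up counts follows; the variable names are hypothetical, and the actual VCF fields that hold these counts depend on the variant caller.

```r
# Made-up per-site read counts, as might be extracted from a VCF.
alt_reads   <- c(12, 0, 45)   # reads supporting the variant allele
total_reads <- c(60, 80, 90)  # all reads covering the position

# Variant frequency = variant-supporting reads / all covering reads.
vf <- alt_reads / total_reads
print(vf)
```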

Is there variability of the DNA in each sample?

Yes, the samples do vary in their DNA. They were taken from different parts of a heterogeneous tumor.

A SNP could have several states, either the reference or a variant (one, two or three options), so vf=0 is important.

Thank you for the remark about the importance of vf=0; I will take that into account.

Also, you should test each SNP individually if you want to find SNPs that differ between morphotypes (which is what I assume you are trying to figure out; you didn't specify that).

I had not thought of it this way before; it might actually solve the memory problem too. I will try that.

Lastly, I would personally treat patientID as a random effect and use lmer from lme4.

Maybe you are right; they were selected "randomly", as random as it can be in research. I will read more about these packages.

ADD REPLY • link updated 2.0 years ago by Ram 45k • written 2.0 years ago by vvs.hazia ▴ 10

In this case I would iterate over the SNPs, filter the table for each SNP, and then fit a formula like:

lmer(VF ~ Morphotype + (1 | Patient), data = data[data$SNP == snp, ])

in lme4
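The loop could be sketched roughly like this. The data are simulated (made-up values), the helper `fit_one_snp` is hypothetical, and the design assumes each patient has samples of both morphotypes, which may not hold for the real cohort.

```r
library(lme4)

set.seed(1)
# Simulated long-format data: 10 patients, 2 morphotypes,
# 2 samples per morphotype, 3 SNPs (all values are made up).
design <- expand.grid(
  Patient    = paste0("P", 1:10),
  Morphotype = c("A", "B"),
  Replicate  = 1:2,
  SNP        = paste0("snp", 1:3),
  stringsAsFactors = FALSE
)
patient_shift <- rnorm(10, sd = 0.05)
names(patient_shift) <- paste0("P", 1:10)
design$VF <- 0.3 +
  unname(patient_shift[design$Patient]) +          # per-patient offset
  ifelse(design$Morphotype == "B", 0.1, 0) +       # morphotype effect
  rnorm(nrow(design), sd = 0.05)                   # residual noise

# Fit one mixed model per SNP with Patient as a random intercept
# and return the estimated Morphotype effect (B vs. A).
fit_one_snp <- function(snp, data) {
  sub   <- data[data$SNP == snp, ]
  fit   <- lmer(VF ~ Morphotype + (1 | Patient), data = sub)
  coefs <- coef(summary(fit))
  data.frame(
    SNP      = snp,
    estimate = coefs["MorphotypeB", "Estimate"],
    t_value  = coefs["MorphotypeB", "t value"]
  )
}

results <- do.call(rbind, lapply(unique(design$SNP), fit_one_snp, data = design))
print(results)
```

Fitting one small model per SNP keeps memory use low compared with one model containing every SNP as a factor. lme4 itself reports t values but no p-values, and with around 1000 SNPs some multiple-testing correction would be needed on whatever p-values you derive.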

ADD REPLY • link 2.0 years ago by Asaf 10k

It's spelled anova (ANalysis Of VAriance), not annova.

ADD REPLY • link 2.0 years ago by Ram 45k