I have a cohort of patients (n=50), and from each of them I have tumor samples (n=1-5). I received the variants that were found in each tumor, about 1000 per tumor on average. I want to compare the data with a nested ANOVA in R. So far, I think the input data frame should look like this:
Patient Sample SNP Variant_Freq
P1 S1 snp1...snp1000 vf1...vf1000
P1 S1 snp2 vf2
...
P1 S1 snp1000 vf1000
...
P50 S4' snp1000 vf1000'
Question 1: Should I include the "names" of the SNPs as a factor?
Question 2: Should Variant_Freq include all SNPs observed across all patients (so that an SNP that was not detected in a sample gets vf=0)? Or is it better to include only the SNPs that were found in each sample separately (no vf=0)?
While trying to understand how to do a nested ANOVA in R, I came up with the following code:
mod1.lm<-lm(VF~Patiant+Morphotype+SNP+Patiant:Morphotype+Patiant:Morphotype:SNP,data=data)
mod2.lm<-lm(VF~Patiant+SNP+Patiant/Morphotype+Patiant:SNP+Patiant/Morphotype:SNP, data=data)
mod12.anova<-anova(mod1.lm,mod2.lm)
Question 3: Which of these formulas is the proper one, if either?
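One way to see what each formula actually asks R to fit is to expand it with terms(), which needs no data. This is only a sketch for inspecting the two formulas above; it does not say which one is appropriate. The small Patient/Sample formula is a made-up illustration of the nesting operator.

```r
# In R formulas, a/b is shorthand for a + a:b (b nested within a).
# Patient and Sample here are illustrative names only.
nested <- attr(terms(~ Patient / Sample), "term.labels")
print(nested)

# The two formulas from the question, copied verbatim.
f1 <- VF ~ Patiant + Morphotype + SNP + Patiant:Morphotype + Patiant:Morphotype:SNP
f2 <- VF ~ Patiant + SNP + Patiant/Morphotype + Patiant:SNP + Patiant/Morphotype:SNP

# Expanded term labels; printing them side by side shows where the models differ.
labels1 <- attr(terms(f1), "term.labels")
labels2 <- attr(terms(f2), "term.labels")
print(labels1)
print(labels2)
```

Comparing the two printed lists shows, for example, whether a main effect for Morphotype is present in each model.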
Thanks for any suggestions!
EDITED
I tried to run model 1 on the real dataset, with and without vf=0, and R gives a memory error.
Question 4: Does it make sense to split the analysis into batches? If yes, how can I compare the results for each batch?
What is variant frequency measuring? Is there variability in the DNA within each sample? A SNP can have several states, either the reference or a variant (one, two or three options), so vf=0 is important.
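A minimal sketch of filling in vf=0, assuming the long table layout from the question (Patient, Sample, SNP, Variant_Freq). The toy values and the fill_zeros helper are made up for illustration, and the choice to fill zeros within each patient (rather than across the whole cohort) is an assumption. Note that a missing call can also mean the position was not covered, which this sketch does not check.

```r
# Toy table of called variants; values are invented.
calls <- data.frame(
  Patient = c("P1", "P1", "P1", "P2"),
  Sample = c("S1", "S1", "S2", "S1"),
  SNP = c("snp1", "snp2", "snp1", "snp3"),
  Variant_Freq = c(0.40, 0.10, 0.35, 0.50),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one patient, pair every sample with every SNP
# seen in that patient, and set Variant_Freq to 0 where no call exists.
fill_zeros <- function(d) {
  grid <- expand.grid(
    Sample = unique(d$Sample),
    SNP = unique(d$SNP),
    stringsAsFactors = FALSE
  )
  grid$Patient <- d$Patient[1]
  out <- merge(grid, d, by = c("Patient", "Sample", "SNP"), all.x = TRUE)
  out$Variant_Freq[is.na(out$Variant_Freq)] <- 0
  out
}

filled <- do.call(rbind, lapply(split(calls, calls$Patient), fill_zeros))
rownames(filled) <- NULL
print(filled)
```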
Also, you should test each SNP individually if you want to find SNPs that differ between morphotypes (which is what I assume you are trying to figure out; you didn't specify that).
Lastly, I would personally treat patientID as a random variable and use lme4 or lmer.
Thank you for your reply, Asaf!
I got the variant frequency from the VCF file after variant calling. It shows what fraction of the reads covering a position carry the variant.
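As a small worked example of that definition, assuming you have the variant-supporting read count and the total depth per position (the vector names and numbers below are made up):

```r
# Invented read counts for three positions.
alt_reads <- c(12, 0, 45)   # reads carrying the variant
depth <- c(60, 80, 90)      # all reads covering the position

# Variant frequency = variant reads / total reads at that position.
vf <- alt_reads / depth
print(vf)
```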
Yes, the samples do vary in their DNA. They were taken from different parts of a heterogeneous tumor.
Thank you for the remark about the importance of vf=0; I will take that into account.
I had not thought of it this way before. It might actually be a solution to the memory problem too... I will try that.
Maybe you are right; they were selected "randomly", as random as it can be in research. I will read more about these packages.
In this case I would iterate over the SNPs, filter the table for each SNP, and then fit a formula in lme4.
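A rough sketch of that per-SNP loop with lme4. The formula from the reply is not preserved in the thread, so the one used here (Morphotype as a fixed effect, Patient as a random intercept) is an assumption, as are the column names, the toy data and the test_one_snp helper. Fitting one SNP at a time also keeps each model small, which may help with the memory error.

```r
library(lme4)

set.seed(1)
# Toy long table: 6 patients x 2 morphotypes x 2 replicate samples x 3 SNPs.
dat <- expand.grid(
  Patient = paste0("P", 1:6),
  Morphotype = c("A", "B"),
  Rep = 1:2,
  SNP = paste0("snp", 1:3),
  stringsAsFactors = FALSE
)
# Invented frequencies: a per-patient shift plus noise.
shift <- setNames(rnorm(6, sd = 0.1), paste0("P", 1:6))
dat$VF <- 0.3 + shift[dat$Patient] + rnorm(nrow(dat), sd = 0.05)

# Hypothetical helper: for one SNP's rows, compare a model with Morphotype
# against one without it (assumed formula; Patient is a random intercept).
test_one_snp <- function(d) {
  full <- lmer(VF ~ Morphotype + (1 | Patient), data = d, REML = FALSE)
  null <- lmer(VF ~ 1 + (1 | Patient), data = d, REML = FALSE)
  anova(null, full)[2, "Pr(>Chisq)"]
}

# Split the table by SNP and test each subset separately.
pvals <- sapply(split(dat, dat$SNP), test_one_snp)
print(pvals)
```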
It's spelled anova (ANalysis Of VAriance), not annova.