Hi all,
I have studied many sources, like this and this, that try to relate the expression of a gene to nearby variants (SNPs), but none of them answers a question I have. Genotypes are coded in 3 ways: "0" means 0 copies of the minor allele (ref/ref), "1" means 1 copy (ref/alt), and "2" means 2 copies (alt/alt). If we only consider SNPs within 100 kbp upstream and downstream of the TSS (transcription start site), we may have ~20 SNPs per gene, so there will be a lot of collinearity among the independent variables (the genotypes).
This is a sample of the table on which I will run linear regression (function "lm" in R):
donor    SNP1  SNP2  SNP3  SNP4  ...  Gene expression
donor1   0     1     0     1     ...  3.5
donor2   0     1     0     1     ...  4.5
donor3   0     0     0     0     ...  3.0
donor4   1     1     0     1     ...  5.5
donor5   0     1     0     1     ...  1.5
...
I have ~400 donors, and many of them, like donor1 and donor5, have identical genotypes across these SNPs. When I run the linear regression, this warning arises: "prediction from a rank-deficient fit may be misleading"
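To make the problem concrete, here is a minimal R sketch built only from the five example rows above (toy values, not my real data; the column name `expr` is just an illustrative stand-in for the gene expression column). In these rows SNP3 is constant and SNP4 is identical to SNP2, so I assume that is what makes the design matrix rank-deficient here, and lm() reports NA for the coefficients it cannot estimate:

```r
# Toy data copied from the example rows above (illustrative only).
geno <- data.frame(
  SNP1 = c(0, 0, 0, 1, 0),
  SNP2 = c(1, 1, 0, 1, 1),
  SNP3 = c(0, 0, 0, 0, 0),   # constant: carries no information
  SNP4 = c(1, 1, 0, 1, 1),   # identical to SNP2: perfectly collinear
  expr = c(3.5, 4.5, 3.0, 5.5, 1.5),
  row.names = paste0("donor", 1:5)
)

fit <- lm(expr ~ SNP1 + SNP2 + SNP3 + SNP4, data = geno)

# Coefficients that could not be estimated come back as NA.
print(coef(fit))

# Compare the rank of the design matrix with its number of columns.
X <- model.matrix(fit)
cat("rank:", qr(X)$rank, "of", ncol(X), "columns\n")

# Names of the aliased (non-estimable) predictors.
aliased <- names(which(is.na(coef(fit))))
print(aliased)
```

On the real data the same check (comparing qr(X)$rank with ncol(X) and listing the NA coefficients) should show which SNPs are redundant with the others.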
So what should I do? Am I doing something wrong?
Thanks a lot.
Can you show the model that you are fitting?
I am doing this: