Hi,
I recently pay attention to study the effect of specific genomic variation on gene expression. However I have some confusion about statistical analysis.
Firstly, please allow me use some simulated data to describe the situation I encountered. I have apply RNA-seq on different samples and get a expression matrix like this:
set.seed(1234)
exp_matrix <- replicate(10, rnorm(4))
rownames(exp_matrix) <- paste0("gene_", letters[1:dim(exp_max)[1]])
colnames(exp_matrix) <- paste0("sample_", letters[1:dim(exp_max)[2]])
exp_matrix
sample_a sample_b sample_c sample_d sample_e sample_f sample_g
gene_a -1.2070657 0.4291247 -0.5644520 -0.77625389 -0.5110095 0.1340882 -0.6937202
gene_b 0.2774292 0.5060559 -0.8900378 0.06445882 -0.9111954 -0.4906859 -1.4482049
gene_c 1.0844412 -0.5747400 -0.4771927 0.95949406 -0.8371717 -0.4405479 0.5747557
gene_d -2.3456977 -0.5466319 -0.9983864 -0.11028549 2.4158352 0.4595894 -1.0236557
sample_h sample_i sample_j
gene_a -0.0151383 -0.7094400 -2.1800396
gene_b -0.9359486 -0.5012581 -1.3409932
gene_c 1.1022975 -1.6290935 -0.2942939
gene_d -0.4755931 -1.1676193 -0.4658975
At the same time, I have a matrix that describes whether there is a interesting variation model on a specific gene region in a specific sample(if hit, noted as 1), Its look like this:
phe_martix <- replicate(10, ifelse(sample(16, 4) > 2, 0, 1))
rownames(phe_martix) <- paste0("gene_", letters[1:dim(exp_max)[1]])
colnames(phe_martix) <- paste0("sample_", letters[1:dim(exp_max)[2]])
phe_martix
sample_a sample_b sample_c sample_d sample_e sample_f sample_g sample_h sample_i
gene_a 1 0 0 0 0 1 0 0 0
gene_b 0 0 0 1 0 0 0 0 0
gene_c 0 1 0 0 1 0 0 1 0
gene_d 0 0 0 0 0 0 0 0 0
sample_j
gene_a 0
gene_b 0
gene_c 1
gene_d 0
Please note that the our interesting variation model is common across sample,But for specific genes, it is rare, so I am unlikely to get duplicates at the gene level.
I have three questions here:
- Is there any way to statistical method to inference that does this variation lead to up-regulation or down-regulation of gene expression?
- For specific hit genes, can I infer whether it has significant difference to background (mean) expression? For example, the
gene_b
ofsample_d
is a hit gene, does its expression levels higher or lower than background expression (noted as 0)? - Are the one sample z-test or t-test is suitable to solve Q2?
This is may a layman-like question. But I really confused for it. So I hope hear your kindly advice.
Thanks for you reading.
Best.
Zhang
Hi, Dondrup
Thanks for your kindly help, The content you just shared has inspired me a lot, I am learning about eQTL now. Thanks again.
Meanwhile, please allow me explain the Q1 and "interesting" more. In fact, I think using real data and real scientific questions in the community will be more convenience to communication, But some of my collaborators seem to have different views on this, so I am sorry that I can only continue with a modeling problem for now. Consider a situation where a retrovirus infects a cell and integrates into the genome. We found that this integration has a certain tendency, but it is still widely scattered around many loci. Further, we called the SV from WGS data and marked the genes adjacent to the integration region as 1 in matrix (phe_martix). We further assume that this integration has a positive effect on transcription, and hope to get further evidences.
I hope I can have chance to get your more reply.
Best.
Zhang.
I have edited my answer, I hope it helps.
Hi, Dondrup
Thanks for your clearly code, It really provide a feasible way to filter significate genes, although it may seems to be less power due to the sample is lacking. This is indeed the most reliable method we known. Thanks for your share.
Best.
Zhang.
Ok, I see, that is a fundamentally different type of variation than I had thought.