Are there any statistical method is suitable to solve
1
0
Entering edit mode
3.5 years ago
octpus616 ▴ 120

Hi,

I recently pay attention to study the effect of specific genomic variation on gene expression. However I have some confusion about statistical analysis.

Firstly, please allow me use some simulated data to describe the situation I encountered. I have apply RNA-seq on different samples and get a expression matrix like this:

set.seed(1234)
exp_matrix <- replicate(10, rnorm(4))
rownames(exp_matrix) <- paste0("gene_", letters[1:dim(exp_max)[1]])
colnames(exp_matrix) <- paste0("sample_", letters[1:dim(exp_max)[2]])
exp_matrix

         sample_a   sample_b   sample_c    sample_d   sample_e   sample_f   sample_g
gene_a -1.2070657  0.4291247 -0.5644520 -0.77625389 -0.5110095  0.1340882 -0.6937202
gene_b  0.2774292  0.5060559 -0.8900378  0.06445882 -0.9111954 -0.4906859 -1.4482049
gene_c  1.0844412 -0.5747400 -0.4771927  0.95949406 -0.8371717 -0.4405479  0.5747557
gene_d -2.3456977 -0.5466319 -0.9983864 -0.11028549  2.4158352  0.4595894 -1.0236557
         sample_h   sample_i   sample_j
gene_a -0.0151383 -0.7094400 -2.1800396
gene_b -0.9359486 -0.5012581 -1.3409932
gene_c  1.1022975 -1.6290935 -0.2942939
gene_d -0.4755931 -1.1676193 -0.4658975

At the same time, I have a matrix that describes whether there is a interesting variation model on a specific gene region in a specific sample(if hit, noted as 1), Its look like this:

phe_martix <- replicate(10, ifelse(sample(16, 4) > 2, 0, 1))
rownames(phe_martix) <- paste0("gene_", letters[1:dim(exp_max)[1]])
colnames(phe_martix) <- paste0("sample_", letters[1:dim(exp_max)[2]])
phe_martix
           sample_a sample_b sample_c sample_d sample_e sample_f sample_g sample_h sample_i
gene_a        1        0        0        0        0        1        0        0        0
gene_b        0        0        0        1        0        0        0        0        0
gene_c        0        1        0        0        1        0        0        1        0
gene_d        0        0        0        0        0        0        0        0        0
       sample_j
gene_a        0
gene_b        0
gene_c        1
gene_d        0

Please note that the our interesting variation model is common across sample,But for specific genes, it is rare, so I am unlikely to get duplicates at the gene level.

I have three questions here:

  1. Is there any way to statistical method to inference that does this variation lead to up-regulation or down-regulation of gene expression?
  2. For specific hit genes, can I infer whether it has significant difference to background (mean) expression? For example, the gene_b of sample_d is a hit gene, does its expression levels higher or lower than background expression (noted as 0)?
  3. Are the one sample z-test or t-test is suitable to solve Q2?

This is may a layman-like question. But I really confused for it. So I hope hear your kindly advice.

Thanks for you reading.

Best.
Zhang

z-test • 1.2k views
ADD COMMENT
3
Entering edit mode
3.5 years ago
Michael 55k

I think you could be looking for expression quantitative trait loci (eQTL).

Here is a short overview article.

But I am not sure if that concept is applicable to your condensed data. You would need the original genotyping information to calculate an association test between variants and gene expression phenotype, not just whether there is any "interesting" (by what means?) variant in a gene (e.g. a SNP). Instead, you have to let the association test decide about variants that may be linked to gene expression. So the answer to Q1 is "yes, association testing".

There is an R-package called Matrix eQTL you could try. If you can convert your data into its input format, that should be fine, but I don't know if you have enough samples to achieve good power to detect anything.

For Q2 and Q3 you needed to define "background expression" first. In principle you could use your binary variable "hit/no hit" to divide samples into two groups (per gene) and do a test for differential expression. What is best for that purpose depends on how the expression values were obtained. A wilcoxon.test would always work but may be severely underpowered with your example data. On the other hand, standard DGE methods like DESeq2 use sample groupings that are the same for each gene and are therefore not applicable.

Edit: here is some R-code, thanks to the reproducible example you gave, that calculates t.test and wilcox.test for each gene, based on the grouping by phenotype.

set.seed(1234)
exp_matrix <- replicate(10, rnorm(4))
rownames(exp_matrix) <- paste0("gene_", letters[1:dim(exp_matrix)[1]])
colnames(exp_matrix) <- paste0("sample_", letters[1:dim(exp_matrix)[2]])

phe_martix <- replicate(10, ifelse(sample(16, 4) > 2, 0, 1))
rownames(phe_martix) <- paste0("gene_", letters[1:dim(exp_matrix)[1]])
colnames(phe_martix) <- paste0("sample_", letters[1:dim(exp_matrix)[2]])

# could also filter the matrix to only include rows with > 2 integration events
mat <- cbind(exp_matrix, phe_martix)
# run the combined matrix through the tests, use try because of lacking samples
ret <- apply(mat, 1, function(x){
    exp <- x[1:ncol(exp_matrix)] 
    phe <- as.factor(x[(ncol(exp_matrix)+1):(2*ncol(exp_matrix))])
    list (
           t.test=try(t.test(exp ~ phe)), 
           w.test=try(wilcox.test(exp ~ phe)))
})
## returns a list of the test results for each gene
ADD COMMENT
0
Entering edit mode

Hi, Dondrup

Thanks for your kindly help, The content you just shared has inspired me a lot, I am learning about eQTL now. Thanks again.

Meanwhile, please allow me explain the Q1 and "interesting" more. In fact, I think using real data and real scientific questions in the community will be more convenience to communication, But some of my collaborators seem to have different views on this, so I am sorry that I can only continue with a modeling problem for now. Consider a situation where a retrovirus infects a cell and integrates into the genome. We found that this integration has a certain tendency, but it is still widely scattered around many loci. Further, we called the SV from WGS data and marked the genes adjacent to the integration region as 1 in matrix (phe_martix). We further assume that this integration has a positive effect on transcription, and hope to get further evidences.

I hope I can have chance to get your more reply.

Best.

Zhang.

ADD REPLY
1
Entering edit mode

I have edited my answer, I hope it helps.

ADD REPLY
0
Entering edit mode

Hi, Dondrup

Thanks for your clearly code, It really provide a feasible way to filter significate genes, although it may seems to be less power due to the sample is lacking. This is indeed the most reliable method we known. Thanks for your share.

Best.

Zhang.

ADD REPLY
0
Entering edit mode

Ok, I see, that is a fundamentally different type of variation than I had thought.

ADD REPLY

Login before adding your answer.

Traffic: 2313 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6