I wish to determine a significance threshold for a small eQTL study.
I computed p-values for the observed data under the model,
Y ~ A
where Y is gene expression and A is the snp.
I then generated a 1000 permutations of Y and recomputed the p-values. I wish to use these permutations to calculate a cut-off but am not at all certain how to go about it.
Calculate the single lowest P value for a given gene across all SNPs
For X permutations, scramble genotypes and re-calculate lowest P value across all scrambled SNPs. Sort this list from low to high.
For each gene, adjusted P value is set to the rank of observed in the permutation list divided by X
Use adjusted P values as input to qvalue to estimate FDR for a given adjust P value cutoff.
This method was suggested by John Storey, author of (among other things) the qvalue package.
This gives you the estimated q value for accepting permutation P values as strong or stronger than a given cut-off and side-steps the issue of how to calculate multiple loci affecting the same gene.
What I always still wonder with this kind of FDR approach is that it sounds like you're very very tough on getting your Null distribution by always taking the most negative (as in worse for your FDR) from each permutation and using this in your ranking. But indeed this seems to be the standard way of doing it.
The first part of the procedure answers the question, "how likely is it I would see a statistic this strong or stronger by chance, given that I'm considering every SNP?". The second part answers the question, "If I did that process for 15,000 genes, what would be the false discovery rate for declaring a given P value significant?" I'm sure there are other statistically defensible methods one could devise, but a less conservative method would require clear justification.
For each gene, adjusted P value is set to the rank of observed in the permutation list divided by X"
rank of what observed? Do you mean the gene's rank in the p value-sorted permutation list?
Also, if you run say, 10 permutations, then you have ten permutation lists
so which adjusted P do you use? the average of all ten adjusted P values?
ADD REPLY
• link
updated 2.7 years ago by
Ram
44k
•
written 9.8 years ago by
bmpbowen
▴
40
1
Entering edit mode
Let's assume you're using linear regression for your statistic. The observed value is the absolute value of the F statistic for your regression (or the corresponding P value, which costs extra time to calculate so I generally don't work it out until the end) calculated using the true, observed genotypes and gene expression. The permutation P values are any number of F statistics that I calculate after scrambling the relationship between genotypes and phenotypes. The F statistic chosen for a particular gene is the single most extreme (i.e. the strongest) value I saw by chance after considering all of the SNPs. If my observed F is 4, meaning that the strongest F statistic for a given gene-SNP pair in the real data was 4, and my 10 permutation F values are [7,5,3,2,1,1,1,1,1,1] then the rank of my observed F is 3 and the permutation P value is 3/10 = 0.3. Note that this rank is sorted from highest to lowest, and an observed F statistic of -9 would be stronger than a permutation F statistic of 7. If anything is still unclear I urge you to read (Churchill & Doerge Genetics 1994) which has a lucid and short explanation of the idea.
I'm new to permutation analyses of eQTL data. When you suggest to scramble genotypes, is is sufficient to simply shuffle around the SNP labels, or would one have to actually shuffle around the genotypes? Meaning, if I have a matrix as follows:
SNP1 SNP2 SNP3
Person 1 AA CT GG
Person 2 AT CC GA
Would I shuffle like this:
SNP3 SNP1 SNP2
Person 1 AA CT GG
Person 2 AT CC GA
or like this:
SNP1 SNP2 SNP3
Person 1 AT CC GA
Person 2 AA CT GG
or both?
SNP3 SNP1 SNP2
Person 1 AT CC GA
Person 2 AA CT GG
ADD REPLY
• link
updated 2.7 years ago by
Ram
44k
•
written 9.8 years ago by
bmpbowen
▴
40
0
Entering edit mode
I scramble the indexes of the samples for the genotype matrix. This has the effect of scrambling the relationship between a particular person's true genotype and their gene expression, which I leave unchanged. Scrambling the genotype values would also work but for a human-sized (e.g. Affy 6.0) dataset is computationally inefficient.
Thanks! I ended up scrambling the index of samples for the gene expression, which I think achieved the same effect--namely, mixing which individual's genotype points to which individual's gene expression.
What if you have covariates that you need to include in your model? Would you scramble those as well? Or would you need your covariates to still "match" your permuted sample IDs? Or would you omit them altogether because once you shuffle the expression ID and the genotype ID, the covariates won't match no matter what? Hope that makes sense.
ADD REPLY
• link
updated 2.7 years ago by
Ram
44k
•
written 9.8 years ago by
bmpbowen
▴
40
1
Entering edit mode
At this point I think you should ask your local statistician or try the question out at Cross Validated, the stats version of Biostar. I don't have a definitive answer for you, but my intuition suggests that it will important to consider whether you are 1) testing for interactions between genotype, covariates, and expression or 2) simply treating covariates as additive effects in the regression model. If you try cross-validated, I suggest you look first for discussion of non-parametric regression. I would say go straight to a textbook but I'm not expert enough to give you a good steer there, and nothing replaces a knowledgeable statistician.
From what I know, if you have covariates in the model, you need to preserve the coupling between the covariates and the gene expression, so you just scramble the genotype matrix. Of course, in the end it depends on what you are trying to understand, but if the aim is to break the structure between the SNPs and the genes and covariates (which is what I think you are looking for if you are just treating the covariates as additive effects) , you would have to scramble the genotype only.
Hope this helps.
ADD COMMENT
• link
updated 2.7 years ago by
Ram
44k
•
written 9.8 years ago by
dharl
▴
70
0
Entering edit mode
Hi, I am a newbie in R programming and need to estimate permutation based threshold for miRNA expression analysis. I need to scramble the case-control status of dataset X affects generate 1000 permuted dataset and then run a glm model using another dataset (Y) where i have 300 miRNAs to find p-value for the model logit (case-control)=miRNA. The method described here sounds similar to what i want to do. Can anyone kindly share his/her R code with me please? Thank you
What I always still wonder with this kind of FDR approach is that it sounds like you're very very tough on getting your Null distribution by always taking the most negative (as in worse for your FDR) from each permutation and using this in your ranking. But indeed this seems to be the standard way of doing it.
The first part of the procedure answers the question, "how likely is it I would see a statistic this strong or stronger by chance, given that I'm considering every SNP?". The second part answers the question, "If I did that process for 15,000 genes, what would be the false discovery rate for declaring a given P value significant?" I'm sure there are other statistically defensible methods one could devise, but a less conservative method would require clear justification.
rank of what observed? Do you mean the gene's rank in the p value-sorted permutation list?
so which adjusted P do you use? the average of all ten adjusted P values?
Let's assume you're using linear regression for your statistic. The observed value is the absolute value of the F statistic for your regression (or the corresponding P value, which costs extra time to calculate so I generally don't work it out until the end) calculated using the true, observed genotypes and gene expression. The permutation P values are any number of F statistics that I calculate after scrambling the relationship between genotypes and phenotypes. The F statistic chosen for a particular gene is the single most extreme (i.e. the strongest) value I saw by chance after considering all of the SNPs. If my observed F is 4, meaning that the strongest F statistic for a given gene-SNP pair in the real data was 4, and my 10 permutation F values are [7,5,3,2,1,1,1,1,1,1] then the rank of my observed F is 3 and the permutation P value is 3/10 = 0.3. Note that this rank is sorted from highest to lowest, and an observed F statistic of -9 would be stronger than a permutation F statistic of 7. If anything is still unclear I urge you to read (Churchill & Doerge Genetics 1994) which has a lucid and short explanation of the idea.
Should this method work on cis-eqtls? Do you have a paper I can read for details of this methods?
I've taken this approach in
I performed cis-only analysis in human data in Quigley et al. Molecular Oncology 2013 (PMID 24388359)
I'm new to permutation analyses of eQTL data. When you suggest to scramble genotypes, is is sufficient to simply shuffle around the SNP labels, or would one have to actually shuffle around the genotypes? Meaning, if I have a matrix as follows:
Would I shuffle like this:
or like this:
or both?
I scramble the indexes of the samples for the genotype matrix. This has the effect of scrambling the relationship between a particular person's true genotype and their gene expression, which I leave unchanged. Scrambling the genotype values would also work but for a human-sized (e.g. Affy 6.0) dataset is computationally inefficient.
Thanks! I ended up scrambling the index of samples for the gene expression, which I think achieved the same effect--namely, mixing which individual's genotype points to which individual's gene expression.
What if you have covariates that you need to include in your model? Would you scramble those as well? Or would you need your covariates to still "match" your permuted sample IDs? Or would you omit them altogether because once you shuffle the expression ID and the genotype ID, the covariates won't match no matter what? Hope that makes sense.
At this point I think you should ask your local statistician or try the question out at Cross Validated, the stats version of Biostar. I don't have a definitive answer for you, but my intuition suggests that it will important to consider whether you are 1) testing for interactions between genotype, covariates, and expression or 2) simply treating covariates as additive effects in the regression model. If you try cross-validated, I suggest you look first for discussion of non-parametric regression. I would say go straight to a textbook but I'm not expert enough to give you a good steer there, and nothing replaces a knowledgeable statistician.