I am using FastQTL (http://fastqtl.sourceforge.net/) for eQTL mapping with our genotyping (~20K SNPs) and RNA-Seq data. It basically takes a VCF file for the genotypes and a BED file for the gene expression levels.
For genome-wide analysis it has worked very well so far, but I am missing information for some regions because it throws errors during the permutation step, where it uses the GSL for maximum-likelihood estimation. The error is as below:

gsl: beta.c:44: ERROR: domain error
Default GSL error handler invoked.
Aborted
As I said, it doesn't occur frequently, and it doesn't affect all genomic regions. I have checked the values in the VCF and BED files many times, and they all seem OK. I analyze the genome in chunks, and if this error occurs in one of the chunks, I lose all the information in that chunk.
Since it mentions "beta", I think it might be related to the beta functions, but I couldn't find more information on how to fix the problem. Any suggestions on this error?
Can someone please explain in detail the meaning of the following section of the paper:
2.2 Finding a candidate QTL per phenotype

For simplicity, we will focus on a single molecular phenotype P quantified in a set of N samples. Let G be the set of genotype dosages at L variant sites located within a cis-window of ±W Mb of the genomic location of P. To discover the best candidate QTL for P, FastQTL measures Pearson product-moment correlation coefficients between P and all L variants in G, stores the most strongly correlated variant q ∈ G as candidate QTL, and assesses its significance by calculating a nominal P-value p_n with standard significance tests for Pearson correlation.
How can this be affected by a test where only one SNP and its LD proxies (r = 0.8) are tested?
Thanks!
I am not 100% sure, it's been a while since I last used FastQTL, but I will give it a try:
I think that, in theory, it should be OK to have a tagging SNP and its LD proxies for QTL mapping, because FastQTL uses two different multiple-testing corrections for your nominal associations: a permutation test for the best nominal candidate variant per phenotype (considering all the variants within, say, a 1 Mb window around the phenotype's TSS), and FDR estimation across the multiple tested phenotypes.
So if you have only one tagging SNP and its LD proxies, they will most probably show a similar nominal association with the phenotype (as if they form a haplotype that creates a potential QTL for the locus), and I think their adjusted P-values will also be similar. FastQTL first detects the best variant for a phenotype, then reports an adjusted P-value for that variant by keeping the genotypes of the cis variants fixed while randomly permuting the phenotype values; this estimates how frequently the permuted data produce an association stronger than the observed one for the given nominal variant-phenotype pair (in other words, how likely it is to be a true association). Here the authors recommend the beta-approximation method (with 1000 permutations).
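To make this concrete, here is a minimal simulated sketch of the scheme described above (the candidate-QTL scan plus the permutation pass). It is not FastQTL's actual implementation; the data, variable names, and parameter choices are invented for illustration.

```python
# Minimal simulated sketch (not FastQTL's code): find the best cis variant by
# Pearson correlation, then permute the phenotype to get an adjusted p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_samples, n_variants, n_perm = 200, 50, 1000

G = rng.binomial(2, 0.3, size=(n_samples, n_variants)).astype(float)  # dosages (0/1/2)
P = 0.5 * G[:, 10] + rng.normal(size=n_samples)                       # phenotype with one true QTL

def best_association(pheno, geno):
    """Index and |r| of the variant most strongly correlated with the phenotype."""
    pc = pheno - pheno.mean()
    gc = geno - geno.mean(axis=0)
    r = gc.T @ pc / (np.linalg.norm(gc, axis=0) * np.linalg.norm(pc))
    j = int(np.argmax(np.abs(r)))
    return j, abs(r[j])

j_best, r_obs = best_association(P, G)
p_nominal = stats.pearsonr(G[:, j_best], P)[1]           # nominal p-value of the candidate QTL

# Permutation pass: genotypes stay fixed, phenotype values are shuffled across samples.
r_null = np.array([best_association(rng.permutation(P), G)[1] for _ in range(n_perm)])
p_adjusted = (np.sum(r_null >= r_obs) + 1) / (n_perm + 1)  # empirical adjusted p-value

print(f"best variant: {j_best}, nominal p = {p_nominal:.2e}, permutation p = {p_adjusted:.4f}")
```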
So in my opinion the LD proxies shouldn't affect your results much in terms of the association of that SNP and its proxies with a phenotype; it might even be better to have them. The number of variants tested in the region does, however, affect the search for the best nominal association for a phenotype: for example, another SNP in the same region that was not genotyped/imputed might be the true QTL SNP for the phenotype (or at least have a stronger association with it). In other words, I think in this case you can only comment and speculate on the potential involvement of your SNP and its proxies as a QTL for a phenotype of interest within the cis-window.
I hope this was clear enough, and sorry if it's not; let me know if you need any further help. Right now I am actually working on a similar case where I have a few tagging SNPs (genotyped) and their LD proxies (imputed) for a region I am interested in, so I will give feedback and edit this post if needed.
Thank you very much for your comment, this makes a lot more sense now.
I also agree that having all of the LD proxies is probably better, since
"To discover the best candidate QTL for P, FastQTL measures Pearson product-moment correlation coefficients between P and all L variants in G" but I am not 100% sure.
Adding to what maegsul said: it is definitely to your advantage to have as many variants as possible. However, because of the way FastQTL works, the beta approximation also estimates the effective number of independent tests performed. So suppose that for a given gene you have one SNP, and every other SNP within the cis window of that gene is in LD with it. The beta approximation will account for that information, and the effective number of independent tests will be close to 1 (not the total number of SNPs, since they are all in LD). So you get to find, among all the SNPs in LD, the one with the strongest signal for the gene of interest, without penalizing yourself too much by treating all the other associations as additional independent tests.
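A rough way to see the "effective number of independent tests" idea (a simulation I made up, not FastQTL code): under the global null, the smallest of M independent p-values follows a Beta(1, M) distribution, so the second shape parameter fitted to the permutation null acts as an estimate of the effective number of tests. With variants in near-perfect LD, it stays close to 1:

```python
# Simulated illustration: 30 variants that are near-perfect proxies of one tag SNP
# behave like ~1 effective test, and the fitted beta shape reflects that.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_samples, n_variants, n_replicates = 200, 30, 500

tag = rng.binomial(2, 0.3, size=n_samples).astype(float)
G = tag[:, None] + rng.normal(scale=0.05, size=(n_samples, n_variants))  # near-perfect proxies

min_p = np.empty(n_replicates)
for i in range(n_replicates):
    pheno = rng.normal(size=n_samples)                                   # null phenotype, no QTL
    pvals = [stats.pearsonr(G[:, j], pheno)[1] for j in range(n_variants)]
    min_p[i] = min(pvals)

a_hat, b_hat, _, _ = stats.beta.fit(min_p, floc=0, fscale=1)
print(f"fitted shape parameters: a = {a_hat:.2f}, b = {b_hat:.2f} "
      f"(b is far closer to 1 than to {n_variants}: the proxies act as ~1 effective test)")
```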
Thank you very much for your insight, very helpful. Can you please also clarify whether it is necessary to consider the beta-approximated p-value?
For example, would a SNP that is significant by PermP but not by BETA remain significant? My output columns are: P, PermP, FWER, BETA.
Sure. You should use the beta-approximated p-values. The reason is that the p-values you get from permutation are limited by the number of permutations you do: even if there is a strong signal, your p-values have a lower bound defined by the number of permutations performed. The beta approximation gives you a better estimate of the p-value. To check that the beta approximation worked well, look at the correlation between it and the permutation p-values; it should be extremely close to 1 (not -1).

For instance, if you do 1,000 permutations and a gene truly is an eGene, the best (smallest) p-value you can get is (0 + 1)/(1000 + 1); the 1 is added so you never get a value of zero. If your genome-wide threshold is 1e-6, you will fail to reject the null using just the permutation p-value, even though the null would have been rejected if you had done, say, 2e6 permutations (where the smallest possible p-value, 1/(2e6 + 1), is below 1e-6). The beta approximation provides a way to avoid running 2e6 permutations to get at the true p-value.
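A small sketch of that point, with made-up numbers (not FastQTL output): the empirical permutation p-value is floored at 1/(n_perm + 1), while the beta approximation fitted to the same permutation null can extrapolate below that floor.

```python
# Toy illustration of the permutation floor versus the beta-approximated p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_perm = 1000

null_best_p = rng.beta(1.0, 8.0, size=n_perm)   # pretend "best nominal p-values" from the permutations
p_nominal = 1e-8                                # observed best nominal p-value for the gene

p_perm = (np.sum(null_best_p <= p_nominal) + 1) / (n_perm + 1)   # floored at ~1e-3

a_hat, b_hat, _, _ = stats.beta.fit(null_best_p, floc=0, fscale=1)
p_beta = stats.beta.cdf(p_nominal, a_hat, b_hat)                 # can be far smaller than the floor

print(f"permutation p = {p_perm:.4f} (floor = {1/(n_perm + 1):.4f}), beta-approximated p = {p_beta:.1e}")
```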
Thank you very much for your answer mobius, I have a follow-up question. If I'm only testing a restricted number of SNPs going into the analysis (e.g. 240), will the program compute the BETA p-value threshold based on the number of SNPs in my VCF file (0.05/240 = 0.0002083333)? Or does it somehow compute a genome-wide BETA p-value?
No, in the sense that the number of SNPs you use only comes into play in ascertaining what/where the strongest signal is for that gene; the more the merrier, since your variant set could otherwise contain only proxies of the true signal, which in turn weakens the signal. The null hypothesis in the permutation is whether the gene contains at least one eQTL. Of course, as a side effect of how the test is done, you also obtain the strongest eQTL (eQTL in the sense that the association is significant, at your chosen cutoff/alpha level, after correction for multiple testing across the genes; more on that below) for that gene.

Yes, in the sense that you can choose how to control the type I error rate using Benjamini-Hochberg FDR, the Storey q-value, or Bonferroni FWER, based on the beta p-values. However, the control is at the gene level, not based on the number of SNPs you have. The control is therefore only for the number of tests (at the gene level!) you run in your specific dataset, so you'll have to interpret your results with that in mind.
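As an illustration of the gene-level correction (toy p-values, not real output; multipletests from statsmodels is used only for convenience, it is not what FastQTL itself calls):

```python
# Toy example of correcting one beta-approximated p-value per gene across genes.
import numpy as np
from statsmodels.stats.multitest import multipletests

beta_pvals = np.array([1e-7, 3e-4, 0.002, 0.04, 0.2, 0.7])   # one per tested gene

_, bh_adjusted, _, _ = multipletests(beta_pvals, alpha=0.05, method="fdr_bh")
_, bonferroni, _, _ = multipletests(beta_pvals, alpha=0.05, method="bonferroni")

for p, q, b in zip(beta_pvals, bh_adjusted, bonferroni):
    print(f"gene-level p = {p:.1e}   BH = {q:.1e}   Bonferroni = {b:.1e}")
```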
Going through Ongen et al. 2016, I agree with you that the control is at the gene level, not based on the number of SNPs.
However, I still struggle to understand why, since this step is still considered "multiple testing".
Thank you so much for your help!
No problem :). Let me see if I can explain better. By "this step" I take it you mean the first step, where you find the strongest signal? If so, consider a case where you have just one gene and, say, a hundred variants with independent effects on the gene's expression level. When you run all 100 tests (let's call all these tests a family) at a type I error rate of, say, 5%, then by chance alone you expect to reject about 5 of those tests (1 in 20) even if there is no signal at all; in other words, the family as a whole is no longer at the 5% type I error rate you originally specified. Now when you examine the null (i.e., does the gene have any variant associated with it?) you will of course say yes, but your type I error rate has been inflated for that group of tests; it is no longer 5%. To deflate it back to your original level, you can use the FWER (family-wise error rate) approach, which acknowledges that the error rate has been inflated by the number of tests performed and brings it back to your pre-specified level (say 5%) by dividing 0.05 by 100, so that the error rate for that family of tests is again 5%.

Going back to the original question (does the gene have any, i.e. at least one, variant associated with it?), the permutation scheme gets to the root of that question directly and modifies the approach accordingly. Instead of testing all the associations and correcting for the inflated type I error rate, which may be too stringent, you take the observed estimate of the strongest signal for that gene (which could be the strongest, and significant if tested, just by chance). You then generate the distribution of the strongest signal's association with the gene under the null for that gene (i.e., that the gene has no variant associated with it, so it shouldn't matter which samples carry which genotypes with respect to that gene), and you use the definition of a p-value (the probability of obtaining any other "strongest" signal at least as extreme as the one you originally observed, under the null) to see whether what you have is weird; weird is good in this case. So by permuting you have implicitly corrected for the fact that you used multiple variants to determine the strongest signal: recall that you generate the distribution of the strongest signal under the assumption that the null hypothesis is true, and compare your observed strongest signal to that. Hope this helps!
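The arithmetic behind the inflation argument, as a quick check (my numbers, just illustrating the standard calculation):

```python
# 100 independent tests at alpha = 5%, under the global null.
m, alpha = 100, 0.05

expected_false_rejections = m * alpha              # about 5 rejections expected by chance
p_at_least_one = 1 - (1 - alpha) ** m              # ~0.994: almost certain to reject something
fwer_after_bonferroni = 1 - (1 - alpha / m) ** m   # ~0.049: back near the intended 5%

print(expected_false_rejections, round(p_at_least_one, 3), round(fwer_after_bonferroni, 3))
```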
Also note that if you test all the pairs for a given gene, correct for multiple testing using the FWER within that gene, then take the most significant gene-SNP pair's adjusted p-value and correct it across the genes tested, you will get similar results. A possible reason the results are similar but not identical is that FWER control maintains the type I error rate but can be too stringent. Why? Because you are also correcting for truly null tests, and so you penalize the non-null ones too much. As a result, you will find all the results from the FWER approach also showing up in the permutation approach, but not vice versa, due to the possibility I mentioned above.
Most likely your parameters for the beta distribution are wrong, i.e. negative. This could be due to numerical stability issues in the algorithm.
I don't know how FastQTL works in detail, but based on the paper the null distribution of the strongest association is approximated with a beta distribution, so another possibility is that the beta distribution is not a good approximation for those cases.
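For what it is worth, the domain error is consistent with that: the beta log-likelihood is only defined for strictly positive shape parameters, and GSL's default error handler aborts the whole process on such a domain error. A tiny illustration (SciPy is used here only as a stand-in to show the invalid-parameter case; FastQTL itself uses GSL):

```python
# The beta density/log-likelihood requires shape parameters a > 0 and b > 0.
# SciPy returns NaN for invalid shapes; GSL instead raises the domain error
# and its default error handler aborts the program.
from scipy import stats

print(stats.beta.logpdf(0.3, 2.0, 5.0))    # valid parameters: finite log-density
print(stats.beta.logpdf(0.3, -1.0, 5.0))   # negative shape parameter: nan
```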
Thank you very much for your reply. Indeed it could be due to negative parameters, but as far as I know the beta parameters are estimated from permutations (in FastQTL's case, an adaptive permutation scheme) that generate a set of P-values, which are then plugged into a log-likelihood equation. So I don't see why it would produce negative parameters. Maybe I can try updating to the latest version of GSL for better numerical stability.
Another option might be to (somehow) disable the "default GSL error handler" that aborts the computation for the affected chunk and discards all of its other information. What do you think?
From a quick look at the source code of beta.c, it fails because the function gsl_sf_lnbeta_sgn_e returns -1. This function appears to check the sign and the magnitude of the parameters, so this confirms that it is failing because FastQTL can't find adequate beta distribution parameters. You could suppress the error, but would you trust the results?
Thanks again for your help! The source code of beta.c explains it all; you were right. I also managed to contact the author of the FastQTL paper, and he says it is probably due to low variability in some of the transcript expression levels across samples. I've checked the failing ones, and I can confirm this: it is either many zero values, or almost identical expression levels for all samples.
Maybe if I filter out these transcripts, the tool will work properly.
And regarding your last remark: this error does not occur for a big portion of my phenotype data, so it would still be OK to suppress it somehow. As I said, even if it occurs for a single transcript, it wipes out all the other mapped eQTLs in the same chunk.
Hi maegsul,
What did you end up using as a workaround? Did you filter out the transcripts? I'm encountering the same issue.
Hi! Yes, I filtered out the transcripts with no variability (i.e. expression values of 0 for all individuals) with an awk one-liner. The same goes for the genotypes: if a variant has the same genotype in every sample, filter that variant out as well. After that, the tool runs without errors. This shouldn't affect your results or create bias, since you are only removing non-variable genotypes/phenotypes.
Let me know if that doesn't solve your problem, I will be happy to help further.
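For anyone looking for the filtering step: the fix described above was an awk one-liner; here is an equivalent sketch in Python instead, assuming the FastQTL phenotype BED layout (chr, start, end, ID, then one expression column per sample) and placeholder file names.

```python
# Drop phenotype rows whose expression values are identical across all samples
# (e.g. all zeros). File names are placeholders; adjust to your own data.
import csv

with open("phenotypes.bed") as fin, open("phenotypes.filtered.bed", "w", newline="") as fout:
    reader = csv.reader(fin, delimiter="\t")
    writer = csv.writer(fout, delimiter="\t")
    for row in reader:
        if row[0].startswith("#"):           # keep the header line(s)
            writer.writerow(row)
            continue
        values = row[4:]                     # per-sample expression values
        if len(set(values)) > 1:             # keep only rows with some variability
            writer.writerow(row)
```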
This solution is really helpful. I had been running on filtered data without problems, but when I used all the coding genes I got this error, so I think it has something to do with lowly expressed genes.
However, maybe you could also try the nominal option instead of the permutation pass.
Dear,
FastQTL (permutation pass) is displaying the following error message:

gsl: beta.c:44: ERROR: domain error
Default GSL error handler invoked.
Aborted
Any help will be appreciated.