Hi there,
I am running QC on GWAS data from 48 samples that we plan to use for linkage analysis. I have run the --het command in PLINK to check for excess heterozygosity and/or consanguinity.
For every sample I am getting a negative F statistic, indicating more heterozygosity than expected. But I'm wondering: how much heterozygosity is too much? Is there a threshold that would indicate sample contamination, or is it based solely on the distribution amongst the 48 samples?
Here is an example of the --het results file for a subset of the samples:
IID   O(HOM)   E(HOM)   N(NM)    F
  1   292543   295900   388776   -0.03604
  2   299349   302200   396946   -0.03016
  3   298893   302000   396663   -0.03272
  4   299188   302600   397491   -0.03591
  5   298827   302200   396937   -0.0354
  6   274894   283200   372565   -0.09318
  7   298750   302500   397353   -0.03951
  8   298737   302300   397082   -0.03761
  9   299640   302600   397511   -0.03138
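For reference, my understanding is that PLINK computes F here as (O(HOM) - E(HOM)) / (N(NM) - E(HOM)). Below is a minimal Python sketch of recomputing it from the columns above and looking at the spread across samples; note the E(HOM) values are rounded in the printed output, so the recomputed F differs slightly from PLINK's.

# Minimal sketch: recompute F from the .het columns above and summarise the spread.
import statistics

# (IID, O(HOM), E(HOM), N(NM)) for a few of the rows shown above
rows = [
    (1, 292543, 295900, 388776),
    (2, 299349, 302200, 396946),
    (6, 274894, 283200, 372565),
]

f_vals = []
for iid, o_hom, e_hom, n_nm in rows:
    f = (o_hom - e_hom) / (n_nm - e_hom)  # method-of-moments estimator reported by --het
    f_vals.append(f)
    print(f"IID {iid}: F = {f:.5f}")

print(f"mean F = {statistics.mean(f_vals):.5f}, SD = {statistics.pstdev(f_vals):.5f}")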
Any help would be greatly appreciated!
Thanks,
Caragh
Thanks for your help, Ram!
According to this link:
"The estimate of F can sometimes be negative. Often this will just reflect random sampling error, but a result that is strongly negative (i.e. an individual has fewer homozygotes than one would expect by chance at the genome-wide level) can reflect other factors, e.g. sample contamination events perhaps."
Ram, when you said these negative values are "negligible", I assume you meant that those individuals would not need to be removed.
However, I feel it would be helpful if you could provide a more objective definition of "negligible" (i.e. actual numbers). At what value do these negative scores stop being "negligible" (e.g. -0.5, -1.8, or something else) and warrant removal?
That's a wonderful question, and unfortunately I do not have a strong reason for picking a definite threshold. Part of it comes from what we saw in the majority of the samples that were sequenced, and part of it was simply practice that was carried over. We used a threshold of around -0.2. Samples with a lot of contamination would drop out at other steps of the pipeline anyway, and by the time we got to measuring the F statistic one would rarely see values cross -0.25: the pipeline discarded samples at earlier stages (sample prep, sequencing), and other metrics were also applied before the F-statistic QC stage. For example, we would filter down to a high-quality set of variants that contributed to the F statistic (and other such statistics), so any underlying condition affecting a sample would already show up strongly in what we saw. All said and done, it was still subjective to an extent, in that it worked for our consortium.
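If it helps, here is a rough Python sketch of how one could combine a fixed cutoff with a within-cohort check on the .het output. The file name plink.het, the -0.2 cutoff and the 3-SD rule are just the conventions discussed above (and an assumption about your file layout), not a definitive recipe.

# Rough sketch: flag samples by a fixed F cutoff and/or as outliers within the cohort.
import statistics

samples = []
with open("plink.het") as fh:                      # output of plink --het
    header = fh.readline().split()
    iid_col, f_col = header.index("IID"), header.index("F")
    for line in fh:
        fields = line.split()
        samples.append((fields[iid_col], float(fields[f_col])))

f_vals = [f for _, f in samples]
mean_f, sd_f = statistics.mean(f_vals), statistics.stdev(f_vals)

for iid, f in samples:
    fixed_flag = f < -0.2                          # fixed cutoff (our practice)
    dist_flag = abs(f - mean_f) > 3 * sd_f         # outlier relative to this cohort
    if fixed_flag or dist_flag:
        print(iid, f,
              "below fixed cutoff" if fixed_flag else "",
              ">3 SD from cohort mean" if dist_flag else "")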
Thank you for that detailed response.
It makes sense that those samples suffering from contamination are dropping out in other (upstream) steps of the QC pipeline.
Furthermore, certain things we do in bioinformatics analysis are often subjective in the sense that they are practices carried over (within a lab/consortium/sub-field) rather than objective, benchmarked standards, but they work in practice.