How to determine intra-run and inter-run repeatability in statistically sound manner?
23 days ago

Let’s say I have VCF files with germline variants from samples A and B.

Run1: A_rep1, A_rep2, A_rep3, B_rep1, B_rep2, B_rep3,

Run2: A_rep4, A_rep5, A_rep6, B_rep4, B_rep5, B_rep6

I don’t have a “truth” set. Let’s just assume these are random whole-genome samples from non-diseased patients.

For analytical performance I want to determine the variability within the run and between the runs per sample.

Does it make sense to compute the full pairwise comparison table and then average precision, recall, etc. over it?

Intra-run example (Run1):

Comparisons

A_rep1 > A_rep2

A_rep1 > A_rep3

A_rep2 > A_rep3

But I would also complete the table with the reverse comparisons:

A_rep2 > A_rep1

A_rep3 > A_rep1

A_rep3 > A_rep2

Essentially, since we don’t know which discordant calls are real variants and which are false positives (FP), every replicate has to be treated equally, once as the query and once as the reference. Averaging over the full table then evens out the signal, and the average precision will equal the average recall.
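To make the "complete the table" idea concrete, here is a minimal sketch. The call sets are made-up toy data keyed by (chrom, pos, ref, alt), and `precision_recall` is a hypothetical helper, not part of any VCF tool. It illustrates why the two averages coincide: the precision of X against Y is the recall of Y against X, so the full ordered table contains the same values in both columns.

```python
from itertools import permutations

# Toy call sets (made-up values), one set of variant keys per replicate.
calls = {
    "A_rep1": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")},
    "A_rep2": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 75, "T", "C")},
    "A_rep3": {("chr1", 100, "A", "G"), ("chr2", 50, "G", "A"),
               ("chr2", 75, "T", "C"), ("chr3", 10, "A", "T")},
}

def precision_recall(query, reference):
    """Score `query` against `reference`, treating the reference as provisional truth."""
    tp = len(query & reference)
    precision = tp / len(query) if query else 0.0
    recall = tp / len(reference) if reference else 0.0
    return precision, recall

# Full ordered table: each pair appears twice, with the roles swapped.
table = {
    (q, r): precision_recall(calls[q], calls[r])
    for q, r in permutations(calls, 2)
}

mean_precision = sum(p for p, _ in table.values()) / len(table)
mean_recall = sum(r for _, r in table.values()) / len(table)

for (q, r), (p, rec) in table.items():
    print(f"{q} > {r}: precision={p:.3f} recall={rec:.3f}")
print(f"mean precision={mean_precision:.3f} mean recall={mean_recall:.3f}")
```

Because the two means are identical by construction, the completed table really reports a single concordance number per group of replicates, not two independent metrics.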

But I don’t really know if this makes sense or how else to approach the problem given no ground truth.

Other alternatives:

· Take the positives shared across all three replicates as “the truth” and score each replicate against that set. One bad replicate will throw all the rest off, but maybe that’s the point of the measurement, and maybe this conservative approach is more appropriate?

· Jaccard index
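Both alternatives can be sketched in a few lines. This uses the same kind of made-up toy call sets. The majority-vote set at the end is not something proposed above; it is added only as an assumption-labeled variant to show how a less strict consensus reacts to one bad replicate.

```python
from collections import Counter
from itertools import combinations

# Toy call sets (made-up values), one set of variant keys per replicate.
calls = {
    "A_rep1": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")},
    "A_rep2": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 75, "T", "C")},
    "A_rep3": {("chr1", 100, "A", "G"), ("chr2", 50, "G", "A"),
               ("chr2", 75, "T", "C"), ("chr3", 10, "A", "T")},
}

def jaccard(a, b):
    """Jaccard index: shared calls divided by all calls seen in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Symmetric measure, so unordered pairs are enough.
pairwise_jaccard = {
    (x, y): jaccard(calls[x], calls[y])
    for x, y in combinations(sorted(calls), 2)
}

# Strict consensus: calls present in every replicate.
consensus = set.intersection(*calls.values())
# Recall against a strict consensus is 1.0 by construction, so only precision varies.
precision_vs_consensus = {
    name: len(consensus) / len(variants) for name, variants in calls.items()
}

# Assumption (not in the post): majority vote, i.e. calls seen in at least 2 of 3 replicates.
counts = Counter(v for variants in calls.values() for v in variants)
majority = {v for v, n in counts.items() if n >= 2}

print(pairwise_jaccard)
print(precision_vs_consensus)
print(len(consensus), len(majority))
```

The strict intersection shrinks as soon as any single replicate drops a call, which is the "one bad sample" sensitivity described above; the pairwise Jaccard values show which replicate is responsible.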

Any thoughts/suggestions from people experienced in this would be greatly appreciated.

Thanks.

22 days ago
LChart 4.7k

A couple preliminary notes:

  1. If this is human, you have somewhat more information than just samples A and B: you also have dbSNP, large genotype panels (HRC or 1000G), and the Genome in a Bottle (GIAB) regions.

  2. Most of your discordances in variant calls and genotype assignments will be driven by coverage, so you can ask a surrogate question: are differences in coverage independent of run and replicate, or do different runs produce different coverage profiles, particularly in regions of low coverage?
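As a rough illustration of the coverage question in note 2, the sketch below correlates per-bin depth profiles and compares within-run pairs to between-run pairs. The depth values are invented, the bin layout is an assumption, and Pearson correlation is only one of several reasonable similarity measures here.

```python
from itertools import combinations
from math import sqrt

# Hypothetical mean depth per genomic bin, keyed by (run, replicate). Values are made up.
depth = {
    ("Run1", "A_rep1"): [30, 32, 12, 31, 29, 8],
    ("Run1", "A_rep2"): [31, 30, 11, 33, 28, 9],
    ("Run2", "A_rep4"): [29, 15, 30, 32, 10, 30],
    ("Run2", "A_rep5"): [30, 14, 31, 30, 11, 29],
}

def pearson(x, y):
    """Pearson correlation of two equal-length depth profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

within, between = [], []
for (k1, v1), (k2, v2) in combinations(depth.items(), 2):
    r = pearson(v1, v2)
    # Same run label means a within-run pair, otherwise a between-run pair.
    (within if k1[0] == k2[0] else between).append(r)

mean_within = sum(within) / len(within)
mean_between = sum(between) / len(between)
print(f"within-run r={mean_within:.3f}, between-run r={mean_between:.3f}")
```

If the between-run correlation is clearly lower than the within-run correlation, the runs have different coverage profiles, and the low-coverage bins are the first place to look for discordant calls.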

Now, even without ground truth you can always ask questions about variability. You have three pseudo-variances: biological variance (between samples), inter-run variance, and assay variance (between replicates within a run). They are "pseudo" because the outcome is discrete: called/not called, or a genotype of 0/0, 0/1, or 1/1. You can express this as "precision", "recall", or "concordance at variable sites", but the analysis is similar: the question is whether the inter-run variance is larger than the assay variance, and by how much. If the inter-run variance is anything close to the biological variance (between samples), then there are probably serious issues with the assay.
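A descriptive sketch of that comparison, using pairwise discordance (1 minus the Jaccard index) as a stand-in for each pseudo-variance. The call sets are toy data with integer variant IDs, and the three stratum names are labels I made up for illustration. This is not a formal variance-components model, just the pooled averages the comparison rests on.

```python
from itertools import combinations

# Toy call sets keyed by (sample, run, replicate); integers stand in for variants.
calls = {
    ("A", 1, "rep1"): {1, 2, 3, 4, 5},
    ("A", 1, "rep2"): {1, 2, 3, 4, 6},
    ("A", 2, "rep4"): {1, 2, 3, 7, 8},
    ("A", 2, "rep5"): {1, 2, 3, 7, 9},
    ("B", 1, "rep1"): {1, 10, 11, 12, 13},
    ("B", 1, "rep2"): {1, 10, 11, 12, 14},
    ("B", 2, "rep4"): {1, 10, 11, 15, 16},
    ("B", 2, "rep5"): {1, 10, 11, 15, 17},
}

def discordance(a, b):
    """1 - Jaccard index: fraction of calls in either set that are not shared."""
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0

strata = {"within_run": [], "between_run": [], "between_sample": []}
for (k1, v1), (k2, v2) in combinations(calls.items(), 2):
    if k1[0] != k2[0]:
        stratum = "between_sample"   # biological differences dominate
    elif k1[1] != k2[1]:
        stratum = "between_run"      # same sample, different run
    else:
        stratum = "within_run"       # same sample, same run: assay noise only
    strata[stratum].append(discordance(v1, v2))

mean = {name: sum(vals) / len(vals) for name, vals in strata.items()}
for name, value in mean.items():
    print(f"{name}: mean discordance = {value:.3f} over {len(strata[name])} pairs")
```

In this toy data the ordering is within-run < between-run < between-sample, which is the pattern you would hope for. A between-run value approaching the between-sample value would be the warning sign described above.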

As you say, this is "averaging out" the information, but this kind of partial pooling is how variance partitioning actually works. It's the appropriate thing to do.
