Background:
The law of independent assortment
states that heritable "things" leading to phenotypes segregate more or less randomly over time. Deviations from the LoIA exist and can be quantified (r2, D'); this is referred to as linkage disequilibrium.
Linkage disequilibrium,
then, quantifies how often two "things" (could be SNVs; to simplify let's just say, SNVs from here on out) are found together. If they are found together more often than you would expect by chance alone, they are said to be "in LD with one another."
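To make "more often than you would expect by chance" concrete, here is a minimal sketch of the usual pairwise statistics (D, D' and r2) computed from haplotype and allele frequencies. The function name and argument names are made up for illustration, and it assumes two biallelic SNVs whose haplotype frequency is already known (i.e. phased reference data).

```python
def ld_stats(p_both, p_snv1, p_snv2):
    """Return (D, D', r2) for two biallelic SNVs.

    p_both : frequency of haplotypes carrying the alternate allele at both SNVs
    p_snv1 : alternate-allele frequency at the first SNV
    p_snv2 : alternate-allele frequency at the second SNV
    """
    # D is observed co-occurrence minus what independence would predict.
    d = p_both - p_snv1 * p_snv2

    # D' rescales |D| by the largest value it could take at these allele frequencies.
    if d >= 0:
        d_max = min(p_snv1 * (1 - p_snv2), (1 - p_snv1) * p_snv2)
    else:
        d_max = min(p_snv1 * p_snv2, (1 - p_snv1) * (1 - p_snv2))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # r2 is the squared correlation between the two alleles.
    denom = p_snv1 * (1 - p_snv1) * p_snv2 * (1 - p_snv2)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Two SNVs that always travel together (both at 20% frequency):
d, d_prime, r2 = ld_stats(p_both=0.2, p_snv1=0.2, p_snv2=0.2)
```

With these toy numbers D' and r2 both come out at (numerically) 1, which is the "perfect LD" situation; if the SNVs were independent, p_both would equal 0.2 * 0.2 and D would be about 0.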
Genetic Recombination
happens on average once per chromosome per generation (approximate). Because chromosomes are low 10M to low 100M bps in length, this means that genetic variants nearby to one another may not segregate apart from one another for long periods of time (on average).
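A rough back-of-the-envelope sketch of why nearby variants stay together: it assumes a genome-wide average rate of about 1 cM per Mb (about 1e-8 crossovers per bp per generation), which is my assumption here and not a number for any particular region; real rates vary a lot, especially around hotspots. The function name and the example distance are illustrative only.

```python
def prob_no_recombination(distance_bp, generations, rate_per_bp=1e-8):
    """Approximate chance that two sites are never separated by a crossover.

    rate_per_bp is an ASSUMED genome-wide average (about 1 cM per Mb);
    the linear approximation below is only reasonable for short distances.
    """
    # Per-generation probability of a crossover between the two sites.
    r = distance_bp * rate_per_bp
    # Independent generations: survive all of them without a crossover.
    return (1.0 - r) ** generations


# Two variants 5 kb apart, followed for 1,000 generations:
p = prob_no_recombination(5_000, 1_000)
```

Under that assumed rate, p is roughly 0.95, i.e. the two sites would most likely still sit on the same ancestral haplotype.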
Exons
are typically not very long; 10s to low 100s of amino acids (AA), i.e. roughly 30 to a few hundred bp. They also tend to be fairly strongly conserved, to have good nucleotide variability, and to be fairly well-annotated. As a result, most genetic variants within a single exon are in tight LD with other nearby coding variants.
Short-Read sequencing
generates reads on the order of 100-500 base pairs. Phasing algorithms adapted for some types of short-read NGS data (see supplemental note, below) leverage the fact that reads begin in different places to chain them together into a phased haplotype. However, considering the length of an exon and the length of a read, there's a good chance that a single read will span two variants on the same exon. This leads us to case 0, which is an empty or trivial case:
Case 0: (trivial case) Phasing within an exon
Therefore, for variants within one and the same exon, you do not need a phasing algorithm: you can just look at the individual reads by going back to the .bam, or even the fastq file. A well-written script for doing this would likely be much faster than, and as accurate as, a very sophisticated phasing algorithm like BEAGLE or SHAPEIT or what have you.
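As a sketch of what such a script boils down to: for every read that covers both variant positions, record which pair of alleles it carries. To keep this self-contained, the reads below are toy (start, sequence) tuples and the alignments are assumed to be ungapped; a real script would parse the BAM (for example with pysam) and would have to handle indels, soft-clipping, base qualities and mate pairs. All names and positions here are made up.

```python
from collections import Counter


def count_haplotypes(reads, pos1, pos2):
    """Count allele pairs seen on single reads that cover both positions.

    reads : iterable of (start, sequence), 0-based reference start,
            ASSUMED to be ungapped alignments (no indels, no clipping).
    """
    counts = Counter()
    for start, seq in reads:
        end = start + len(seq)  # half-open interval [start, end)
        if start <= pos1 < end and start <= pos2 < end:
            counts[(seq[pos1 - start], seq[pos2 - start])] += 1
    return counts


# Toy reads around two heterozygous sites at positions 103 and 107.
reads = [
    (100, "ACGTACGTAC"),  # carries T at 103 and T at 107
    (101, "CGGACGCAC"),   # carries G at 103 and C at 107
    (99, "TACGTACGTA"),   # carries T at 103 and T at 107
    (102, "GGACGCA"),     # carries G at 103 and C at 107
    (104, "ACGTAC"),      # does not cover 103, so it is ignored
]
counts = count_haplotypes(reads, 103, 107)
```

Here every spanning read shows either T with T or G with C, and never a mixed pair, so the two haplotypes can be read straight off the counts. A few discordant reads in real data would usually point to sequencing or mapping errors.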
Case 1: (meaningful case, somewhat do-able) phasing across nearby exons:
Because of Genetic Recombination
frequency (above), LD information can be used, in some cases, to establish that exonic variants are in phase with other variants with a known probability (which is quantified by LD). The dream scenario (not uncommon) is that two variants are in "perfect LD", meaning that if rs101 is found, rs102 is always found, and vice versa.
Suppose rs101 is in exon 1 and rs102 is in exon 2 of the WAJ gene (new gene; named after Wallenius, J et al.). Phasing approaches that rely on chaining reads together will not be able to phase them, because there are no supporting reads bridging the intervening intron that allow direct confirmation that the two variants are on one and the same strand.
That notwithstanding, if they are in perfect LD across all global ancestries, it is highly likely they are in phase in your sample as well. This kind of information could be used to phase variants in exome-sequencing data even if you have no direct experimental evidence that they are found in phase.
- A benefit of doing this is that you would not have to generate additional data (like the microarrays mentioned in the other answer, below, or long-read, so-called 3rd gen sequencing, data).
- A drawback of this approach is that you have no direct experimental evidence to prove for certain those variants are found in phase in your samples. However, you would be able to estimate the probability you are in error ...
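One way to put a number on that error probability, as a sketch: for a sample that is heterozygous at both variants, compare the two possible arrangements (both alternate alleles on the same haplotype versus on opposite haplotypes) using haplotype frequencies from an ancestry-matched reference panel. This assumes random mating and that the panel frequencies apply to your sample; the function name and the frequencies below are invented for illustration.

```python
def prob_cis(h_alt_alt, h_alt_ref, h_ref_alt, h_ref_ref):
    """P(both alternate alleles are on the same haplotype | double heterozygote).

    Arguments are population haplotype frequencies for the two SNVs
    (allele at SNV 1, allele at SNV 2). Assumes random mating.
    """
    cis = h_alt_alt * h_ref_ref    # genotype made of alt-alt + ref-ref haplotypes
    trans = h_alt_ref * h_ref_alt  # genotype made of alt-ref + ref-alt haplotypes
    if cis + trans == 0:
        raise ValueError("a double heterozygote is impossible with these frequencies")
    return cis / (cis + trans)


# Invented panel frequencies with strong but imperfect LD:
p_cis = prob_cis(0.18, 0.02, 0.02, 0.78)
p_error = 1 - p_cis  # chance that calling the variants "in phase" is wrong
```

With these made-up frequencies p_cis is about 0.997. In perfect LD the two mixed haplotypes have frequency 0 and p_cis is exactly 1; with no LD and equal allele frequencies of 0.5 it drops to 0.5, i.e. a coin flip.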
This kind of logic would work for exons or even genes nearby enough to one another ... e.g. if you were studying CD28 and CTLA4, for instance, that might be enough ... but depending on your research question(s) might not be. To go beyond this, you would need to have access to some other form of data (e.g. DNA microarray, WGS, etc. as mentioned by galaxy).
Edit - direct response to comment:
There are potentially a few things that could constitute strong evidence, depending on your exact goals...
- First, check if there are recombination hotspots in the intervening space. May not totally cripple the approach, but will hurt, if there are.
- Second, in particular if not, you do not need microarray data. 5kb is definitely in the range of "close enough to potentially have significant enough LD" to use population data (rather than chaining NGS reads together).
- If you are interested primarily in pairs or small groups of variants, the easiest way to go would be to use a tool like
Plink
to calculate pairwise LD estimates in ancestry-matched individuals. Generating pairwise LD between all variants in both exons may help you build your case, depending ...
- You may be able to get access to either haplotype estimates or to 3rd gen sequencing (with concomitant 2nd gen for error correction) from an online database. Provided that those data are stratified by ethnicity and come from the same disease/phenotype, you may also be able to make quite plausible claims "to investigate the possibility that ... "
- Caveats: you did not tell us if you were studying
somatic
or de novo
variation. In this case ... I would probably pursue other research questions.
- Parenthetical comment: if you are interested in this kind of thing, removing multiallelic sites may directly decrease your statistical power, depending on the reference data you end up using.
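For the pairwise LD estimates suggested in the list above, a dedicated tool is the practical route, but the underlying idea can be sketched in a few lines: when only unphased genotypes are available, r2 can be approximated as the squared Pearson correlation between alternate-allele counts (0, 1 or 2) across individuals. This is an illustration of the idea, not Plink's implementation, and the genotype vectors are invented.

```python
def genotype_r2(g1, g2):
    """Squared Pearson correlation between two genotype dosage vectors.

    g1, g2 : per-individual alternate-allele counts (0, 1, 2) or None if missing.
    Individuals missing at either site are dropped.
    """
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    n = len(pairs)
    if n == 0:
        raise ValueError("no individuals genotyped at both sites")
    mean1 = sum(a for a, _ in pairs) / n
    mean2 = sum(b for _, b in pairs) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in pairs)
    var1 = sum((a - mean1) ** 2 for a, _ in pairs)
    var2 = sum((b - mean2) ** 2 for _, b in pairs)
    if var1 == 0 or var2 == 0:
        return float("nan")  # a monomorphic site carries no LD information
    return cov * cov / (var1 * var2)


# Invented genotypes for six ancestry-matched individuals at two exonic SNVs:
snv_exon1 = [0, 1, 2, 0, 1, 0]
snv_exon2 = [0, 1, 2, 0, 1, None]  # last individual not genotyped here
r2 = genotype_r2(snv_exon1, snv_exon2)
```

Running this over every pair of variants in the two exons gives the kind of pairwise table that helps show whether the region behaves as a single LD block.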
Case 2: Somatic variation
It is not exactly correct to say that somatic variants cannot be phased with germline variants using standard (NGS) assays. It's rather that they can't always be. In fact, a somatic variant will have a D' of 1.0 with nearby germline variants at the time it arises, if one can restrict to tumor cells. The problem is rather that this can't be distinguished using population data... My suspicion, though, is that analysis of the clonal structure of other daughter cells versus clones not containing the variant (overlapping amplification and deletion events would be particularly useful) could allow phasing of some somatic variants with high accuracy. I'm not sure if there is published work on this or not.
Supplemental Note: There are two types of short-read sequencing (NGS) that have held an appreciable portion of market share: amplicon-based and hybrid capture. It's definitely worth knowing a lot about each; they have different purposes, uses, etc., but it's slightly beyond scope so I will just leave. a. few. links. Some phasing algos work by leveraging the fact that genomic reads do not always start in the same place.
Hi Vincent, thank you so much for your reply!
I think I understand your reasoning, however, if I understand it correctly, phasing variants between exons, in eukaryotes, still requires more than just IGV. Our data is Illumina so just 150bp long paired-end reads. At best they cover some hundreds of bps.
Specifically the variants I'm interested in are ~5 kbp apart.
What would you do in my stead?
See Edits, in particular edit to case 1 above.