Question

Can you not phase exome data?

0

2.3 years ago

Joel Wallenius ▴ 210

Hello!

I'm just getting acquainted with software tools that phase variant calls. I tried SHAPEIT2, and am planning to try Beagle and SHAPEIT4. I'm working with VCF files made from WES data.

My trial run with SHAPEIT2 gave random results, no confidence in the haplotyping whatsoever. Is the data bad, or is SHAPEIT2 bad?

I used bcftools to merge 21 unrelated VCF files into a single VCF, then removed multi-allelic sites and replaced missing genotypes with 0/0. Then I ran SHAPEIT2 with the chr15 genetic map found in this archive:

https://ftp.ncbi.nlm.nih.gov/hapmap/recombination/2011-01_phaseII_B37/

Did I do anything wrong, or is the data just bad? Or is SHAPEIT2 bad?

Very grateful for help here, I'm new to the phasing scene. :-]

Joel

phasing shapeit wes exome beagle • 1.5k views

ADD COMMENT • link updated 2.3 years ago by LauferVA 4.5k • written 2.3 years ago by Joel Wallenius ▴ 210

1

2.3 years ago

4galaxy77 2.9k

Yes, it is possible if you have some array data as well:

We used SNP array and exome sequencing data from the UK Biobank on 454,378 individuals. For SNP array data, we excluded variants that were not used during a previous round of phasing, resulting in 670,423 SNP array sites. For exome sequencing data, we excluded variants that had an MAC of one or that were flagged as potentially having low quality by the machine learning approach described above, resulting in 15,845,171 exome variants. We then phased these array and exome datasets as follows. First, we built a haplotype scaffold by phasing SNP array data with SHAPEIT4.2.0 (ref. 47), phasing whole chromosomes at a time. We then phased the exome sequencing data onto the array scaffold in chunks of 10,000 variants, using 500 SNPs from the array data as a buffer at the beginning and end of each chunk. A consequence of this process is that when a variant appears in both the array and exome datasets, it is the data from the array dataset that are used.

Ref
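To make the chunking scheme in that quote concrete, here is a minimal Python sketch of how exome variants could be split into chunks, each padded with array SNPs on both sides. This is only one reading of the quoted description, not the authors' pipeline: the function name `scaffold_chunks`, its parameters, and the choice to also include the array SNPs that fall inside each chunk are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right

def scaffold_chunks(exome_pos, array_pos, chunk_size=10_000, buffer_snps=500):
    """Yield (exome_chunk, array_window) pairs for one chromosome.

    Both inputs are sorted lists of base-pair positions (hypothetical inputs).
    The array window holds the array SNPs inside the chunk plus up to
    `buffer_snps` array SNPs before and after it.
    """
    for start in range(0, len(exome_pos), chunk_size):
        chunk = exome_pos[start:start + chunk_size]
        lo = bisect_left(array_pos, chunk[0])    # first array SNP at or after chunk start
        hi = bisect_right(array_pos, chunk[-1])  # one past last array SNP at or before chunk end
        window = array_pos[max(0, lo - buffer_snps):hi + buffer_snps]
        yield chunk, window

# Toy example: 100 exome variants, array SNPs every 10 bp, tiny chunk and buffer sizes.
exome = list(range(100, 200))
array = list(range(0, 300, 10))
for chunk, window in scaffold_chunks(exome, array, chunk_size=50, buffer_snps=2):
    print(chunk[0], chunk[-1], window[0], window[-1])
```

Each chunk would then be handed to the phasing tool together with the scaffold haplotypes for its window; that step is tool-specific and is not shown here.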

ADD COMMENT • link 2.3 years ago by 4galaxy77 2.9k

0

We absolutely do not have microarray data... on the other hand, they don't claim in that paragraph that phasing WES data is impossible without microarray data. Perhaps it's implied.

Thank you for replying! What would you do in my stead?

ADD REPLY • link 2.3 years ago by Joel Wallenius ▴ 210

0

see case 1 above and hope !

ADD REPLY • link 2.3 years ago by LauferVA 4.5k

score 4 · Accepted Answer · 2022-08-23

Background:

The law of independent assortment (LoIA) states that heritable "things" leading to phenotypes segregate more or less randomly over time. Deviations from the LoIA exist and can be quantified (r², D'); this is referred to as linkage disequilibrium (LD).

Linkage disequilibrium, then, quantifies how often two "things" (could be SNVs; to simplify let's just say, SNVs from here on out) are found together. If they are found together more often than you would expect by chance alone, they are said to be "in LD with one another."
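As a rough illustration of how r² and D' are computed, here is a small Python sketch that works from counts of the four possible two-SNV haplotypes. The function name `ld_stats` and the count-based input are assumptions for the example; in practice you would get these numbers from phased reference haplotypes.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, |D'|, r2) from counts of the four two-locus haplotypes.

    A/a are the alleles at the first SNV, B/b the alleles at the second.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n   # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / n   # frequency of allele B at locus 2

    # D: how far the observed AB frequency is from independence.
    D = p_AB - p_A * p_B

    # D' rescales D by the largest value it could take given the allele frequencies.
    if D >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(D) / d_max if d_max > 0 else 0.0

    # r2: squared correlation between the two loci.
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = D * D / denom if denom > 0 else 0.0
    return D, d_prime, r2

# A and B always travel together: perfect LD.
print(ld_stats(50, 0, 0, 50))
# All four haplotypes equally common: no LD.
print(ld_stats(25, 25, 25, 25))
```

Note that D' can be 1 while r² is well below 1, for example when one of the four haplotypes is simply absent but the allele frequencies differ.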

Genetic recombination happens roughly once per chromosome per generation (a coarse approximation). Because chromosomes are tens to low hundreds of megabases in length, genetic variants close to one another may not be separated by recombination for long periods of time (on average).

Exons are typically not very long: tens to low hundreds of amino acids. They also tend to be fairly strongly conserved, to have good nucleotide variability, and to be fairly well annotated. As a result, most genetic variants within a single exon are in tight LD with other nearby coding variants.

Short-read sequencing generates reads on the order of 100-500 base pairs. Phasing algorithms adapted for some types of short-read NGS data (see supplemental note, below) leverage the fact that reads begin in different places to chain them together into a phased haplotype. However, considering the length of an exon and the length of a read, for two variants on the same exon there's a good chance a single read will span both variants. This leads us to case 0, which is an empty or trivial case:

Case 0: (trivial case) phasing within an exon:

For variants within one and the same exon, you do not need a phasing algorithm. You can just look at the individual reads by going back to the .bam, or even the FASTQ, file. A well-written script for doing this would likely be much faster than, and as accurate as, a very sophisticated phasing algorithm like Beagle or SHAPEIT or what have you.
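A minimal sketch of what such a script could look like, in plain Python. To keep it self-contained, each read is represented as a dict mapping reference position to the observed base; that representation, the function names, and the `min_reads` threshold are all assumptions for illustration. A real script would fill these dicts from the BAM with an alignment library.

```python
from collections import Counter

def phase_from_reads(reads, pos1, pos2):
    """Count allele pairs seen on reads that cover both sites.

    `reads` is an iterable of dicts mapping reference position -> observed base
    (a simplified, hypothetical stand-in for parsed alignments).
    """
    pairs = Counter()
    for read in reads:
        if pos1 in read and pos2 in read:
            pairs[(read[pos1], read[pos2])] += 1
    return pairs

def call_haplotypes(pairs, min_reads=2):
    """Return the two best-supported allele pairs, or None if support is too thin."""
    top = [(pair, n) for pair, n in pairs.most_common(2) if n >= min_reads]
    if len(top) < 2:
        return None
    return [pair for pair, _ in top]

# Toy data: two heterozygous sites at positions 100 and 150 in the same exon.
reads = (
    [{100: "A", 150: "G"}] * 3    # reads supporting haplotype A-G
    + [{100: "C", 150: "T"}] * 3  # reads supporting haplotype C-T
    + [{100: "A"}]                # covers only one site, ignored
    + [{100: "A", 150: "T"}]      # a single discordant read (likely an error)
)
pairs = phase_from_reads(reads, 100, 150)
print(pairs)
print(call_haplotypes(pairs))
```

The threshold on supporting reads is there so that a lone discordant read, which is usually a sequencing or mapping error, does not get reported as a haplotype.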

Case 1: (meaningful case, somewhat do-able) phasing across nearby exons:

Because of the recombination frequency (above), LD information can be used, in some cases, to establish that exonic variants are in phase with other variants with a known probability (which is quantified by LD). The dream scenario (not uncommon) is that two variants are in "perfect LD", meaning that if rs101 is found, rs102 is always found, and vice versa.

Suppose rs101 is in exon 1 and rs102 is in exon 2 of the WAJ gene (new gene; named after Wallenius, J et al.). Approaches that rely on reads spanning both variants will not be able to phase them, because there are no supporting reads bridging the intervening intron that allow direct confirmation that the two variants are on one and the same strand.

That notwithstanding, if they are in perfect LD across all global ancestries, it is highly likely they are in phase in your sample as well. This kind of information could be used to phase variants in exome-sequencing data even if you have no direct experimental evidence that they are found in phase.

A benefit of doing this is that you would not have to generate additional data (like the microarrays mentioned in the other answer, or long-read, so-called 3rd gen, sequencing data).
A drawback of this approach is that you have no direct experimental evidence to prove for certain that those variants are found in phase in your samples. However, you would be able to estimate the probability that you are in error ...
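One way to put a number on that error probability, sketched in Python: for a sample that is heterozygous at both variants, there are only two possible arrangements, AB/ab (cis) or Ab/aB (trans). Given population haplotype frequencies and assuming random mating, the chance of each arrangement is proportional to the product of its two haplotype frequencies. The function name and the example frequencies are made up for illustration.

```python
def cis_probability(h_AB, h_Ab, h_aB, h_ab):
    """P(phase is AB/ab rather than Ab/aB) for a double heterozygote.

    Inputs are population frequencies of the four two-SNV haplotypes,
    assumed to come from an ancestry-matched reference panel.
    Random mating (Hardy-Weinberg proportions) is assumed.
    """
    cis = h_AB * h_ab      # relative weight of the AB/ab diplotype
    trans = h_Ab * h_aB    # relative weight of the Ab/aB diplotype
    total = cis + trans
    if total == 0:
        raise ValueError("a double heterozygote is impossible with these frequencies")
    return cis / total

# Strong LD: recombinant haplotypes are rare, so cis is very likely.
p = cis_probability(0.45, 0.05, 0.05, 0.45)
print(f"P(cis) = {p:.4f}, P(error if we call cis) = {1 - p:.4f}")
```

With perfect LD (both recombinant haplotypes absent from the panel) this returns exactly 1, which is only as trustworthy as the panel is large and well matched to your samples.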

This kind of logic would work for exons or even genes close enough to one another ... e.g. if you were studying CD28 and CTLA4, for instance, that might be enough ... but depending on your research question(s) it might not be. To go beyond this, you would need to have access to some other form of data (e.g. DNA microarray, WGS, etc., as mentioned by 4galaxy77).

Edit - direct response to comment: There are potentially a few things that could constitute strong evidence, depending on your exact goals...

- First, check if there are recombination hotspots in the intervening space. If there are, that may not totally cripple the approach, but it will hurt.
- Second, particularly if there are none, you do not need microarray data. 5 kb is definitely in the range of "close enough to potentially have significant LD" to use population data (rather than chaining NGS reads together).
- If you are interested primarily in pairs or small groups of variants, the easiest way to go would be to use a tool like PLINK to calculate pairwise LD estimates in ancestry-matched individuals. Generating pairwise LD between all variants in both exons may help you build your case, depending ...
- You may be able to get access to either haplotype estimates or 3rd gen sequencing data (with concomitant 2nd gen data for error correction) from an online database. Provided that those data are stratified by ethnicity and come from the same disease/phenotype, you may also be able to make quite plausible claims "to investigate the possibility that ... "
- Caveats: you did not tell us whether you were studying somatic or de novo variation. In that case ... I would probably pursue other research questions.
- Parenthetical comment: if you are interested in this kind of thing, removing multiallelic sites may directly decrease your statistical power, depending on the reference data you end up using.
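If you only have unphased genotypes for the two variants, a quick way to see what a pairwise LD estimate looks like is the squared correlation of alt-allele counts across individuals. The sketch below is plain Python and is not a substitute for a dedicated tool; the function name and the use of `None` for missing calls are assumptions for the example.

```python
def genotype_r2(g1, g2):
    """Squared Pearson correlation of alt-allele counts (0, 1 or 2) at two variants.

    g1 and g2 are per-individual genotype lists in the same sample order.
    Missing calls are given as None and skipped, not treated as 0.
    Returns None if fewer than two usable samples or if a variant is monomorphic.
    """
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean1 = sum(a for a, _ in pairs) / n
    mean2 = sum(b for _, b in pairs) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in pairs)
    var1 = sum((a - mean1) ** 2 for a, _ in pairs)
    var2 = sum((b - mean2) ** 2 for _, b in pairs)
    if var1 == 0 or var2 == 0:
        return None
    return cov * cov / (var1 * var2)

# Two variants whose genotypes track each other perfectly across six people,
# with one missing call that is simply dropped from the calculation.
snv1 = [0, 1, 2, 1, 0, None]
snv2 = [0, 1, 2, 1, 0, 2]
print(genotype_r2(snv1, snv2))
```

Skipping missing calls matters here: filling them in as homozygous reference shifts the allele counts and can distort the estimate, especially with a small number of samples.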

Case 2: Somatic variation:

It is not exactly correct to say that somatic variants cannot be phased with germline variants using standard (NGS) assays. It's rather that they can't always be. In fact, a somatic variant will have a D' of 1.0 with nearby germline variants at the time it arises, if one can restrict to tumor cells. The problem is that this can't be distinguished using population data... Here, my suspicion is that analysis of the clonal structure of other daughter cells versus clones not containing the variant (overlapping amplification and deletion events would be particularly useful) could allow phasing of some somatic variants with high accuracy. I'm not sure if there is published work on this or not.

Supplemental Note: There are two types of short-read sequencing (NGS) that have held an appreciable portion of market share: amplicon-based and hybrid capture. It's definitely worth knowing a lot about each; they have different purposes, uses, etc., but it's slightly beyond scope, so I will just leave a few links. Some phasing algos work by leveraging the fact that genomic reads do not always start in the same place, as can be seen here.