What are the advantages in using the T2T data as the reference genome right now? Is it able to provide extra (useful) variant calling information that GRCh38 is not able to?
Onter,
The advantages of (T2T) CHM13v2 over (GRC) GRCh38.p14 are myriad and extend into over a dozen fields. The second question you ask is a minor subset of that first, huge question. To date, even dedicated peer-reviewed manuscripts on this subject (as well as reviews, perspectives and editorials) have struggled to present all the ways in which science, medicine, etc. can benefit from CHM13. I'll outline a brief answer here, and point you to key resources that can help you deepen your knowledge of the area.
N.B. in this answer, I will use the terms 2nd generation sequencing, short-read sequencing and NGS interchangeably, and please note it is this technology that is largely the basis for the most recent GAIB builds e.g. GRCh38.p14. By contrast, it is understood herein that long-read sequencing, 3rd generation sequencing, and CHM13 "go together". The advantages of CHM13v2 over GRCh38.p14 are largely predicated on the differences in technical abilities of these technologies, though certainly by no means limited to them.
First, it is helpful to begin with Eichler, Surti, and Ophoff's 2002 letter to the NHGRI, in which they propose the use of a complete hydatidiform mole (CHM) as a genomic resource. This proposal, written in the wake of the first-generation (Sanger) sequencing based Human Genome Project, outlines several key problems that first- and second-generation sequencing are ultimately unable to address, and that indeed remained problematic throughout the era in which NGS rose to hegemony (for argument's sake, let's say 2002-2021). The 3 arguments below follow the logic of that letter closely:
1) Gap closure in certain regions of the genome. Closing gaps, in particular in low-complexity/highly repetitive regions beyond a certain length, was a key reason for the proposal of CHM. When such a region (for instance, an alpha satellite repeat) exceeds the length of the longest read in an NGS dataset, one loses the ability to span that region, leading to a gap in the reference. In this specific context the gap contains an unknown number of repeats, and any structural polymorphisms within it are hidden as well. Gaps in the reference are "bad" for many other reasons too, but enumerating all of them is prohibitive here. GRCh38.p14, the last in the series to date, still contains several hundred gaps.
2) Gene-rich segments that contribute to human disease. By the time Eichler et al. wrote the above letter, it was already clear that genes relevant to human disease phenotypes (e.g. autism) lay in segmental duplications (SDs) that are (a) gene rich, (b) polymorphic and (c) highly copy number variable. As it turns out, SDs also have very high homology, and are thus susceptible to additional genomic processes, e.g. gene conversion events. They also tend to be long (a single SD might be 150 kb, far longer than the longest read in an NGS dataset). As such, short-read data alone are essentially unable to provide certain kinds of insight into these loci. If these regions were unimportant, perhaps this would not be a big issue. But we now know (see, for instance, Porubsky et al 2022) that tandem blocks of SD repeats display several characteristics that provide fundamental insights into multiple distinct fields in human biology. For instance, these loci are rapidly evolving (having achieved remarkable copy number gain in only the last 200 ky), tend to appear on the edges of large inversion variants, contain genes thought to be involved in speciation of humans from great apes, contribute to certain human phenotypes (like the aforementioned neurocognitive ones), and so on. To complete the argument: these regions cannot be adequately studied using GRCh38.p14.
3) Chromosomal structural polymorphisms. We now know for the first time the degree of SV that previous builds, such as short-read sequencing based GAIB resources, miss. Inversion polymorphisms have to date been poorly ascertained; more than half have been missed. To return to SDs, they too have until now been poorly understood. The study of SV in regions of heterochromatin, like centromeres, telomeric and subtelomeric regions, and the short arms of acrocentric chromosomes (which contain human rDNA), is essentially in its infancy at present, because NGS, and therefore prior GRC assemblies, are almost totally blind to SV in regions of heterochromatin. One dramatic example pertains to rDNA: CHM13v2 contains about 781% more rDNA sequence than GRCh38.p14. Therefore, biologists whose studies of health, disease, genomic fluidity, human evolution, etc., are impacted by ribosomal kinetics, for instance, have a rationale for use of CHM13v2.
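To make the gap problem in point 1) a little more concrete: in a reference FASTA, gaps are conventionally represented as runs of N. The following is only a minimal sketch of how one might list those runs; the function names (read_fasta, find_gaps) and the toy sequences are my own illustrations, not part of any T2T or GRC tooling, and a real analysis would stream a full assembly file rather than an in-memory string.

```python
import re
from io import StringIO


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_gaps(handle, min_len=1):
    """Map each sequence name to a list of (start, end) runs of N.

    Coordinates are 0-based and end-exclusive. min_len is a hypothetical
    knob for ignoring very short runs of ambiguous bases.
    """
    pattern = re.compile(r"[Nn]{%d,}" % min_len)
    return {
        name: [(m.start(), m.end()) for m in pattern.finditer(seq)]
        for name, seq in read_fasta(handle)
    }


# Toy input: one sequence with two gaps, one gapless sequence.
toy = StringIO(">chrToy\nACGTNNNNNACGT\nNNACGTACGT\n>chrComplete\nACGTACGTACGT\n")
gaps = find_gaps(toy)
for name, runs in gaps.items():
    print(name, len(runs), "gap(s)", runs)
```

Run on a gapless assembly, every sequence would simply report an empty list, as chrComplete does here.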
OK, now let's shift gears and start considering implications.
De novo assembly with reference to an available pangenome will improve genomic alignment and variant calling. What does it mean to align genomic material to a reference genome that has gaps, inaccurate SV calls (both false positives and false negatives), etc.? What problems can arise? Well, briefly, consider a read from a human genome that sequences an SV not found in the reference. Where would this read map to? Well, evidently nowhere. Thus, if one wanted to use that read as part of a large contiguous sequence, one would need to assemble it de novo. This is the approach that will be used by the Human Pangenome Reference Consortium (HPRC), and there is an emerging consensus that this leads on average to increased sensitivity and recall without loss of specificity (to simplify, it is slightly better without a downside). This is just one of many examples, but flowing from 3), above, the idea is that because human genomes are structurally polymorphic, no single reference genome can enable perfect alignment of all sequences. To hammer home this idea, consider that even short-read based sequencing studies identified about 700,000 bases of unique sequence per genome. Because this is based on short-read, NGS data, it is an underestimate; specifically, SVs in highly repetitive regions are most likely to be missed. It is these regions, which in fact contribute the most to differences between any two people, that are most poorly aligned and represented if alignment to a linear reference genome is used.
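As a toy illustration of the "where would this read map to?" point, here is a sketch that scores a read by the fraction of its k-mers that also occur in the reference. This is not how production aligners work (they are far more sophisticated); the sequences, the choice of k=5 and the function names are all made up for illustration.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(read, reference, k=5):
    """Fraction of the read's distinct k-mers that also occur in the reference."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return 0.0
    return len(read_kmers & kmers(reference, k)) / len(read_kmers)


# Toy reference and three toy reads (all hypothetical sequences).
reference = "ACGTTGCAAGGCTTACCGATAGGCTA"
insertion = "TTTTTGGGGG"  # sequence absent from the reference

read_from_reference = reference[5:20]                           # fully represented
read_spanning_sv = reference[5:12] + insertion + reference[12:19]  # flanks match, middle does not
read_inside_sv = "TTTTTGGGGGCCCCC"                              # entirely novel sequence

frac_ref = shared_kmer_fraction(read_from_reference, reference)
frac_span = shared_kmer_fraction(read_spanning_sv, reference)
frac_novel = shared_kmer_fraction(read_inside_sv, reference)
print(frac_ref, frac_span, frac_novel)
```

The read drawn from the reference shares all of its k-mers, the read spanning the insertion shares only those from its flanks, and the read lying wholly inside the novel sequence shares none, which is the situation where a linear reference gives it nowhere to go.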
Use of gapless assemblies for studies of epistasis and haplotype effects. The T2T consortium has now officially been absorbed into the HPRC. The goal of the HPRC is the creation of a human pangenome, to transcend and supersede the singular reference genome assembled by consortia such as GRC. A big part of this will be the ability to produce phased, gapless, diploid human genomes on a routine basis. Chromosome-length phase information currently requires combining short-read sequencing data (Strand-seq; Falconer 2012) with long-read data, which is an important present limitation to future goals. As an example of a locus at which phase information could prove extremely important, consider the HLA locus. Here, the importance of haplotype effects is already established, and will likely continue to retain importance in transplant biology, study of immune syndromes, etc. However, despite this, the high levels of homology, SV, and CNV in these extensive haplotypes have proved problematic for short-read seq-based genomic references.
Finally, to touch on the variant calling part of your post, consider also increased imputation accuracy of SV based on SNV data, which would lend new life to GWAS to date, and perhaps assist efforts to fine-map causal variation for all diseases studied using DNA microarray and NGS data to date. I don't have a good citation for this yet.
There is so much this answer leaves out. I did not even mention genome graphs, and how they are advancing human genomics and powering third generation assembly algorithms. So, to close, please consider a few other treatments of this topic, e.g. Aganezov et al., or Paten et al.
Hope this helps.
VL
As one small example to supplement Vincent Laufer's excellent answer, we have found in our work that we can correctly identify the isotype of human plasma cells when using T2T-CHM13 as the reference for 10x Genomics Chromium profiling but not when using GRCh38.p14. GRCh38.p14 contains a few errors that result in some sequence reads being assigned to the wrong Ig gene. Using GRCh38, many cells appear to express two or more isotypes whereas with T2T-CHM13 every cell expresses just one isotype.
Gordon Smyth - fascinating. sounds like 2 papers are in the works!!
We've posted a preprint of our result now on bioRxiv:
Nie J, Tellier J, Tarasova I, Nutt SL, Smyth GK (2023). The T2T-CHM13 reference genome has more accurate sequences for immunoglobulin genes than GRCh38. bioRxiv https://doi.org/10.1101/2023.05.24.542206 (Posted 25 May 2023).
phenomenal job really great summary and overview!
one question i have that you might be able to shed some light on is whether there is any benchmarking for variant calling with short reads (e.g. Illumina)?
would be interested in seeing whether pushing to T2T would make sense in a clinical genetics setting or whether the additional genome sequence mainly benefits from long-read sequencing experiments
Definitely. Consider @gordonsmyth's answer below - more sensitive detection of plasma cell isotypes would have direct clinical impact for patients with plasma cell dyscrasias.
my suspicion is that this process will begin gradually, and will ultimately take decades before clinics switch over to pangenome references (at least in the united states. not sure where you are).
here, though, adoption of genomic technology for clinical use is heavily regulated and restricted; in addition, it is very difficult to get insurance companies to reimburse testing for new indications. so, I think that while various labs will start this process piecemeal, it will likely take a very long time to become widespread.
but if you are more interested in whether it could be better than in how significant the barriers are, then i think the answer is definitely yes, it could be applied, starting now. If you are considering doing this in a clinical setting, i should very much like to hear more about your efforts. I'm not sure if DMs are possible on Biostars, but perhaps Twitter or email? it's just nice to hear about clinicians living on the cutting edge.
What do you think of the effects of T2T vs GRCh38 in transcriptomics?
Kristoffer Vitting-Seerup - lots, ranging in importance from mundane to sublime. I thought about mentioning that, but my answer above was starting to lose focus so I held off.
at present, SMRT sequencing (Pacific Biosciences) is already being advertised for use of HiFi reads in transcriptomic studies, and there are already a few high impact papers out that might give you some ideas about how it is being used. a search for "long read sequencing" "transcriptomics" should turn them up; if you can't find good articles let me know and I'll dig them back up.
more generally, i think some low hanging fruit early on could include: