Will GRC38/HG20 Be A Multiple Sequence Reference Genome?
12.0 years ago
William ★ 5.3k

Will GRC38/HG20 be a multiple sequence genome reference? In other words, will it incorporate common variation from population sequencing data (the 1000 Genomes Project)?

GRC38 is planned for release in the summer of 2013: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/

This information is useful for everybody currently developing genome sequence analysis tools.

genome reference • 7.4k views
12.0 years ago
deanna.church ★ 1.1k

GRCh38 will continue to have a primary assembly (that is, the non-redundant haploid assembly) and alternate loci. If you look at the patch releases that are currently available, you will see they are tagged as either 'fix' or 'novel'. Fix patches will be incorporated into the primary assembly with the release of GRCh38, while the novel patches will move into other 'ALT_LOCI*' assembly units.
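To make the fix/novel distinction concrete, here is a minimal sketch of how the two patch types are routed at a major release. The `Patch` record and the patch names are hypothetical placeholders for illustration, not real GRC identifiers or a GRC file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patch:
    name: str  # hypothetical scaffold label, not a real GRC identifier
    kind: str  # "fix" or "novel", as tagged in the patch release


def route_at_major_release(patches):
    """Split patches by what happens to them at the next major release."""
    merged_into_primary = []  # fix patches: folded into the primary assembly
    moved_to_alt_loci = []    # novel patches: kept as alternate loci
    for patch in patches:
        if patch.kind == "fix":
            merged_into_primary.append(patch.name)
        elif patch.kind == "novel":
            moved_to_alt_loci.append(patch.name)
        else:
            raise ValueError(f"unknown patch type: {patch.kind!r}")
    return merged_into_primary, moved_to_alt_loci


# Made-up example input.
patches = [
    Patch("patch_A", "fix"),
    Patch("patch_B", "novel"),
    Patch("patch_C", "fix"),
]
primary, alt_loci = route_at_major_release(patches)
print(primary)   # ['patch_A', 'patch_C']
print(alt_loci)  # ['patch_B']
```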

I'm not entirely sure what you mean when you ask 'will it incorporate variation from the population sequence data (the 1000 genomes project)'. We are certainly using data from this project to correct problems in the reference, particularly in places where a base in the reference is not seen in the 1000 genomes cohort. We are also looking very hard at the 'decoy' sequence used in the analysis pipeline to ensure that we can represent that sequence in the assembly.


Do the novel patches cover the difference between the diploid and haploid sequence of an individual's genome, or do they cover the common genome variation found in the population? Do they cover the full spectrum of variation: SNP, indel, CNV, SV? Is there a coordinate system to point to regions in the novel patches (with CNVs/SVs they differ in length from the primary assembly)?


The assembly is not trying to represent all known variants. To be honest, the hard-and-fast rules for when to create an alternate are still being developed, but operationally, we make an alternate when:

* There is sufficient allelic diversity that you can't easily annotate the variation on a haploid reference (think MHC or MAPT): http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/region.cgi?name=MHC&asm=GRCh37.p10
* A structural variant creates a new gene (APOBEC): http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/region.cgi?name=APOBEC&asm=GRCh37.p10
* There is an insertion that adds a significant amount of sequence (>5 kb), so that adding this sequence is likely to help NGS alignment: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/region.cgi?name=REGION2&asm=GRCh37.p10

All sequences in the assembly have an accession.version, so they have a native coordinate system. Here is a variant in dbVar that has been placed on multiple assemblies and on a non-chromosomal sequence: http://www.ncbi.nlm.nih.gov/dbvar/variants/nsv7734/

This is not really much different from an unplaced or unlocalized sequence. Of course, for the alts and patches, we also release the alignments of the alternate sequence to the chromosome, so you do have the chromosome context. For example, you could still report a cytogenetic band, which often gives folks enough context to 'know' where they are in the genome.
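For tool developers, the relationship between an alternate's native coordinates and its chromosome context can be sketched roughly as follows. This is an illustration only: the `AlignedBlock` structure, the `alt_to_chrom` helper, and all coordinates are made up, and it assumes the released alignment has already been reduced to ungapped, forward-strand blocks. It does not parse any actual GRC alignment file.

```python
from typing import NamedTuple, Optional, Sequence


class AlignedBlock(NamedTuple):
    alt_start: int    # 1-based start on the alternate sequence (native coordinates)
    chrom_start: int  # 1-based start of the matching stretch on the chromosome
    length: int       # length of the ungapped aligned block


def alt_to_chrom(pos: int, blocks: Sequence[AlignedBlock]) -> Optional[int]:
    """Map a 1-based position on the alternate to the chromosome.

    Returns None when the position lies in sequence that has no
    counterpart on the chromosome (for example, a large insertion).
    """
    for block in blocks:
        offset = pos - block.alt_start
        if 0 <= offset < block.length:
            return block.chrom_start + offset
    return None


# Hypothetical alternate: 2 kb aligned, then a 6 kb insertion, then 3 kb aligned.
blocks = [
    AlignedBlock(alt_start=1, chrom_start=1_000_001, length=2_000),
    AlignedBlock(alt_start=8_001, chrom_start=1_002_001, length=3_000),
]

print(alt_to_chrom(1_500, blocks))  # 1001500 (inside the first aligned block)
print(alt_to_chrom(5_000, blocks))  # None (inside the insertion)
print(alt_to_chrom(8_001, blocks))  # 1002001 (first base after the insertion)
```

A position that returns `None` can still be reported in the alternate's own accession.version coordinates, with the flanking aligned blocks giving the approximate chromosome location.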


Ok, many thanks for the clarification.
