Question

What is the difference between GRCh37 and hs37? And hg19?

8

6.3 years ago

juanfdelahoz ▴ 80

Hi! I've been struggling with the naming conventions of human reference genomes...

I know hg19 and GRCh37 are the same, but with different names for each chromosome.

I know b37 is only the 25 longest sequences from GRCh37 (1-22,X,Y,MT).

I know we are now on GRCh38 (or hg38) and we should be using that one.

However, for some reason, researchers in human genomes still use hg19...

Now, I found a reference called hs37 and I don't understand where it comes from. And there's not a single place where all this mess is explained. And all Heng Li says is: "If you map reads to GRCh37 or hg19, use hs37-1kg" : |

Other organisms have smaller communities and their genomes are better standardized, but humans... omg!

Thanks!

assembly reference genome hg19 hg38 hs37 • 17k views

updated 3.3 years ago by DavidStreid ▴ 90 • written 6.3 years ago by juanfdelahoz ▴ 80

1

juanfdelahoz not looking for grammar correction, but can you change "hg37" to "hs37" in title and tags?

6.3 years ago by cpad0112 21k

0

The title originally had hs37, which I changed to hg37. I've changed it back now.

6.3 years ago by Ram 44k

1

This is also an insightful piece from Heng Li:

http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use

6.3 years ago by colindaven 7.0k

score 10 · Answer 1 · 2018-07-24

10

6.3 years ago

GenoMax 147k

While some of this is confusing for someone just starting out, there is order to the seemingly arcane nomenclature.

GRCh38/hg38 is the current release of the human genome. You should indeed be using this since it has been around for ~5 years at this point. You can find the data for it at NCBI's GRCh38 site.

GRCh37 is synonymous with hg19 (there is no "hg37"). You can find the information about this release at NCBI's GRCh37 site.

hs37 is a special genome reference prepared for the 1000 Genomes Project by this method. You can find that data here.

Ultimately GENCODE is the ~~organization~~ project responsible for managing human/mouse genome data. They provide the authoritative genome data that is used by everyone including NCBI/UCSC/Ensembl.

6.3 years ago by GenoMax 147k

0

I recall there was an extensive discussion on differences between GRCh37 and hg19 somewhere. Pierre was involved, I think.

6.3 years ago by Ram 44k

0

This is probably the sequence archive for hs37: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/ and ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/

6.3 years ago by cpad0112 21k

0

"Ultimately GENCODE is the organization responsible for managing human/mouse genome data. They provide the authoritative genome data that is used by everyone including NCBI/UCSC/Ensembl."

I believe you mean the Genome Reference Consortium manages the human and mouse genome data. GENCODE is an annotation group at EBI and is not part of the GRC, although the EBI is a member.

6.3 years ago by tdmurphy ▴ 230

0

"Project" is a better designation for GENCODE. Correction made above. The GRC releases genome builds, while annotation is produced by GENCODE project members.

6.3 years ago by GenoMax 147k

0

Why are hg17, hg18, hg19 followed by hg38 and not "hg20" as one would expect?

5.7 years ago by BioinformaticsLad ▴ 200

1

hg19 is equivalent to GRCh37. I recall reading somewhere that they decided to unify the version numbers for hg and GRCh conventions, and so now it is hg38/GRCh38.

5.7 years ago by Ram 44k

0

They should have gone one step further and unified the references as well!

5.7 years ago by BioinformaticsLad ▴ 200

1

There is only one reference sequence. There are annotations that come from different sources.

With graph-based assemblies coming in the near future, reference sequences will gain a new complexity.

5.6 years ago by GenoMax 147k

1

The hg names were created by UCSC and reflect the versions that were included in their browser. The correct names for the assemblies were designated by the creators and were always NCBI36, GRCh37 etc. When GRCh38 came out, UCSC agreed that their system of changing the assembly names was confusing, and decided to go with the correct numbering, but ultimately stuck with their hg prefixes.
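
As a rough illustration of the name correspondence described here, a small Python lookup table might look like the sketch below. The names `UCSC_TO_ASSEMBLY` and `assembly_for` are hypothetical, and the table only lists the builds mentioned in this thread.

```python
# Hypothetical lookup table: UCSC browser name -> assembly name given by its creators.
UCSC_TO_ASSEMBLY = {
    "hg17": "NCBI35",
    "hg18": "NCBI36",
    "hg19": "GRCh37",
    "hg38": "GRCh38",  # numbering was unified from this release onward
}


def assembly_for(ucsc_name):
    """Return the assembly name for a UCSC build name, or None if it is not listed."""
    return UCSC_TO_ASSEMBLY.get(ucsc_name)
```

Note that "hg37" is not a key: looking it up returns None, which matches the point that no build by that name exists.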

5.0 years ago by Emily 24k

0

I was under the impression there were slight differences. If it's just a different naming convention, are GRCh38 and hg38 interchangeable?

5.0 years ago by BioinformaticsLad ▴ 200

score 8 · Answer 2 · 2018-08-03

This is what I have found so far. Please correct me if I am wrong.

GRCh37 w/o patches includes the primary assembly (22 autosomes, X, Y, and non-chromosomal supercontigs) and alternate scaffolds, but not a reference mitogenome. Non-chromosomal supercontigs are the unlocalized and unplaced scaffolds.

The rCRS reference mitogenome in GRCh37 was included only after patch 2 (GRCh37.p2). This patch also included some fix and novel patches.

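UCSC hg19 = GRCh37 w/o patches + African Yoruba mitogenome (not rCRS). UCSC hg19 also differs in: Naming conventions (e.g. chromosome X is chrX in UCSC vs. X in GRC). Coordinate conventions in some file formats (UCSC tables and BED files use 0-based, half-open coordinates, whereas the UCSC browser display and Ensembl use 1-based coordinates). The latter is a file-format convention, not a difference in the assembly itself.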
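
To make the naming difference concrete, here is a minimal Python sketch. The helper name `ucsc_to_grc` is hypothetical, and it only covers the primary chromosomes. It deliberately refuses to map hg19's chrM to MT because, as noted above, the two are different sequences (Yoruba vs. rCRS), so renaming alone would be wrong.

```python
from typing import Optional

# Names of the primary chromosomes in GRC-style naming.
PRIMARY_CHROMOSOMES = {str(i) for i in range(1, 23)} | {"X", "Y"}


def ucsc_to_grc(name: str) -> Optional[str]:
    """Map a UCSC-style hg19 primary chromosome name to its GRCh37-style name.

    Returns None for anything that cannot be safely renamed.
    """
    if name == "chrM":
        # hg19 chrM is not rCRS, so it is not equivalent to "MT".
        return None
    if not name.startswith("chr"):
        return None
    core = name[3:]
    if core in PRIMARY_CHROMOSOMES:
        return core
    # Unlocalized, unplaced and haplotype names are not handled in this sketch.
    return None
```

For example, `ucsc_to_grc("chrX")` gives `"X"`, while `ucsc_to_grc("chrM")` gives `None`.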

Note also that Ion torrent uses a hg19 with replaced mitogenome (rCRS instead of Yoruba Sequence).

b37 is hs37-1kg. It does not include only the "25 longest sequences from GRCh37 (1-22,X,Y,MT)"; it is a 1000 Genomes convention that includes:

- The 24 "relatively complete" chromosomal sequences (named "1" to "22", "X" and "Y"), downloaded individually from ENSEMBL.
- The GRCh37.p2 (rCRS) mitochondrial sequence (named "MT"), downloaded from MITOMAP or NCBI.
- The unlocalized sequences, named after their accession numbers, such as "GL000191.1", "GL000194.1", etc.
- The unplaced sequences, named after their accession numbers, such as "GL000211.1", "GL000241.1", etc.

Only the alternate loci were not included in the b37 dataset.
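
The categories above can be told apart by sequence name alone. The following Python sketch shows one way to do that; the function name `classify_b37_name` and the category labels are made up for illustration, and the GL pattern is inferred only from the example accessions quoted above. Names alone cannot separate unlocalized from unplaced scaffolds, so the sketch lumps them together.

```python
import re

PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y"}
# Pattern inferred from examples such as "GL000191.1"; treat it as an assumption.
GL_ACCESSION = re.compile(r"^GL\d{6}\.\d+$")


def classify_b37_name(name: str) -> str:
    """Assign a b37-style sequence name to a coarse category."""
    if name in PRIMARY:
        return "chromosome"
    if name == "MT":
        return "mitochondrion"
    if GL_ACCESSION.match(name):
        return "unlocalized_or_unplaced"
    return "other"
```

A UCSC-style name such as `chr1` falls into `"other"`, which is a quick hint that a file is not using b37 naming.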

hs37d5 (known also as b37 + decoy) was released by The 1000 Genomes Project (Phase II), which introduced additional sequence (BAC/fosmid clones, HuRef contigs, Epstein-Barr Virus genome) to the b37 reference to help reduce false positives for mapping. Note that this one uses the primary assembly of GRCh37.p4 (not the one of GRCh37 w/o patches).

As for hs37 (without -1kg), I think it is generated only by bwakit in BWA, and according to their manual it corresponds to b37 + EBV (Epstein-Barr Virus genome). The EBV genome is also found in hs37d5 and GRCh38; it is included because EBV is used in molecular biology to transform (immortalize) B cells and because it naturally infects B cells in ~90% of the world population.
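
Putting the pieces together, one can make a rough guess at which GRCh37 flavor a file uses from its set of sequence names (for example, the first column of a FASTA index). This is only a heuristic sketch: the function name `guess_grch37_flavor` is hypothetical, the contig names `hs37d5` (decoy) and `NC_007605` (EBV) are assumed to match how hs37d5 is commonly distributed, and the input is assumed to already be a GRCh37-derived reference.

```python
def guess_grch37_flavor(names):
    """Guess the GRCh37 flavor from an iterable of sequence names (heuristic)."""
    names = set(names)
    if "chr1" in names:
        return "hg19-style (UCSC names)"
    if "1" not in names:
        return "unknown"
    if "hs37d5" in names:
        # Assumed name of the concatenated decoy sequence.
        return "hs37d5"
    if "NC_007605" in names:
        # Assumed name of the EBV sequence; b37 plus EBV, no decoy.
        return "hs37"
    return "b37 / hs37-1kg"
```

The checks run from most to least specific, so a file containing both the decoy and EBV is reported as hs37d5 rather than hs37.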

There is no hg37.