Question

hg19 vs hg38 in two pictures

5

10.1 years ago

Nancy Ouyang ▴ 170

Took these two screenshots recently, thought you all might enjoy them. They show our evolving understanding of the complexity of the human genome at the population scale, as reflected in the changes in the reference genome between hg19 and hg38.

(ignore the temp folder, not part of hg19)

Note: I make no claims as to the actual accuracy or completeness of these two directory listings; I haven't taken a close look at them in a while. But they're interesting from a big-picture perspective.

hg19 / GRCh37: 2009
hg38 / GRCh38: 2013

hg19 hg38 • 29k views

ADD COMMENT • link updated 2.9 years ago by Ram 45k • written 10.1 years ago by Nancy Ouyang ▴ 170

Ram · Answer 1 · 2015-03-27

1

10.1 years ago

Brian Bushnell 20k

A lot of the stuff in the unplaced contigs is junk. I excluded all sequences other than primary chr1-22, X, Y, and M from JGI's human contaminant filtering due to too many false-positive hits, largely from fungi - in other words, many of those unplaced or alternate sequences are not human, but rather environmental contaminants or human pathogens.

I'm sure most of it is valid, but I would not rely on it.
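For anyone who wants to try the same restriction, here is a minimal sketch of keeping only the primary sequences from a FASTA file. It assumes UCSC-style names (chr1 to chr22, chrX, chrY, chrM); the function names `is_primary` and `filter_primary` are made up for illustration and are not from any JGI tool.

```python
import re

# Assumption: UCSC-style sequence names. Anything else (chrUn_*, *_random,
# *_alt, ...) is treated as non-primary and dropped.
PRIMARY = re.compile(r"^chr(?:[1-9]|1[0-9]|2[0-2]|X|Y|M)$")


def is_primary(name):
    """Return True for chr1-22, chrX, chrY and chrM only."""
    return PRIMARY.match(name) is not None


def filter_primary(fasta_lines):
    """Yield the header and sequence lines of primary records only."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # The record name is the first token after '>'.
            fields = line[1:].split()
            keep = bool(fields) and is_primary(fields[0])
        if keep:
            yield line
```

Typical use would be something like `out.writelines(filter_primary(open(path)))`, with the file paths being whatever your setup uses.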

ADD COMMENT • link updated 2.9 years ago by Ram 45k • written 10.1 years ago by Brian Bushnell 20k

1

I excluded all sequences other than primary chr1-22, X, Y,

By doing this you're creating false-positive mappings. A read that would have mapped at 100% to chrUn_xxx_random could instead be mapped at 90% to 'chr2' -> false positive called.

ADD REPLY • link updated 2.9 years ago by Ram 45k • written 10.1 years ago by Pierre Lindenbaum 166k

0

I think you might be misunderstanding the point of the filter. It's to remove human reads from plant and fungal sequencing projects. Ignoring or masking parts of the human reference has absolutely no chance of creating false positives, only false negatives. In this case, a false positive is a fungal read that gets removed because it maps to the human reference - which would yield an incomplete assembly - and a false negative is a human read that passes through the filter. False negatives can be cleaned up later by mapping the assembly to nt, but the earlier you can remove contamination, the better.

ADD REPLY • link updated 2.9 years ago by Ram 45k • written 10.1 years ago by Brian Bushnell 20k

1

You need to show proof to back up such strong words. I have collaborated with GRC a few times. They are extremely careful about sequences added to the human reference genome. They routinely blast against nt to check for contamination and are very familiar with GenBank. They told me which individual fosmids/BACs are likely to be mislabeled as human. Given this, I can hardly believe contamination is a severe problem with the human reference genome. Actually, if you check read mapping, these sequences are mostly covered by human reads. Are you saying most human samples are contaminated with fungi? If your evidence is just fungal hits, they could be low-complexity sequences. Unplaced/unlocalized/alt sequences tend to be enriched for segmental duplications, repeats, satellites and low-complexity sequences. Because of this, many summary statistics on these sequences are very different from those on chromosomal sequences. Have you checked that your fungal hits are not low-complexity? Are you sure they are truly not human contamination?

If you have found a single piece of sequence that you suspect is contamination, let GRC know. They will take it very seriously.

ADD REPLY • link 10.1 years ago by lh3 33k

0

Yes, I verified quite thoroughly that the contaminant sequences were long (several hundred bp), high-complexity, non-ribosomal, similar only to other fungal sequences but no vertebrate sequences, and exactly matched fungi that are known to infect humans.

It's not like the non-primary sequences are all entirely contaminant, or mostly contaminant; I would not say that contamination is a severe problem with the human reference. "A lot of the stuff in the unplaced contigs is junk" was probably an overstatement; I'll change that to "I found multiple unambiguously fungal sequences in the non-primary files and lost confidence in being able to use them for a filter that couldn't have any false positives." For each one, it took a long time to figure out whether it was a coincidental match, or human contamination in a fungal reference, or fungal contamination in the human reference, or both human and fungal references contaminated with a synthetic sequence - all three contamination scenarios occurred.

Anyway, you're welcome to replicate this, it's just a huge amount of work. Mask the human reference to remove low-complexity and ribosomal elements, download all fungal assemblies from Mycocosm, shred them into 100bp pieces, map them to human requiring ~98% identity, and look for where multiple shreds from the same fungus hit consecutively. Then blast those shreds, and either they will hit human, chimp, orangutan, etc. and possibly the fungus in question - in which case it's human contamination of the fungal assembly - or they'll hit only human and a bunch of fungi, but no other vertebrates - in which case it's fungal contamination in the human genome.
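The shredding and "consecutive hits" steps can be sketched roughly like this in Python. This is not the actual pipeline, just an illustration of the bookkeeping; the mapping and blast steps are left to whatever aligner you use, and the names `shred`, `passes_identity` and `consecutive_runs`, along with the `min_run` parameter, are hypothetical.

```python
def shred(seq, size=100, step=100):
    """Cut a sequence into fixed-size pieces, returned as (offset, piece)."""
    return [(i, seq[i:i + size]) for i in range(0, len(seq) - size + 1, step)]


def passes_identity(matches, aligned_length, min_identity=0.98):
    """True if an alignment reaches the identity threshold (~98% by default)."""
    return aligned_length > 0 and matches / aligned_length >= min_identity


def consecutive_runs(hit_offsets, step=100, min_run=2):
    """Group the offsets of shreds that hit human into runs of adjacent shreds.

    hit_offsets: offsets (within one fungal contig) of shreds that mapped.
    Runs shorter than min_run (an assumed cutoff) are discarded.
    """
    runs, current = [], []
    for off in sorted(set(hit_offsets)):
        if current and off - current[-1] == step:
            current.append(off)  # directly follows the previous shred
        else:
            if len(current) >= min_run:
                runs.append(current)
            current = [off]  # start a new candidate run
    if len(current) >= min_run:
        runs.append(current)
    return runs
```

The runs that come out of `consecutive_runs` are the candidates you would then blast and inspect by hand, which is where most of the work is.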

ADD REPLY • link updated 2.9 years ago by Ram 45k • written 10.1 years ago by Brian Bushnell 20k

0

I am not sure your procedure is conclusive. There are only a few genomes' worth of primate sequences in nt. It is not hard to find some megabases of true human sequence with no hits to other primates. If your Mycocosm samples are contaminated by these human-specific sequences, you will find high identity to the human genome but poor alignment to primates. As I said, non-chromosomal sequences tend to be duplications and repeats. Probably they also tend to be different from other primates. Then you will see non-chromosomal sequences enriched with "contaminations". A better approach to confirm these potential contaminations is to check whether they are present in other human resequencing data. If they are present, they are probably true human sequences, on the assumption that Mycocosm fungi are not common contaminants in human resequencing.

Could you paste a couple of such sequences here? I can help to check if they are present in other human data. I am also pretty sure that GRC are eager to chase and fix all known contaminations in the human reference genome.

ADD REPLY • link updated 2.9 years ago by Ram 45k • written 10.1 years ago by lh3 33k

0

Oh, interesting. There are still a lot of alt_ref_loci that are actually placed on chromosomes, even if you exclude all the chrUn sequences -- have you found those to be suspect? For instance, my understanding was that for places like the KIR regions, hg19 smooshed them all into one-ish region, and hg38 instead let them exist as alt_ref_loci, which helps explain what the two screenshots are showing. See http://www.ncbi.nlm.nih.gov/gene/3803 (section title: RefSeqs of Annotated Genomes: Homo sapiens Annotation Release 107)

ADD REPLY • link updated 2.9 years ago by Ram 45k • written 10.1 years ago by Nancy Ouyang ▴ 170

0

I didn't really spend a lot of time figuring out exactly which ones might be reliable, but the unplaced ones were definitely the worst.

ADD REPLY • link 10.1 years ago by Brian Bushnell 20k

Ram · Answer 2 · 2015-03-28

1

10.1 years ago

Giovanni M Dall'Olio 28k

I wonder if in the future we will have a reference Markov model, or a PSSM of the human genome, instead of a single sequence. The addition of the alternative loci in hg38 is already a step forward, but in the end, what sense does it make to work on a unique sequence for all human individuals? There is so much variation between human populations and people that a single reference sequence can't really represent them all uniformly.
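To make the PSSM idea a bit more concrete, here is a toy sketch that builds per-position base frequencies from a few already-aligned haplotype sequences and scores a read against them. Everything in it is an assumption for illustration: the uniform 0.25 background, the pseudocount of 1, and the function names. It ignores indels and structural variation, which a real population reference would have to handle.

```python
import math


def position_frequencies(aligned, alphabet="ACGT", pseudocount=1.0):
    """Per-position base frequencies from equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all sequences must have the same length")
    matrix = []
    for i in range(length):
        # Pseudocounts keep unseen bases from getting probability zero.
        counts = {base: pseudocount for base in alphabet}
        for s in aligned:
            if s[i] in counts:
                counts[s[i]] += 1
        total = sum(counts.values())
        matrix.append({base: c / total for base, c in counts.items()})
    return matrix


def log_odds(matrix, background=0.25):
    """Turn frequencies into log2 odds against a uniform background."""
    return [{b: math.log2(p / background) for b, p in col.items()}
            for col in matrix]


def score(seq, pssm):
    """Sum the per-position log-odds of seq (same length as the matrix)."""
    return sum(col[base] for base, col in zip(seq, pssm))
```

For example, with `pssm = log_odds(position_frequencies(["ACGT", "ACGA", "ACGT"]))`, the read "ACGT" scores higher than "TTTT", since it agrees with the haplotypes at every position.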

ADD COMMENT • link 10.1 years ago by Giovanni M Dall'Olio 28k

1

I agree -- I feel like aligning reads to the reference genome is very much a bootstrapping process which gradually gets more and more "true" over time, sort of like how we built more and more precise machining tools over time (or how, if today we personally want to bootstrap / build a machine shop from scratch, we can use a less-precise lathe to build a more-precise lathe). The aim isn't to represent all variation in a single genome, but to build a map that allows us to bootstrap until we are able to read each genome in a population as it really is. Then the whole population of genomes encompasses the (genomic) variation inside a population. In the meantime, the alt_ref_loci are aids to map reads to the correct place for highly variant regions in the population.

ADD REPLY • link 10.1 years ago by Nancy Ouyang ▴ 170

0

"This read was a 95% match to one of 12 possible alleles, each with a permutation factor of approximately ∂/0.3, in the Medieval European Human Genome."

God help us all :)

ADD REPLY • link updated 2.9 years ago by Ram 45k • written 10.1 years ago by John 13k

0

A graph genome is being developed by different organizations right now (Seven Bridges and David Haussler at UCSC).

ADD REPLY • link 9.9 years ago by Ying W ★ 4.3k

0

Seven Bridges has announced Graph -- see https://www.sevenbridges.com/graph/ -- but of course it's not open source and pricing is not discussed on the website.

ADD REPLY • link 7.2 years ago by Owen S. ▴ 370