The libraries (Table 1) were prepared from DNA obtained from anonymous human donors in accordance with US Federal Regulations for the Protection of Human Subjects in Research (45CFR46) and following full review by an Institutional Review Board. Briefly, the opportunity to donate DNA for this purpose was broadly advertised near the two laboratories engaged in library construction. Volunteers of diverse backgrounds were accepted on a first-come, first-taken basis. Samples were obtained after discussion with a genetic counsellor and written informed consent. The samples were made anonymous as follows: the sampling laboratory stripped all identifiers from the samples, applied random numeric labels, and transferred them to the processing laboratory, which then removed all labels and relabelled the samples. All records of the labelling were destroyed. The processing laboratory chose samples at random from which to prepare DNA and immortalized cell lines. Around 5–10 samples were collected for every one that was eventually used. Because no link was retained between donor and DNA sample, the identity of the donors for the libraries is not known, even by the donors themselves. A more complete description can be found at http://www.genome.gov/10000921
I was not involved in the human genome project but I have heard off the record that as much as 80% of what wound up being in the final draft was from a single individual. Perhaps some others from the community can comment on this... It's difficult to comment on whether this genome would be viable in a real cell. The reference sequence is a haploid representation for one thing. There are other arbitrary aspects of it. Why do you ask?
The reference assembly is no longer just a haploid representation. The Genome Reference Consortium (http://genomereference.org) has developed an assembly model that allows for representation of more than one allele at complex loci. There are currently 8 representations of the MHC: the one used in the chromosome assembly and 7 others that are aligned to the chromosome but are scaffolds in the full assembly.
You can see the current tiling paths here: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/overlap/
About 70% of GRCh37 is from the RP11 library. Everything else is from about 50 other libraries.
Remember: an assembly is just a model of a genome, and in the case of the human reference assembly it does not model any one individual genome.
Great point. However, I have encountered many groups and projects using the reference genome sequence where the first thing they do is toss out those extra alleles to simplify the analysis. The alleles are generally never given another thought. There must be great examples out there of studies making use of them though ... at least I hope there are. :)
In terms of next generation sequence alignment that is true. However, both NCBI and Ensembl annotate these sequences so we do have a sense of the gene content on these sequences. Currently, there are about 180 genes that are only available on the alternate and patch alleles. However, the current tool chain is set up for a single, haploid consensus which makes using these sequences difficult. I just gave a poster at AGBT discussing a couple of different approaches for using these: http://figshare.com/articles/AGBT2013_Poster/642431
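For anyone wondering what "tossing out the extra alleles" looks like in practice, here is a rough sketch in Python. It only sorts sequence names into primary and alternate groups. It assumes UCSC-style naming, where alternate haplotypes end in `_hapN` (hg19) and alternate or fix-patch scaffolds end in `_alt` or `_fix` (hg38). The function name and the regular expression are my own illustration, not part of any NCBI, Ensembl, or GRC tool, and other naming schemes (for example, accession-only names) would need a different rule.

```python
import re

# Assumption: UCSC-style names such as "chr6_cox_hap2" (hg19) or
# "chr6_GL000250v2_alt" / "..._fix" (hg38). Not an official rule.
ALT_PATTERN = re.compile(r"_(hap\d+|alt|fix)$")


def split_sequence_names(names):
    """Split sequence names into (primary, alternate) lists, preserving order."""
    primary, alternate = [], []
    for name in names:
        if ALT_PATTERN.search(name):
            alternate.append(name)   # alternate locus or patch scaffold
        else:
            primary.append(name)     # chromosomes and everything else
    return primary, alternate


if __name__ == "__main__":
    example = ["chr6", "chr6_cox_hap2", "chr6_GL000250v2_alt", "chrM"]
    primary, alternate = split_sequence_names(example)
    print("primary:", primary)
    print("alternate:", alternate)
```

Keeping the alternate list around, rather than discarding it, makes it easier to come back later and check whether a gene of interest is only annotated on one of those scaffolds.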
@Malachi, thanks, great answer! I just asked out of curiosity ;-)