The first near-complete assembly of the hexaploid bread wheat genome was just published.
It is built from a hybrid data source (Illumina and PacBio reads). The final assembly contains 15,344,693,583 bases and has a weighted average (N50) contig size of 232,659 bases.
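For anyone unfamiliar with N50: it is the contig length L such that contigs of length L or longer together hold at least half of the assembled bases. Here is a minimal sketch of that definition. The function name `n50` and the toy contig lengths are made up for illustration and are not data from the wheat assembly.

```python
def n50(contig_lengths):
    """Return the N50: the length of the contig at which the running
    total, taken from longest to shortest, first reaches half of all bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point issues with total / 2.
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Toy, hypothetical contig lengths (54 bases in total).
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_contigs))  # 10 + 9 + 8 = 27 reaches half of 54, so this prints 8
```

A larger N50 means more of the genome sits in long contigs, which is why it is usually quoted next to the total assembly size.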
Data:
The first data set consisted of 7.06 billion Illumina reads containing approximately 1 trillion bases of DNA. The Illumina reads were 150-bp, paired reads from short DNA fragments, averaging 400 bp in length. Using an estimated genome size of 15.3 Gbp, this represented 65-fold coverage of the genome. The second data set used Pacific Biosciences single-molecule (SMRT) technology to generate 55.5 million reads with an average read length just under 10,000 bp, containing a total of 545 billion bases of DNA, representing 36-fold coverage of the genome.
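The coverage figures quoted above are easy to check: fold coverage is just total sequenced bases divided by genome size. This is a quick sanity check using only the rounded numbers from the text. The variable and function names are my own, and "approximately 1 trillion" is taken as exactly 1e12, so the results are approximate.

```python
GENOME_SIZE_BP = 15.3e9   # estimated genome size used in the text

def fold_coverage(total_bases, genome_size=GENOME_SIZE_BP):
    """Fold coverage = sequenced bases / genome size."""
    return total_bases / genome_size

illumina_bases = 1e12     # "approximately 1 trillion bases" (rounded)
pacbio_bases = 545e9      # 545 billion bases
pacbio_reads = 55.5e6     # 55.5 million reads

illumina_cov = fold_coverage(illumina_bases)
pacbio_cov = fold_coverage(pacbio_bases)
pacbio_mean_read_len = pacbio_bases / pacbio_reads

print(f"Illumina coverage: {illumina_cov:.1f}x")              # about 65x
print(f"PacBio coverage:   {pacbio_cov:.1f}x")                # about 36x
print(f"PacBio mean read:  {pacbio_mean_read_len:,.0f} bp")   # just under 10,000 bp
```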
The data analysis is similarly spectacular (and this is only a part of it). The assemblers used were MaSuRCA, Celera Assembler, and FALCON.
The total CPU time was ~470,000 CPU hours (53.7 years), which was only made feasible by running it on a grid with thousands of jobs running in parallel (the maximum number was 3,320) for some of the major steps. The total elapsed time was just over 5 months.
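To put the CPU time in perspective, the conversion from CPU hours to years is simple arithmetic. The sketch below reproduces it with a 365-day year. The second figure is my own rough estimate, not from the paper: it assumes "just over 5 months" is about 5 months of 30 days and divides CPU hours by that wall-clock time to get an implied average number of busy cores.

```python
CPU_HOURS = 470_000            # "~470,000 CPU hours" from the text
HOURS_PER_YEAR = 24 * 365      # non-leap year

cpu_years = CPU_HOURS / HOURS_PER_YEAR
print(f"{cpu_years:.1f} CPU years")   # about 53.7

# ASSUMPTION: elapsed time approximated as 5 months of 30 days each.
elapsed_hours = 5 * 30 * 24
avg_busy_cores = CPU_HOURS / elapsed_hours
print(f"~{avg_busy_cores:.0f} cores busy on average")   # roughly 130
```

That average is far below the stated peak of 3,320 parallel jobs, which fits the description that only some of the major steps ran at that width.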
When combined with the earlier steps, the entire assembly process took 6.5 months!