A guide to modern genome assembly
1
1
3.3 years ago
predeus ★ 2.1k

Hello all,

I was wondering if there is a guide to modern genome assembly, that is, assembly that often combines short and long reads, and sometimes Hi-C as well? What tricks can be used for typical problems (contaminated reads, highly heterozygous genomes, etc.)?

Thank you in advance, as always

long-reads illumina genome-assembly • 2.2k views
ADD COMMENT
0

Maybe this could become a part of the Biostars Handbook?

ADD REPLY
0

That would be great - but it does seem like a lot of work :)

ADD REPLY
2
3.3 years ago
Michael 55k

There are some guidelines from the Earth BioGenome Project here: https://www.earthbiogenome.org/assembly-standards

These address quality requirements and standards rather than how to achieve them, but they also mention the Vertebrate Genomes Project (VGP) assembly pipeline. It might be useful to look at the tools the VGP pipeline uses and at related publications, e.g. the Rhie et al., 2021 paper. Indeed, its Methods and Supplementary sections might sufficiently address the issues you mention.

What helps against a highly heterozygous genome, btw? High levels of inbreeding over many, many generations (but that has maybe fallen out of fashion).

ADD COMMENT
1

What helps against highly heterozygous genome, btw? High levels of inbreeding over many, many generations

One could also consider sequencing haploid tissue/cells?

ADD REPLY
0

Like eggs? That would make it way too easy :)

But even eggs are not all equal, especially when we need more than one to get enough DNA. Also, in some species, one sex has low recombination or none, which makes it easier as well.

Also, one has to take into account which tissues will be relatively free of contamination and still deliver enough DNA. We used ovaries, but samples from multiple individuals still had to be pooled. Here is what was done with the salmon lice parasites:

Inbred adult female Lepeophtheirus salmonis salmonis were sampled after 27 generations of inbreeding following established protocols for rearing lice on salmon and the lice were found to be homozygous for 12 out of 13 published microsatellite loci (Hamre, et al. 2009; Skern-Mauritzen, et al. 2013). To reduce the amount of non-salmon louse contamination before sequencing, DNA was purified from either dissected ovaries, or from starved (2 days) females treated with 3% Virkon® in sterilized seawater. Batches of 10-30 pairs of L. salmonis salmonis ovaries were dissected and snap frozen in 1.5 ml tubes and stored at -80 °C until DNA extraction...

(This is not a long-read "modern" genome, but this sampling approach is still better than just "I collected 20 unrelated, unspecified, dirty individuals and stirred them all up together".) So, a good sampling strategy gives better genomes.

ADD REPLY
0

All very true.

I have been, indirectly, part of an insect genome project, and a very similar strategy was indeed applied in that one as well.

However... 1-0 for the plant field :-D (though it is not always possible, in some cases we can 'grow' haploid tissue).

ADD REPLY
1

True, but some plants have other "problems", such as hybrid polyploidy (as in wheat, where this was itself an interesting feature to analyze).

Besides, as nice as it is (I am still happy I didn't have to do any of this), the inbreeding strategy becomes rather unsustainable once one's goal becomes to sequence each species on the planet.

ADD REPLY
1

absolutely.

let's call it a draw then ;)

ADD REPLY
0

I forgot about the doubled haploid techniques.

... advantage plants!

ADD REPLY
0

Thank you for the reference, I'll definitely take a look.

What helps against a highly heterozygous genome, btw?

With heterozygous genomes, highly accurate long reads like HiFi can resolve haplotigs nicely. Sometimes it can work with Nanopore too, if you have a high-quality set of reads and have re-basecalled them with the newest "super high-accuracy" (basically Bonito-style) models.

If you know the genomes of "father" and "mother", you can bin the reads and assemble them separately.
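
To make the binning idea concrete, here is a minimal toy sketch of k-mer-based read binning (the general idea usually called trio binning). It is not how any particular assembler implements it: the function names, the in-memory sets, and the simple majority vote are all assumptions made for illustration, and real data would need a proper k-mer counter, error filtering, and a much larger k.

```python
# Toy illustration only: all names here are hypothetical, not from any tool.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def kmers(seq, k):
    """Yield canonical k-mers of seq, skipping any window that contains N."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if "N" not in window:
            yield canonical(window)


def parent_specific_kmers(maternal_seqs, paternal_seqs, k):
    """Return (maternal-only, paternal-only) k-mer sets from parental sequences."""
    mat = {km for s in maternal_seqs for km in kmers(s, k)}
    pat = {km for s in paternal_seqs for km in kmers(s, k)}
    return mat - pat, pat - mat


def bin_read(read, mat_only, pat_only, k):
    """Assign a read to the parent whose private k-mers it shares more of."""
    mat_hits = pat_hits = 0
    for km in kmers(read, k):
        if km in mat_only:
            mat_hits += 1
        elif km in pat_only:
            pat_hits += 1
    if mat_hits > pat_hits:
        return "maternal"
    if pat_hits > mat_hits:
        return "paternal"
    return "unassigned"  # no informative k-mers, or a tie
```

In use, you would call `parent_specific_kmers` once on the parental data, then `bin_read` on each long read from the offspring, and assemble the "maternal" and "paternal" bins separately. What to do with "unassigned" reads (for example, adding them to both bins) is a design choice this sketch leaves open.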

ADD REPLY
