Question

When should you de novo assemble a whole genome, and when should you simply align the reads to a reference?

1

7.6 years ago

olavur ▴ 150

Say I have sequenced the whole genome of an individual or a set of individuals. I have some task at hand and need to decide whether to just align the reads to a reference, e.g. GRCh38, or to de novo assemble each genome. I imagine both methods have pros and cons, so the right choice depends on the task at hand. Is this the case? What are the differences?

assembly alignment whole-genome • 3.4k views

ADD COMMENT • link updated 7.6 years ago by Istvan Albert 102k • written 7.6 years ago by olavur ▴ 150

1

What task do you want to accomplish? It really depends on the task. For example, de novo assembly will make you lose information about variants and depth, whereas read alignment may lead to mistakes if you have a lot of repeat regions.

Many studies use the two approaches in parallel.

ADD REPLY • link 7.6 years ago by vmicrobio ▴ 290

0

Ok, so de novo assembly is good for finding, for example, structural variants and large CNVs, but alignment to a reference is better for SNPs, small indels and CNVs.

Can you elaborate a little bit on how information about depth is lost?

Using both approaches in parallel makes a lot of sense, if one wants to find as many types of variation as possible.

ADD REPLY • link 7.6 years ago by olavur ▴ 150

1

De novo assembly will give you multi-FASTA files containing your contigs. You'll have larger fragments, but without information about variation and depth at each position. However, you can retrieve this information from the alignment you do in parallel.
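To make the point about depth concrete, here is a minimal Python sketch. It is a toy, not how any real assembler or aligner works: the reference, the reads and the `pileup` helper are all made up for illustration, and alignments are assumed to be ungapped with known start positions.

```python
from collections import Counter

def pileup(reference_length, aligned_reads):
    """Count depth and observed bases per reference position.

    aligned_reads: list of (start, sequence) pairs, 0-based, ungapped.
    (Hypothetical helper for illustration only.)
    """
    depth = [0] * reference_length
    bases = [Counter() for _ in range(reference_length)]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < reference_length:
                depth[pos] += 1
                bases[pos][base] += 1
    return depth, bases

# Made-up reference and reads; the third read carries a G at position 5.
reference = "ACGTACGTAC"
reads = [(0, "ACGTA"), (2, "GTACG"), (3, "TAGGT"), (5, "CGTAC")]

depth, bases = pileup(len(reference), reads)

# A contig-like consensus keeps only the most common base per position,
# so the minority G at position 5 and all depth values are dropped.
consensus = "".join(counts.most_common(1)[0][0] for counts in bases)

print(depth)           # per-position depth, available from the alignment
print(dict(bases[5]))  # both alleles seen at position 5
print(consensus)       # single sequence, no depth or minority allele
```

The alignment-based pileup keeps both the coverage and the minority allele, while the consensus string, which is roughly what a contig in a FASTA file gives you, keeps neither.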

ADD REPLY • link 7.6 years ago by vmicrobio ▴ 290

score 3 · Answer 1 · 2017-06-28

3

7.6 years ago

Istvan Albert 102k

De novo assembly would be used primarily when we expect to see large-scale variations that are not present in the reference genome.

Alternatively, it would be used when the variations are such that aligning the reads to the reference genome would produce confusing or ambiguous alignments, from which we would be unable to correctly reconstruct the original sequence.

ADD COMMENT • link 7.6 years ago by Istvan Albert 102k

0

What would cause reads to produce confusing or ambiguous alignments? I take it you're not just talking about structural variants and large CNVs here?

ADD REPLY • link 7.6 years ago by olavur ▴ 150

1

Hi,

With shorter reads (e.g. Illumina), reads sometimes get placed randomly because of high similarity between genes. For example, if two homologues share 99% identity, the mapper cannot tell which copy a read should be aligned to and therefore picks one at random. As a result, it looks as if the two homologues might have heterozygous SNPs, but Sanger sequencing will confirm that this is not the case. With longer reads (e.g. PacBio) this particular issue is not present, as the reads span the whole ORF.
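A small Python sketch of that effect, under heavy simplifying assumptions: `geneA` and `geneB` are made-up 20 bp sequences that differ at a single site, and `best_hits` is a hypothetical brute-force ungapped matcher, not a real read mapper.

```python
import random

def best_hits(read, loci):
    """Return the names of the loci where the read fits with the fewest
    mismatches, trying every ungapped offset. (Illustrative helper only.)"""
    scores = {}
    for name, seq in loci.items():
        scores[name] = min(
            sum(a != b for a, b in zip(read, seq[i:i + len(read)]))
            for i in range(len(seq) - len(read) + 1)
        )
    lowest = min(scores.values())
    return sorted(name for name, score in scores.items() if score == lowest)

# Two made-up homologues, identical except for position 15 (G vs T).
loci = {
    "geneA": "ATGGCTAGCTTGACCGATCG",
    "geneB": "ATGGCTAGCTTGACCTATCG",
}

short_read = loci["geneA"][2:10]  # 8 bp, does not cover position 15
long_read = loci["geneA"][2:18]   # 16 bp, covers position 15

short_hits = best_hits(short_read, loci)
long_hits = best_hits(long_read, loci)

# With several equally good placements, one is picked arbitrarily.
placed_at = random.choice(short_hits)

print(short_hits)  # both genes are equally good for the short read
print(long_hits)   # only the true source gene for the long read
print(placed_at)
```

The short read fits both copies perfectly, so wherever it ends up is a coin toss; the longer read covers the one distinguishing site and can only come from `geneA`.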

ADD REPLY • link 7.6 years ago by yeastngs ▴ 10