Question

Assembly with repetitive regions

3.3 years ago

msrch • 0

Hello all!

I am assembling a synthetic genome sequenced with Oxford Nanopore. The problem is that the assembly I obtain is highly repetitive: all the contigs look like ATATATATATATATATAT. I do not understand why, because when I repeat the same process with the Acinetobacter pittii genome, the assembly looks normal and is similar to the reference.

I am new to Oxford Nanopore and assemblies, and although I have read the papers, I cannot understand why this is happening. Is it because the contigs only overlap in repetitive regions, and then the consensus can only use these regions to build the assembly?

Thank you in advance for any help

Flye Assembly Canu SPAdes • 1.5k views

Could you specify what "synthetic genome" means in your case?

3.3 years ago by Michael 55k

By synthetic genome I mean a set of synthetic DNA sequences used for data storage. The reference consists of 42,000 sequences, each 120 bp long. The dataset comes from this publication: https://www.researchsquare.com/article/rs-27205/v1 and this GitHub repository: https://github.com/helixworks-technologies/dos

It is the 3xr6 dataset in the repository.

3.3 years ago by msrch • 0

score 2 · Accepted Answer · 2021-08-24

You cannot, and do not need to, "assemble" these artificial sequences with a genome assembler, because they violate the assumptions that genome assembly relies on. The basic assumption is that the sequenced fragments are randomly distributed partial (or even complete) sub-sequences of a set of larger, distinct entities: replicons (e.g. chromosomes). Identical stretches shared between fragments (overlaps, consensus) therefore either come from the same location on the same replicon, in which case they can be used to stitch the original replicon back together, or they result from sequence duplications or repeats.
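To make that assumption concrete, here is a minimal toy sketch of overlap-based stitching. The sequences and the `longest_overlap` helper are invented for illustration; real assemblers such as Flye or Canu use far more elaborate, error-tolerant overlap detection.

```python
def longest_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b` (0 if none)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


# Hypothetical "replicon" and two reads sampled from overlapping positions on it.
replicon = "ACGTTGCAAGGCTTACCGATAG"
read_1 = replicon[0:12]
read_2 = replicon[6:18]

# The reads share a stretch because they cover the same location,
# so the overlap can be used to stitch them back together.
overlap = longest_overlap(read_1, read_2)
merged = read_1 + read_2[overlap:]
print(overlap, merged == replicon[:18])

# Two independently designed sequences (hypothetical payloads) do not come
# from a common replicon, so there is nothing meaningful to stitch.
oligo_x = "GGATCCTAGA"
oligo_y = "TTCAGGACAT"
print(longest_overlap(oligo_x, oligo_y))
```

The first pair can be merged because both reads were sampled from one larger molecule; the second pair has no such shared origin, which is the situation for a pool of independent oligos.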

For your data, none of these assumptions hold:

There is no larger "genome"; every sequence is an artificial construct.
All sequences are shorter than the read length of the sequencing machine and can therefore be recovered in full. Additional coverage can, and should, be used to error-correct each fragment by consensus.
Identical stretches are design artifacts and say nothing about a shared origin on a replicon:

All sequences in the 3xr6 oligo pool contain the same forward and reverse priming regions for PCR-compatibility.

These priming regions will most likely be removed entirely, because the very high coverage of identical sequence can be interpreted as adapter contamination. Furthermore, once those "adapters" are trimmed, the manuscript states that the remaining sequences are all unique (orthogonal, if I understood that correctly) and would therefore provide no further consensus information.
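As a rough illustration of what per-fragment error correction by consensus could look like instead of assembly, here is a minimal sketch. Everything in it is an assumption made for the example: the primer strings `FWD` and `REV`, the toy payload, and the simplification that the reads of one oligo are already grouped and contain substitution errors only. Real nanopore reads contain insertions and deletions, so an actual pipeline would need alignment-based clustering and consensus; this is not the method of the publication.

```python
from collections import Counter

# Hypothetical primer sequences, not the ones used in the 3xr6 pool.
FWD = "ACGTAC"
REV = "TTGCAG"


def trim_primers(read, fwd=FWD, rev=REV):
    """Return the payload between the shared primers, or None if a primer is missing."""
    if len(read) >= len(fwd) + len(rev) and read.startswith(fwd) and read.endswith(rev):
        return read[len(fwd):len(read) - len(rev)]
    return None


def majority_consensus(payloads):
    """Column-wise majority vote over equal-length payloads (substitutions only)."""
    if not payloads:
        raise ValueError("need at least one payload")
    if any(len(p) != len(payloads[0]) for p in payloads):
        raise ValueError("payloads must have equal length in this simplified sketch")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*payloads))


# Four toy reads of the same oligo; three carry one substitution each.
reads = [
    FWD + "GATTACAGGC" + REV,
    FWD + "GATTTCAGGC" + REV,
    FWD + "GATTACAGGA" + REV,
    FWD + "CATTACAGGC" + REV,
]

payloads = [p for p in (trim_primers(r) for r in reads) if p is not None]
consensus = majority_consensus(payloads)
print(consensus)
```

The extra coverage is used only within each group of reads that represent the same oligo; no information is shared between different oligos, which matches the point that the payloads are unique.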

Thus, whatever genome assembly method is used on the data, the result is moot because the input is not a genome.