Question

Using other individuals and related species to improve a de novo genome assembly

1

7 months ago

Ed 991 ▴ 10

Hi all - I have a question about how to generate a "good enough" genome assembly for cross-species comparative genomics. For some species, the only sequencing data I have available is low-coverage (around 20X) 150 bp paired-end Illumina reads. I do have sequencing data from two different, closely related individuals though, and several good-quality assemblies are available for closely related species. I have tried SPAdes (after quality control etc.), but the assembly is extremely fragmented, with a very low BUSCO score (around 20% C, 40% F), which is what one would expect given the low coverage. I could try alternative assemblers (SOAPdenovo2, ABySS, MaSuRCA etc.), but have no reason to believe the results would be any better.

Is there a way to use the sequencing data from the other related individual and/or the reference sequences from closely related species to improve my assembly? The genome I want to generate an assembly for is a mollusc genome with an expected size of around 1.5Gb. I have tried to find information about reference-guided genome assembly, but nothing seems to quite fit my particular case. Unfortunately, generating better sequencing data from the species in question will not be possible, and it would be disappointing not to be able to use the data available!

Thanks very much - any help and suggestions would be appreciated

Illumina • 795 views

updated 7 months ago by Darked89 4.7k • written 7 months ago by Ed 991 ▴ 10

score 2 · Answer 1 · 2025-03-05

7 months ago

Buffo ★ 2.4k

I do not consider using sequencing data from closely related species to be a good idea for genome assembly. It might add complexity to the assembly graph, leading to an even more fragmented assembly or to chimeric contigs. If the genomes are syntenic (yours and the "good-quality" ones), I suggest using RagTag to scaffold your contigs against a reference and reduce fragmentation. This would also require extra validation, but it might be worth doing.
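One way to check whether reference-guided scaffolding actually reduces fragmentation is to compare N50 before and after. The sketch below is a minimal, dependency-free N50 calculation in Python, not part of RagTag itself. The file names in the usage comment are hypothetical, and the invocation I have in mind (`ragtag.py scaffold <reference.fa> <assembly.fa> -o <outdir>`) should be checked against the documentation for the installed RagTag version.

```python
from typing import Iterable, Iterator


def fasta_lengths(lines: Iterable[str]) -> Iterator[int]:
    """Yield the length of each sequence found in FASTA-formatted lines."""
    length = None  # stays None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths: Iterable[int]) -> int:
    """Return the length L such that sequences of length >= L
    together contain at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file names):
    #   python n50.py spades_contigs.fasta ragtag_scaffolds.fasta
    for path in sys.argv[1:]:
        with open(path) as handle:
            seq_lengths = list(fasta_lengths(handle))
        print(path, len(seq_lengths), "sequences, N50 =", n50(seq_lengths))
```

Note that gap characters (`N`) in scaffolds are counted as bases here, so scaffold N50 will look somewhat better than the underlying contig N50. Scaffolding only orders and orients existing contigs; it does not add sequence, so a higher N50 by itself says nothing about gene completeness.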

0

Thanks - that definitely looks like something worth looking into. Will give it a try!

7 months ago by Ed 991 ▴ 10

score 1 · Answer 2 · 2025-03-07

If you still get a very fragmented assembly after trying various assemblers, then your data is probably not good enough for the genome in question, be it because of coverage, repeats, etc. I think there is no easy way out except adding some PacBio or Nanopore reads to bridge the gaps or, if that is not an option, long-insert Illumina reads if you do not already have them.

As for the genomes of the closely related species, you may try mapping your reads to them to get an idea of which genome is closest (assuming they are fairly complete) and, with some luck, obtain regions/contigs to which your reads map well. I have a fondness for the LAST aligner, which can handle anything from short-read mapping to genome-to-genome alignment.
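The comparison of mapping results could be summarised roughly as follows. This is only a sketch under some assumptions: each mapping run has been written out (or converted) to SAM-format text, and the reference labels passed to `rank_references` are placeholders. It counts primary alignment records and reports the fraction that mapped, using the flag bits defined in the SAM specification (0x4 unmapped, 0x100 secondary, 0x800 supplementary).

```python
from typing import Dict, Iterable, List, Tuple

# Flag bits from the SAM specification
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def mapped_fraction(sam_lines: Iterable[str]) -> float:
    """Fraction of primary SAM records that are mapped.

    Header lines (starting with '@') and secondary/supplementary
    records are skipped, so each read (or mate) is counted once.
    """
    total = 0
    mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second column
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue
        total += 1
        if not flag & UNMAPPED:
            mapped += 1
    return mapped / total if total else 0.0


def rank_references(fractions: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort reference labels by mapped fraction, best first."""
    return sorted(fractions.items(), key=lambda item: item[1], reverse=True)
```

The mapped fraction is a crude proxy for relatedness: it also depends on how complete each reference assembly is and on the mapper settings, so small differences between references should not be over-interpreted.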

Last but not least (no idea about the ploidy of your genome): is there any way to decrease the complexity of the DNA you are sequencing? I guess getting a haploid genome of the species may not be trivial or even possible? Flow-sorting chromosomes? Enriching for non-repetitive fragments?