Question

Basic de novo assembly of plant genome

3 months ago

Lemonhope ▴ 10

Hey everyone

denovo, assembly, illumina, short read, plant, genome, beginner

I have a question regarding the de novo assembly of a plant genome.

So I have 13 Gb of paired-end Illumina short-read sequencing data (about 10x coverage) for a diploid plant species. Not much is known about the genetics of the plant, and my research aims to elucidate some basic genome characteristics.

I want to generate a very basic de novo reference genome for future work, but I do not need to annotate it. I just want to assemble it and use it for alignment if necessary. I have already done this using Megahit to produce a reference file.

My question is: will this cause any problems downstream? I know that Megahit is mostly used for metagenome assembly, but sources state that it can also be used for a basic assembly of a single genome, and I interpreted that to include any short-read sequencing data.

What do you guys think? Should I try another assembler? If so, what would you recommend for low-coverage plant genomes?

Lemonhope

Assembly • 810 views

ADD COMMENT • link updated 3 months ago by lieven.sterck 15k • written 3 months ago by Lemonhope ▴ 10

Hi, so it seems you have an idea of the size of the genome, right? Do you also know if it's a selfing species? The last assembler we successfully used with Illumina data was Platanus; there's a nice tutorial here: https://bioinformaticsworkbook.org/dataAnalysis/GenomeAssembly/Arabidopsis/AT_platanus-genome-assembly.html

ADD REPLY • link 3 months ago by b.contreras.moreira ▴ 370

I do have an idea of the genome size, yes, but I don't know whether the species is able to self.

Thank you for the recommendation!

ADD REPLY • link 3 months ago by Lemonhope ▴ 10

Hey, Lemonhope here.

So maybe I was unclear in my original post, sorry, it was a bit late when I posted! My goal is simply to have a set of scaffolds that could be used to work out some basic genome characteristics. For example, I was wondering if I could run BUSCO on it to assess how many conserved genes can be identified. I was also wondering if it could serve as a pseudo-reference for calling SNPs, as I have a large amount of ddRAD data that I could align to it.

My question is essentially: "would Megahit be an acceptable assembler given these goals?"

Lemonhope

ADD REPLY • link 3 months ago by Lemonhope ▴ 10

score 2 · Answer 1 · 2024-11-06

3 months ago

lieven.sterck 15k

Hi,

10x coverage is really at the very low end for de novo genome assembly. Consequently, the results will not be of high quality (more likely low quality). Depending on your goals, that might be just sufficient or not sufficient at all.

With such a modest input dataset, you could consider running a few assembly tools and comparing the results; there are numerous others for short-read assembly (ABySS, SPAdes, SOAPdenovo, ...).
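One way to compare the outputs of several assemblers is to compute the same summary statistics (contig count, total length, longest contig, N50) on each FASTA file. Below is a minimal Python sketch of that idea; the function names `read_fasta_lengths` and `assembly_stats` are made up for illustration and do not come from any of the assemblers mentioned here. Dedicated assessment tools report far more than this, so treat it as a rough first look only.

```python
from typing import Dict, Iterable, List


def read_fasta_lengths(lines: Iterable[str]) -> List[int]:
    """Return the length of every sequence found in FASTA-formatted lines."""
    lengths: List[int] = []
    current = 0
    seen_header = False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if seen_header:
                lengths.append(current)
            seen_header = True
            current = 0
        elif seen_header:
            current += len(line)
    if seen_header:
        lengths.append(current)
    return lengths


def assembly_stats(lengths: List[int]) -> Dict[str, int]:
    """Contig count, total length, longest contig and N50 for one assembly."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        # N50: length of the contig at which the running sum (longest first)
        # reaches at least half of the total assembly length.
        if running * 2 >= total:
            n50 = length
            break
    return {
        "contigs": len(ordered),
        "total_bp": total,
        "longest_bp": ordered[0] if ordered else 0,
        "n50_bp": n50,
    }


if __name__ == "__main__":
    # Toy input with contigs of 8, 6, 4 and 2 bp (hypothetical data).
    demo = [">a", "ACGTACGT", ">b", "ACGTAC", ">c", "ACGT", ">d", "AC"]
    print(assembly_stats(read_fasta_lengths(demo)))
    # {'contigs': 4, 'total_bp': 20, 'longest_bp': 8, 'n50_bp': 6}
```

For a real assembly you would pass an open file handle, e.g. `read_fasta_lengths(open("contigs.fa"))`, where the file name is a placeholder for whatever your assembler wrote.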

So your genome has an approximate size of > 1 Gb and you have diploid data (so not even from haploid tissue)? If so, then 10x is really at the low end. Plant genomes of that size will also have a substantial fraction of repetitive sequences, making them even harder to assemble. The diploidy of your data will also make assembly harder, especially if there is substantial heterozygosity; if it's (near) homozygous, this is less of an issue.
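For reference, the size estimate here is just the usual back-of-the-envelope relation: coverage = total sequenced bases / genome size. A small Python sketch of that arithmetic, using the 13 Gb and 10x figures from the question; the function names are hypothetical, and the 50x target is only an illustrative value, not a recommendation.

```python
def coverage(total_bases: float, genome_size: float) -> float:
    """Average sequencing depth: sequenced bases divided by genome size."""
    return total_bases / genome_size


def implied_genome_size(total_bases: float, depth: float) -> float:
    """Genome size implied by a given amount of data at a given depth."""
    return total_bases / depth


def bases_needed(genome_size: float, target_depth: float) -> float:
    """Amount of sequence data required to reach a target depth."""
    return genome_size * target_depth


if __name__ == "__main__":
    total = 13e9  # 13 Gb of reads, as stated in the question
    size = implied_genome_size(total, 10)  # 10x, as stated in the question
    print(f"implied genome size: {size / 1e9:.1f} Gb")
    # Illustrative target depth only (assumption, pick your own).
    print(f"data needed for 50x: {bases_needed(size, 50) / 1e9:.0f} Gb")
```

Note that this is the average depth over the whole genome; in heterozygous regions of a diploid, each haplotype is covered by only part of those reads, which is one reason low coverage hurts more with diploid data.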

ADD COMMENT • link 3 months ago by lieven.sterck 15k

It's not the assembler that is the issue here (you could just as well go for any other) but it's the amount of input data. It's too low to get any decent assembly out of it.

For SNPs, the resulting assembly might be just sufficient, but again, it will hardly give you a reliable overview of the whole genome.

For annotation purposes it will likely be insufficient, as any annotation process will be heavily influenced by the quality of the assembly.

ADD REPLY • link 3 months ago by lieven.sterck 15k

score 1 · Answer 2 · 2024-11-06

You'll struggle to get a genome out of 10X coverage Illumina data. Even with 50X Illumina, I doubt you'll get beyond a 20-50 kbp N50 in the best case. If it is diploid material, the assembly will be even worse.

Why? Well, plants tend to have 20-90% repeats, so they are almost impossible to assemble with short reads.

I would advise you to invest in a Nanopore MinION starter pack, or a single PromethION or PacBio sequencing run, if you really want to understand the genome or have anything publishable.