What marks a De-Novo Genome assembly as FAILED?
6 months ago
Umer ▴ 130

Hi Everyone,

A bit of background.

I am working with fungal genomes and need to generate de novo genome assemblies for roughly 45 Illumina samples and 12 Oxford Nanopore Technologies (ONT) long-read samples. The 12 ONT samples are a subset of the 45 Illumina samples.

As soon as I get the sequences, I will be running multiple assemblers:

Illumina: SPAdes, plus others for testing and comparison.

ONT: Canu, Flye, and SPAdes, for hybrid assembly and also for ONT-only assembly.

The question that came up is: what marks a genome assembly as a FAILED assembly for a particular assembler? Are there any specific guidelines or criteria that need to be met?

Your suggestions and thoughts would mean a lot. Thank you.

nanopore denovo illumina assembly genome • 889 views

I would urge you to consider different approaches for benchmarking before deciding on one. In particular, I found that assembling long reads and short reads together (hybrid) doesn't necessarily give you a better assembly.

For the samples where you have long + short reads, I would suggest:

  1. Error-correct the long reads using Canu (correction only, no assembly).
  2. Assemble the long reads using SMARTdenovo (or another dedicated long-read assembler, but I had good results with SMARTdenovo).
  3. Index the resulting assembly using bwa and align the short reads back to it (do all the usual samtools steps to get a sorted, indexed BAM).
  4. Run Pilon (https://github.com/broadinstitute/pilon) to error-correct the assembly.
  5. Run QUAST on the error-corrected assembly, also supplying either the BAM file or the short reads.
  6. Run BUSCO to assess the assembly beyond the usual N50 stats, contig lengths, etc.
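Steps 3 and 4 could look roughly like the commands below. This is only a sketch that builds the command strings in Python without running them. All file names, the thread count, the output prefix, and the location of pilon.jar are placeholders I made up, so adjust them to your setup and check each tool's help for your installed version.

```python
from pathlib import Path


def polishing_commands(assembly, reads_1, reads_2,
                       pilon_jar="pilon.jar", threads=8, out_prefix="polished"):
    """Build shell commands for steps 3 and 4: bwa -> samtools -> Pilon.

    Every path and default here is a placeholder, not a recommendation.
    """
    bam = f"{Path(assembly).stem}.sorted.bam"
    return [
        # Step 3a: index the long-read assembly
        f"bwa index {assembly}",
        # Step 3b: align paired short reads and sort the alignments
        f"bwa mem -t {threads} {assembly} {reads_1} {reads_2} "
        f"| samtools sort -@ {threads} -o {bam} -",
        # Step 3c: index the sorted BAM (Pilon needs an indexed BAM)
        f"samtools index {bam}",
        # Step 4: polish the assembly with the short-read alignments
        f"java -jar {pilon_jar} --genome {assembly} --frags {bam} "
        f"--output {out_prefix}",
    ]


if __name__ == "__main__":
    for cmd in polishing_commands("smartdenovo.fa",
                                  "illumina_R1.fastq.gz",
                                  "illumina_R2.fastq.gz"):
        print(cmd)
```

Printing the commands first makes it easy to review them before pasting them into a shell or a workflow manager.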

For the samples with no long reads, SPAdes usually produces good assemblies for the size of genome you are dealing with. For more heterozygous genomes, Platanus has also produced good assemblies for me in the past. Once you have finished assembling the short-read-only samples, you should run steps 5 and 6 on them as well.

In terms of assessing your assembly and whether it has "failed", there is no single metric from the stats that would tell you that. I don't like judging assemblies on contiguity stats alone because they can be misleading and they tell you nothing about how well the reads assembled, i.e. the biology of the assembly. That is where BUSCO comes in, so in my case I would sacrifice N50, contig lengths, etc. for higher BUSCO scores.
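For reference, N50 is the contig length at which the contigs of that length or longer cover at least half of the total assembly. The toy example below (contig lengths are invented) shows why it only measures contiguity: wrongly joining two contigs raises N50 even though the assembly got worse.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


if __name__ == "__main__":
    honest = [100, 80, 50, 30, 20, 10, 10]   # invented contig lengths
    misjoined = [180, 50, 30, 20, 10, 10]    # same bases, 100 and 80 wrongly joined
    print(n50(honest))      # 80
    print(n50(misjoined))   # 180
```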

If however you get a significantly lower genome length than what you expect, there is a good chance that you either have low coverage, or perhaps high contamination (which obviously reduces your coverage even when the total number of reads is high).
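A quick way to screen for this is to compare the total assembled length against the expected genome size. The sketch below assumes a hypothetical expected size of 50 Mb and an arbitrary cutoff of 80%; neither number is a standard, so pick values that make sense for your organism.

```python
def length_ratio(contig_lengths, expected_size):
    """Total assembled bases divided by the expected genome size."""
    return sum(contig_lengths) / expected_size


def is_suspiciously_short(contig_lengths, expected_size, min_ratio=0.8):
    """Flag an assembly much shorter than expected.

    min_ratio=0.8 is an arbitrary illustrative cutoff, not a published threshold.
    """
    return length_ratio(contig_lengths, expected_size) < min_ratio


if __name__ == "__main__":
    expected = 50_000_000                  # hypothetical expected genome size
    contigs = [20_000_000, 15_000_000]     # invented contig lengths, 35 Mb total
    print(is_suspiciously_short(contigs, expected))
```

A flag here is only a prompt to check coverage and contamination, not proof that the assembler failed.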

One caveat to the method I suggested is that you need higher coverage from the long reads (over 30x) for it to yield good results. If you don't have that, you can stick to the hybrid approach, where you co-assemble the long and short reads together.
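As a rough illustration of that rule of thumb: mean coverage is total sequenced bases divided by genome size, and the choice follows from it. The function names and the example numbers are made up for this sketch, and the 30x cutoff is only my own experience, not a hard limit.

```python
def mean_coverage(total_bases, genome_size):
    """Mean depth of coverage: sequenced bases per genome base."""
    return total_bases / genome_size


def choose_strategy(long_read_bases, genome_size, min_long_coverage=30):
    """Pick an approach from long-read coverage (30x is a rule of thumb)."""
    if mean_coverage(long_read_bases, genome_size) > min_long_coverage:
        return "long-read assembly, then short-read polishing"
    return "hybrid assembly"


if __name__ == "__main__":
    genome_size = 50_000_000  # hypothetical genome size in bases
    print(choose_strategy(3_750_000_000, genome_size))  # 75x long-read coverage
    print(choose_strategy(1_000_000_000, genome_size))  # 20x long-read coverage
```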

Hope this helps and all the best.


Hi, thank you for the detailed response.

Let me add some more information. Long-read coverage is ~75x. Short-read coverage is ~100x.

For the short-read samples, I am planning to go with SPAdes, as I got good results: a reasonable number of contigs, good assembly stats, and BUSCO scores above 90%.

For the long-read samples, the initial plan was to do a hybrid assembly from the start, but based on your reply, and since I have higher coverage, I will also test your approach. I just need some clarifications.

  1. Error-correct (correction only, no assembly) the long reads using Canu. (I understand this part.)
  2. Assemble the long reads using SMARTdenovo. (For this, should I be using the error-corrected long reads?)
  3. I will be doing assembly correction using both Pilon and Racon.
  4. First: Pilon polishing using the Illumina short-read data.
  5. Second: Racon polishing using the raw long-read data. (Not sure whether this is necessary?)

What do you think of this approach, specifically the polishing part with Pilon and Racon?

  1. Are both necessary?
  2. If yes, should I use error-corrected reads for Racon polishing?
  3. SPAdes does some error correction. Should I use those corrected reads for the Pilon polishing step, or the raw Illumina FASTQ data?

Thanks.


I'm a little confused about your experimental design. Are you making 45 different assemblies? Or are all the samples from the same individual? Or are you making a pangenome assembly?

What do you mean by "fail"? The tools will likely emit even a very fractured assembly given bad data.


Yes. We are sequencing 45 different samples with Illumina. 12 of these are also going to be sequenced with Nanopore (for hybrid assembly). A pangenome is also a future target.

The current goal is to sequence the strains, check whether any of them have chromosomes that are not core chromosomes, look for pathogenicity-related genes, etc.

By "fail", what I want to ask is:

  1. An assembler will output an assembly as long as it doesn't hit an argument or technical error. It will just run and output an assembly.
  2. Say the estimated genome size is ~50 Mb. After getting the assembly stats, which stats can mark the assembly as failed? Is it contig length, N50, or self coverage that will eventually mark an assembly as a failed assembly for a particular assembler?

I'm not sure you'll be able to identify whole chromosomes that are not core from Illumina reads alone. That might be easier to do with a microscope. Illumina reads alone will not result in a chromosome-level assembly. You'll likely get an okay idea of unique contigs, though.

You could also check for BUSCOs to give you an idea of genome completeness. But if you're expecting extra chromosomes in some strains, then surely your size estimate wouldn't really be that useful. Or am I missing something?


I know Illumina alone will only give me high-quality contigs. The samples that are sequenced with both long and short reads will be used as references to further join the contigs, or at least to identify the accessory and core contigs.
