MIRA runs yielding different assemblies with same input data and parameters
9 months ago
btc347 • 0

Hi everyone,

I am currently doing de novo assembly of metavirome samples, and I've been using MIRA to assemble contigs from some samples with lots of repeats where metaviralSPAdes was failing to assemble anything. However, I'm running into an issue with MIRA where multiple runs of my pipeline with the same input data and parameters are giving me different assemblies. Is this normal behavior?

My pipeline is as follows, using my sample "Farm-3" as an example. First, I used bbnorm to downsample my data to approximately the average depth that MIRA recommends for Illumina data (~80x). I then ran MIRA with the following config file:

  • project = Farm-3
  • job = genome,denovo,accurate
  • readgroup = DataIlluminaPairedEnd500Lib
  • data = Farm-3_R1_001_clean_norm.fastq Farm-3_R2_001_clean_norm.fastq
  • technology = solexa
  • rename_prefix = M06453:23:000000000-K253L Farm-3_
  • template_size = 5 2000 autorefine
  • segment_placement = ---> <---
  • parameters = COMMON_SETTINGS -NW:cac=warn

Using these parameters, MIRA was able to generate contigs from my samples that were not being assembled by metaviralSPAdes. Below are some statistics from the "info_assembly.txt" file output from MIRA for "Farm-3".

  • Num. reads assembled: 76056
  • Num. singlets: 0
  • Large contigs:
  • Number of contigs: 4
  • Total consensus: 202436
  • Largest contig: 195702
  • N50 contig size: 195702

Additionally, one of the contigs belonged to a large virus (~195 kb) which I was told was likely to be present in these samples. So everything looked good. However, when I later reran MIRA with the same parameters on the same downsampled data (written to a different output directory), I got a different number of contigs for "Farm-3" than in my prior run. The ~195 kb contig was also no longer present in the output:

  • Num. reads assembled: 76029
  • Num. singlets: 0
  • Large contigs:
  • Number of contigs: 6
  • Total consensus: 203027
  • Largest contig: 102502
  • N50 contig size: 102502
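To make the difference between the two runs explicit, the two sets of statistics above can be compared programmatically. This is a minimal sketch that parses only the excerpts quoted in this post; the helper names `parse_stats` and `diff_stats` are illustrative, and a full "info_assembly.txt" repeats some keys across sections, so it would need section-aware parsing.

```python
def parse_stats(text):
    """Parse 'key: value' lines into a dict, keeping integer values only."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        # Section headers such as "Large contigs:" have no value and are skipped.
        if sep and value.isdigit():
            stats[key.strip()] = int(value)
    return stats


def diff_stats(a, b):
    """Return {key: (a_value, b_value, b_value - a_value)} for differing keys."""
    return {k: (a[k], b[k], b[k] - a[k]) for k in a if k in b and a[k] != b[k]}


# Values copied from the two runs reported in this post.
RUN1 = """\
Num. reads assembled: 76056
Num. singlets: 0
Large contigs:
Number of contigs: 4
Total consensus: 202436
Largest contig: 195702
N50 contig size: 195702
"""

RUN2 = """\
Num. reads assembled: 76029
Num. singlets: 0
Large contigs:
Number of contigs: 6
Total consensus: 203027
Largest contig: 102502
N50 contig size: 102502
"""

changes = diff_stats(parse_stats(RUN1), parse_stats(RUN2))
for key, (first, second, delta) in changes.items():
    print(f"{key}: {first} -> {second} ({delta:+d})")
```

The total consensus length barely changes between the runs while the largest contig roughly halves, which suggests that the second run split the ~195 kb sequence rather than lost it. That is only an inference from the summary numbers.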

Looking through the MIRA logfiles for both runs of "Farm-3", I see that both MIRA runs performed 5 passes. Notably, the ~195 kb contig is identified in the first pass by both runs. It is identified twice more in subsequent passes in the first "successful" run, but it is not identified again in the second run. Additionally, for both runs, the input data numbers (reads, used bases, GC content etc.) are basically identical prior to the beginning of the first pass. After the first pass, these numbers begin to diverge slightly.
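Because the input numbers are described as "basically identical" rather than identical, it may be worth confirming that both runs really read byte-identical files. Below is a minimal sketch using SHA-256 checksums from the Python standard library. The function names are illustrative, and it assumes each run keeps its own copy of the input FASTQ files and manifest.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fingerprint(paths):
    """Map each file name (without its directory) to its SHA-256 digest."""
    return {Path(p).name: sha256_of(p) for p in paths}


def same_inputs(paths_a, paths_b):
    """True if both runs used files with the same names and identical contents."""
    return fingerprint(paths_a) == fingerprint(paths_b)
```

For example, `same_inputs(["run1/Farm-3_R1_001_clean_norm.fastq"], ["run2/Farm-3_R1_001_clean_norm.fastq"])` would compare the R1 files of two runs; the `run1/` and `run2/` directories are hypothetical. If the checksums match, the inputs can be ruled out as the source of the divergence.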

So long story short, I'm getting different results from MIRA despite using the same input data and parameters. Based on my (very limited) understanding of MIRA, I didn't think there was anything about its algorithm that would cause differences between runs (though again, I could be completely wrong there). Does anyone have experience with this, or recommendations for how to proceed? I know I could just take my assembled contigs and use them downstream, but obviously I'd like my analysis to be reproducible.
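One way to quantify reproducibility beyond the summary statistics is to check which contigs are exactly shared between two runs, ignoring contig names and strand. This is a minimal sketch, assuming unpadded FASTA output from each run; the function names are illustrative and are not part of MIRA. It matches sequences exactly, so a contig that differs by a single base counts as different.

```python
# Complement table covering IUPAC ambiguity codes, which consensus sequences may contain.
_COMPLEMENT = str.maketrans("ACGTRYKMBDHVSWN", "TGCAYRMKVHDBSWN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_key(seq):
    """Strand-independent key: the lexicographically smaller of seq and its reverse complement."""
    seq = seq.upper()
    return min(seq, reverse_complement(seq))


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def compare_assemblies(fasta_a, fasta_b):
    """Count contigs shared by both assemblies and contigs unique to each."""
    keys_a = {canonical_key(seq) for _, seq in read_fasta(fasta_a)}
    keys_b = {canonical_key(seq) for _, seq in read_fasta(fasta_b)}
    return {
        "shared": len(keys_a & keys_b),
        "only_a": len(keys_a - keys_b),
        "only_b": len(keys_b - keys_a),
    }
```

Calling `compare_assemblies(open(path_a), open(path_b))` on the unpadded contig FASTA files of two runs returns the three counts. For contigs that differ only by a split or a few bases, aligning the two assemblies against each other would be more informative than exact matching.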

I can also try and link the MIRA logfiles for both runs somewhere if that would be useful for anyone.

Thanks!
