3a is indeed better, but I didn't mention or emphasize that it was an Illumina-only assembly and 3b and 3c are after adding 10x PacBio reads with PBJelly but 3b is with two-rounds of Pilon polishing and 3c is with one-round of Racon polishing. Doing further experimentation, I find that the variant caller Callvariants.sh from the BBMap/BBTools suite out-performs one-round of Pilon in terms of RNA-Seq mapping rates.
I am experimenting with that right now to determine whether Racon or Pilon polishing "is better" for an assembly where PacBio reads were used to fill in gaps. Briefly, I get similar assembly statistics, RNAseq mapping rates, indel rates (based on cigar strings and Illumina short-read alignment), and BUSCO scores with Racon and Pilon with a single round of polishing using either program.
Are you dealing with a large genome (> 2Gbp)? If so, running Pilon with default settings is trickier as it requires almost 1GB RAM per 1 MBp of sequence (so 2TB RAM by default for 2Gbp genome). I've written some scripts that generate custom "targets" files for each scaffold, so that Pilon runs with less RAM. Racon requires much less memory by default (~500-600GB) for 65x coverage of Illumina short-read data for a 2Gbp genome.
If you are concerned about removing indels, then the best way to determine if Racon or Pilon is better for your case is to run both, then predict proteins and run "ideel" (http://www.opiniomics.org/a-simple-test-for-uncorrected-insertions-and-deletions-indels-in-bacterial-genomes/) to see if you have a greater proportion of truncated proteins in the Racon or Pilon polished assembly. With Racon, you should be fine with one iteration of polishing. With Pilon, most people use two iterations, but I warn you that too many polishing iterations is a bad thing as I found a much higher proportion of truncated proteins with a Pilon assembly that was polished 8 rounds versus one that was polished 2 rounds. Your results may vary, but that is what I found with a 2Gbp mammalian genome and 65x Illumina short-read coverage using Pilon.
The fact about many rounds of Pilon polishing is particularly interesting, since Unicycler (that performs exceptionally for hybrid bacterial assembly) does use this exact strategy (polishing with Pilon until convergence, i.e. there are no more changes happening).
It seems it would be very useful if someone could do a more systematic evaluation. I will try to if I find the time.
So, this not quite a fair comparison (below) because Pilon was run twice, whilst Racon was only run once, but at least for this data set, Pilon seems to be doing better in terms of few truncated proteins, but I suspect that after an additional round of Illumina-polishing that Racon would be equivalent or surpass Pilon in terms of protein "truncatedness".
Figure 3. Histogram of query sequence length (predicted proteins) divided by subject sequence length (UniProt/Trembl proteins) for DIAMOND (Buchfink et al. 2015) mapped MAKER (Holt & Yandell 2011) predicted proteins against UniProt/Trembl release 2018_04 database for: (a) the original mammal genome; (b) the mammal genome after adding PacBio sequencing reads and two rounds of polishing with ~60x Illumina data with Pilon (Walker et al. 2014); and (c) the mammal genome after adding PacBio sequencing reads and one round of polishing with ~60x Illumina data with Racon. All are plotted with hist.R script from https://github.com/mw55309/ideel/blob/master/scripts/hist.R.
Data for Figure 3a.
"There are 21259 protein hits, and 11631 (54.71 %) are between 0.85 and 1.15 (querylength/subjectlength)."
"There are 7703 (36.23 %) protein hits less than 0.85 (querylength/subjectlength)."
Data for Figure 3b.
"There are 32296 protein hits, and 17267 (53.46 %) are between 0.85 and 1.15 (querylength/subjectlength)."
"There are 12716 (39.37 %) protein hits less than 0.85 (querylength/subjectlength)."
Data for Figure 3c.
"There are 32125 protein hits, and 16559 (51.55 %) are between 0.85 and 1.15 (querylength/subjectlength)."
"There are 13239 (41.21 %) protein hits less than 0.85 (querylength/subjectlength)."
Here is a frequency polygon to better see the differences (they are very minor). Ill-only for Illumina-only. Sorry that I don't have an image for the Callvariants.sh comparison, it was for another genome assembly.
? It looks here like 3a is the best result, and that's the one with no polishing.
3a
is indeed better, but I didn't mention or emphasize that it was an Illumina-only assembly and3b
and3c
are after adding 10x PacBio reads with PBJelly but3b
is with two-rounds of Pilon polishing and3c
is with one-round of Racon polishing. Doing further experimentation, I find that the variant callerCallvariants.sh
from theBBMap/BBTools
suite out-performs one-round of Pilon in terms of RNA-Seq mapping rates.