Hi, I have 11x PacBio Sequel coverage for a 2.27 Gbp (k-mer estimated) genome and was wondering whether that is enough to call the consensus sequence with PacBio's Arrow tool. It seems possible (https://github.com/PacificBiosciences/GenomicConsensus/blob/develop/doc/FAQ.rst), but I suspect the coverage may be too low (see the comment about Racon below).
Background: We are improving a mammalian genome that was de novo assembled with ABySS from 65x Illumina 500bp insert reads along with a 5kbp insert library. We then had Dovetail Genomics generate Hi-C and Dovetail Chicago libraries to order and scaffold the assembly further, creating super scaffolds. We then generated 11x PacBio Sequel coverage and filled in gaps in the assembly with PBJelly. Next, we polished the assembly with Pilon (fixing SNPs and indels) using the same Illumina short insert libraries used in the de novo assembly. Then, we filled in gaps of up to 1kbp with ABySS Sealer, again using the same Illumina short insert libraries. Finally, we polished the assembly with a second round of Pilon, fixing SNPs and indels but also filling in gaps.
After predicting proteins with MAKER2, we noticed that many of them are truncated (see http://www.opiniomics.org/a-simple-test-for-uncorrected-insertions-and-deletions-indels-in-bacterial-genomes/). Thus, we are interested in further polishing - we are running Pilon several more times but would like to try Arrow. We tried Racon [https://github.com/isovic/racon] to call consensus with the PacBio reads before our first run of Pilon, but 11x coverage is apparently too low for Racon to work properly (BUSCO completeness and RNAseq mapping success both went down) - hence the concern about using Arrow.
Has anyone used Arrow with 11x PacBio Sequel coverage?
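For anyone wanting to track the truncation problem across polishing rounds, here is a minimal sketch of the kind of check described in the opiniomics post linked above: compare each predicted protein's length with the length of its best database hit and count how many fall short. The function name, the 0.9 cutoff, and the input format (pairs of predicted length and hit length, e.g. pulled from a tabular BLAST or DIAMOND report) are assumptions made for illustration, not part of MAKER2 or any other tool.

```python
def truncation_summary(length_pairs, threshold=0.9):
    """Summarise how many predicted proteins look truncated.

    length_pairs: iterable of (predicted_len, reference_len) tuples, one per
                  predicted protein and its best database hit (hypothetical
                  input format; build it from your own alignment report).
    threshold:    ratio below which a protein is counted as truncated
                  (0.9 is an arbitrary illustrative cutoff).

    Returns (fraction_truncated, ratios).
    """
    # Skip hits with a non-positive reference length to avoid division by zero.
    ratios = [pred / ref for pred, ref in length_pairs if ref > 0]
    if not ratios:
        return 0.0, []
    truncated = sum(1 for r in ratios if r < threshold)
    return truncated / len(ratios), ratios


if __name__ == "__main__":
    # Made-up lengths purely to show the call; not real data.
    example = [(100, 100), (50, 100), (95, 100), (30, 120)]
    fraction, ratios = truncation_summary(example)
    print(f"{fraction:.0%} of proteins fall below the length-ratio cutoff")
```

Running the same summary after each polishing round would give one number to compare between rounds, alongside BUSCO completeness and RNAseq mapping rate.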
You've gone quite far already, so I doubt you'll have much more success using Arrow given the low PacBio coverage. Correcting the PacBio reads first (for example with LoRDEC or proovread) might help. If you're open to an alternative starting point, you might want to try MaSuRCA, which could combine all of your input data. There are also options for merging the resulting assemblies, for example BBTools' Dedupe.
Thank you, I will definitely consider LoRDEC or proovread if subsequent rounds of Pilon do not improve the protein truncatedness (new word ha!) of the assembly. I think additional rounds might help, because after each round of Pilon correction the number of corrected bases (SNPs or indels) goes down by about half.
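As a rough back-of-the-envelope illustration of the halving described above, the sketch below projects how many corrections each further Pilon round would make if the trend continued. The constant decay factor of 0.5 and the starting count are assumptions for illustration only; real rounds will not follow a clean geometric series, and the function names are hypothetical.

```python
def projected_corrections(first_round, rounds, decay=0.5):
    """Projected corrections per round, assuming each round makes
    `decay` times as many corrections as the previous one."""
    return [first_round * decay ** i for i in range(rounds)]


def rounds_needed(first_round, target, decay=0.5):
    """Round number at which projected corrections first drop to or
    below `target`, under the same constant-decay assumption."""
    if first_round <= 0 or target <= 0 or not 0 < decay < 1:
        raise ValueError("need first_round > 0, target > 0 and 0 < decay < 1")
    round_number = 1
    count = float(first_round)
    while count > target:
        count *= decay
        round_number += 1
    return round_number


if __name__ == "__main__":
    # Hypothetical starting count, not taken from the assembly discussed here.
    print(projected_corrections(1_000_000, 4))
    print(rounds_needed(1_000_000, 1_000))
```

Under this assumption the total over all later rounds is a geometric series that sums to roughly the first round's count, so the first pass would account for about half of everything Pilon would ever fix this way.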
Thanks for the table - I vaguely recall reading about the iterative approach with Pilon, but this is a great reminder not to stop after the first pass (I got spoiled recently by very good data for much smaller genomes).