I've included some background info after the questions, which come first in case of TL;DR.
Questions:
Is it generally best practice to run CircularConsensus on SMRT cell DNAseq data before doing an analysis such as scaffolding or genome assembly?
Under what circumstances would you not run CircularConsensus?
(I posted these same questions on SEQanswers.)
Background
Our collaborators sent us 9.4 Gbp (12 SMRT cells) of plant DNA sequence from an RS II (P6-C4 chemistry, I think). We estimate this represents ~20x coverage of the plant's genome. All of the initial processing was done by the collaborators. Their last step was filtering the subreads, which have a post-filter N50 of around 8,000 bp.
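For anyone who wants to check these numbers on their own data, here is a minimal sketch of how a subread N50 and a coverage estimate can be computed. The function names are my own, and the genome size in the usage line is just what the 9.4 Gbp and ~20x figures above imply (9.4 Gbp / 20), not an independently measured value.

```python
def n50(lengths):
    """Return the N50: the length L such that reads of length >= L
    together contain at least half of all sequenced bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no lengths given")


def coverage(total_bases, genome_size):
    """Average depth of coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size


if __name__ == "__main__":
    # Genome size here is only the value implied by the figures in this post.
    print(coverage(9.4e9, 4.7e8))
```

The subread lengths passed to n50 would come from whatever FASTA/FASTQ of filtered subreads you have on hand.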
My Goal
I am trying to use the reads to gap-fill and do additional scaffolding of a draft genome assembly.
Results
I ran PBJelly on the uncorrected subreads, providing the Analysis_results directory for each SMRT cell. The results seem very good: about half of the gaps were filled and the scaffold N50 increased by 20%.
But I suspect that some of the filled gaps, especially those in repetitive regions, are not correct. When I looked at the subread placement over each gap (produced by PBJelly), I noticed that some gaps were filled by only a minority of the subreads from a ZMW, for example 2 of the 9 subreads. There were a few instances like this. It occurred to me that it may have been a mistake to use the subreads rather than consensus sequences.
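To make the check above less manual, here is a rough sketch of how the per-ZMW support could be tallied. It relies on the RS II subread naming convention (movie/holeNumber/start_end), so subreads sharing the first two fields come from the same ZMW. The two input lists are hypothetical: I am assuming you have already pulled the subread names supporting a gap, and the names of all subreads, out of the PBJelly output and the subread files yourself.

```python
from collections import defaultdict


def zmw_of(subread_name):
    """Return 'movie/holeNumber' from a name like 'movie/holeNumber/start_end'."""
    movie, hole = subread_name.split("/")[:2]
    return f"{movie}/{hole}"


def support_by_zmw(supporting_subreads, all_subreads):
    """For each ZMW with at least one supporting subread, return
    (supporting count, total subread count, fraction supporting)."""
    total = defaultdict(int)
    for name in all_subreads:
        total[zmw_of(name)] += 1

    support = defaultdict(int)
    for name in supporting_subreads:
        support[zmw_of(name)] += 1

    result = {}
    for zmw, n_support in support.items():
        n_total = total.get(zmw, 0)
        if n_total:  # skip ZMWs missing from the full subread list
            result[zmw] = (n_support, n_total, n_support / n_total)
    return result
```

A gap whose fill is backed by, say, 2 of 9 subreads from one ZMW would show up with a fraction near 0.22, which makes such cases easy to list and inspect; any cutoff for flagging them would be an arbitrary choice on my part.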
Cross-posted here.