Hi,
What tools do you use or know of for PacBio long read error correction, and why? What are their pros and cons?
Check out proovread.
proovread maps high-coverage short-read data to the PacBio reads (with bwa mem, blasr or daligner) in multiple iterations. This is not the fastest approach (if speed is your concern, go for LoRDEC: http://www.atgc-montpellier.fr/lordec/), but it is the most thorough, giving you the most out of your PacBio data. You can find some comparative stats here.
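If it helps, a basic hybrid-correction run looks roughly like this (a sketch from memory; the file names are placeholders and the exact flags should be checked with proovread --help):

# correct PacBio subreads with an Illumina short-read library (placeholder file names)
proovread -l pacbio_subreads.fq -s illumina_reads.fq --pre corrected/my_sample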
I used proovread to correct PacBio cDNA reads. It worked out of the box.
I tried PacBioToCA and LSC; both were way too slow in my setting.
For me it was also important to get the corrected but untrimmed reads, to retain full-length transcripts. PacBioToCA, for instance, did not provide this option.
When correcting with Illumina RNA-seq short read data it is also helpful to normalize the data first to further speed up the correction. I used normalize-by-median.py of the khmer package.
Works well with proovread, since proovread uses a coverage cutoff anyway and since it prioritizes reads mapping with fewer mismatches.
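As a sketch of that normalization step (k-mer size and coverage cutoff are just typical starting values; the memory/table-size options differ between khmer versions, so check normalize-by-median.py --help):

# digital normalization of the Illumina RNA-seq reads before running proovread
normalize-by-median.py -k 20 -C 20 -o rnaseq.keep.fq rnaseq.fq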
Just for information, the hyperlink for LoRDEC in your post does not seem to work (404 not found).
I've used PBcR in the Celera Assembler package.
It works for hybrid assembly as well as just PacBio assembly (actually, I found self-correction worked better than hybrid correction with the dataset that I worked with)
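For reference, a PBcR run from the wgs-assembler package looks roughly like this (the values are placeholders; for hybrid correction you would additionally pass the short-read .frg files on the command line):

# PacBio self-correction / assembly with the PBcR wrapper (placeholder values)
PBcR -length 500 -partitions 200 -l my_sample -s pacbio.spec -fastq pacbio_subreads.fastq genomeSize=5000000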
If you haven't seen them already, I would also recommend viewing the tutorials on the PacBio website.
I think there are at least one or two talks that review methods for de novo assembly and read correction.
Off the top of my head, I don't recall which assembly tools specifically have an error correction step.
Some de novo assembly tools that I recall include HGAP and MIRA. I think the computer associated with the sequencer comes with a de novo assembly pipeline, which I believe is HGAP. I think MIRA does error correction, but only hybrid correction with Illumina reads.
Quiver can also be used to polish assemblies (so, correct errors post-assembly rather than pre-assembly): https://github.com/PacificBiosciences/GenomicConsensus/blob/master/doc/HowToQuiver.rst
In fact, the HGAP link appears to recommend using Quiver as part of the assembly pipeline.
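From that HowToQuiver document, polishing boils down to aligning the raw reads back to the draft assembly (e.g. with pbalign) and then running something like:

# polish the draft assembly with Quiver using the aligned raw reads
quiver -j8 aligned_reads.cmp.h5 -r draft_assembly.fasta -o variants.gff -o polished_assembly.fasta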
There is "LoRDEC: accurate and efficient long read error correction", it uses DBG short reads to correct erroneous parts in Long reads.
It's a new program for correcting Long rads.
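A typical run looks like this (the k-mer size and solidity threshold are commonly used starting values and usually need tuning for your data):

# correct PacBio long reads with an Illumina short-read set
lordec-correct -2 illumina_reads.fastq -k 19 -s 3 -i pacbio_reads.fastq -o pacbio_corrected.fasta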
ECTools is the one that has worked best for me. It's written for a particular kind of grid computing system though, so you may have to modify step 8 from their tutorial to suit your particular environment.
For running on a single server (which will be pretty slow, but this is just an example of how to wrap their scripts for a different scheduling system) I used the following bash script instead of steps 8 and 9:
#!/bin/bash
# Run the ECTools correct.sh script over all partitions on a single server,
# emulating the SGE array jobs from the tutorial with GNU parallel.

export TMPDIR=/a/directory/for/temporary_files
mkdir -p "$TMPDIR"

THREADS=12
NUM_PARTITIONS=0213            # 4-character-wide integer, left-padded with zeros
NUM_FILES_PER_PARTITION=500
ORGANISM_NAME=organism_name

# correct.sh reads SGE_TASK_ID, so fake it for each file within a partition
run_file() {
    export SGE_TASK_ID=$1
    ../correct.sh
}
export -f run_file

for i in `eval echo {0001..$NUM_PARTITIONS}`   # braces expand before variables, hence eval
do
    echo $i
    cd $i
    parallel -j $THREADS run_file ::: `eval echo {1..$NUM_FILES_PER_PARTITION}`
    cd ..
done

# Concatenate the corrected reads from all partition directories
cat ????/*.cor.fa > ${ORGANISM_NAME}.cor.fa
Well... according to the README file: "In short, the correction algorithm takes as input the unitigs from a short read assembly and uses them to correct long read data."
So, the answer to your question is yes, although, much like some of the tools mentioned in other answers, it relies on short reads to do so.
PacBioToCA (error correction via Celera Assembler) can also be used for error correcting PacBio reads using short reads including Illumina's.
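Roughly (a sketch, not the full recipe; the exact options depend on your Celera Assembler version and spec file), the Illumina reads are first converted to a .frg file and then passed to pacBioToCA:

# convert the Illumina reads to a Celera Assembler .frg file
fastqToCA -libraryname illumina -technology illumina -reads illumina.fastq > illumina.frg
# hybrid correction of the PacBio reads
pacBioToCA -length 500 -partitions 200 -l corrected_pacbio -s pacbio.spec -fastq pacbio_subreads.fastq illumina.frg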
Another one is LSC.
I am very happy with proovread.
It is extremely flexible with respect to the type of Illumina data (HiSeq, MiSeq, unitigs, etc.), quite fast, completely tunable, and the author (Thomas Hackl) is very responsive. We have used it to correct lots of PacBio data and it is extremely stable.
In contrast to ECTools, which takes much, much longer and produces cluster jobs with unpredictable runtimes (depending on how many repeats the PacBio reads contain), proovread jobs have a predictable runtime with little variation, which makes it easy to tailor jobs to the requirements of a compute cluster (runtimes, number of cores, etc.). Memory usage is minimal.
I used proovread recently to correct long reads by mapping short reads. For my data volume I had to use an HPC cluster for one week to finish the correction. The author of this software, Thomas, is very responsive, indeed. I also used pacBioToCA for PacBio self-correction with 40x coverage, even though the caveat for good performance is 50x. I didn't get satisfying results. With proovread you lose around 25% of the PacBio read length; with Celera, in my case, it was around 60%.
If we know the reference genome, why not correct the PacBio transcriptome data directly using the transcripts annotated from the genome? The work would then just be aligning the PacBio reads to the reference transcripts. Doesn't that seem easier?
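The alignment itself is straightforward; as a rough sketch with an older BLASR (pre-5.x option style, placeholder file names):

# map PacBio reads directly against the annotated reference transcripts
blasr pacbio_reads.fasta reference_transcripts.fasta -sam -bestn 1 -nproc 8 -out pacbio_vs_transcripts.sam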
Are there other algorithms for long read error correction?
What are chimeric positions?
Thank you.
PacBio reads can be chimeras, meaning a fusion of sequences that don't occur in that order in the sequenced sample. This can happen either if subreads are not split properly (
--subread--adapter--rev-comp-subread--
) or during library preparation by random ligation of fragments. Chimeric positions should indicate such breakpoints in a read.
Excuse me, I did not understand the definition of "chimeric reads".
Is there a clear definition?
Thank you.
http://drive5.com/usearch/manual/chimera_formation.html
The flag you mentioned:
--subread--adapter--rev-comp-subread--
What tool is that for? PacBio's consensus caller?
What are chimeric reads?