This post is a follow-up to a previous post: FASTQC and PacBio reads
I am trying to use the PBcR pipeline from the Celera Genome Assembler (v8.3) to perform HGAP for PacBio reads (http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR).
I've got the assembler installed, and I was able to successfully assemble the lambda genome in the example provided on the wiki page (see link above). I then tried running the assembler on my own PacBio reads using the following script:
#Celera genome assembler directory
CELERA="$HOME/wgs-8.3rc2/Linux-amd64/bin"
#Output directory
OUT="celera_output_3"
#Variables from parameters
FILE=$1
NAME=$2
SPEC=$3
#Raw data directory
#RAW="raw_data_test_phage"
#Perl environment variable
export PERLLIB=~/perl/modules/lib/perl5
export PERL5LIB=~/perl/modules/lib/perl5
#Create output directory and switch to it
mkdir -p "$OUT/$NAME"
cd "$OUT/$NAME" || exit 1
#Run assembler (input paths are relative to the directory the script was started from)
"$CELERA/PBcR" -length 5000 -s "../../$SPEC" -l "$NAME" -fastq "../../$FILE" genomeSize=50000
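Because the script changes into the output directory before calling PBcR, a mistyped reads or spec path only surfaces once the assembler is already running. A minimal sketch of a pre-flight check on the three positional parameters follows; the function name `check_inputs` and its messages are hypothetical and not part of the Celera Assembler.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of wgs-assembler): check the three
# positional parameters before switching directories and launching PBcR.
check_inputs() {
    local file="$1" name="$2" spec="$3"
    # All three parameters must be supplied.
    if [ -z "$file" ] || [ -z "$name" ] || [ -z "$spec" ]; then
        echo "usage: <script> <reads.fastq> <name> <spec file>" >&2
        return 2
    fi
    # The reads file must exist and be non-empty.
    if [ ! -s "$file" ]; then
        echo "reads file missing or empty: $file" >&2
        return 1
    fi
    # The spec file must exist and be non-empty.
    if [ ! -s "$spec" ]; then
        echo "spec file missing or empty: $spec" >&2
        return 1
    fi
    return 0
}
```

It would be called as `check_inputs "$FILE" "$NAME" "$SPEC" || exit 1` before the `mkdir` line, while the paths are still relative to the starting directory.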
I do not get an asm.asm or asm.qc file. I also don't see any obvious errors in the log files. Then again, the log file that the Celera Assembler produces is quite long, and I may be missing something. The structure of the output (i.e. files and directories) looks like this:
|-- [NAME]
|   |-- 0-mercounts
|   |-- 0-mertrim
|   |-- 0-overlaptrim
|   |-- 0-overlaptrim-overlap
|   |-- 1-overlapper
|   |-- 3-overlapcorrection
|   |-- 4-unitigger
|   |-- 5-consensus
|   |-- 5-consensus-coverage-stat
|   |-- 5-consensus-insert-sizes
|   |-- asm.gkpStore
|   |-- asm.gkpStore.err
|   |-- asm.gkpStore.errorLog
|   |-- asm.gkpStore.fastqUIDmap
|   |-- asm.gkpStore.info
|   |-- asm.ovlStore
|   |-- asm.ovlStore.err
|   |-- asm.ovlStore.list
|   |-- asm.tigStore
|   `-- runCA-logs
|-- [NAME].correction.err
|-- [NAME].correction.hist
|-- [NAME].fasta
|-- [NAME].fastq
|-- [NAME].frg
|-- [NAME].log
|-- [NAME].longest25.fastq -> [NAME].fastq
|-- [NAME].longest25.frg -> [NAME].frg
|-- [NAME].qual
`-- temp[NAME]
    |-- 1-overlapper
    |-- [NAME].frg
    |-- [NAME].spec
    |-- asm.eidToIID
    |-- asm.gkpStore.err
    |-- asm.gkpStore.errorLog
    |-- asm.gkpStore.fastqUIDmap
    |-- asm.gkpStore.info
    |-- asm.hist
    |-- asm.ignore
    |-- asm.iidToLen
    |-- asm.layout.err
    |-- asm.layout.hist
    |-- asm.layout.success
    |-- asm.ovlStore.err
    |-- asm.ovlStore.list
    |-- asm.seedlength
    |-- asm.split.allEdit
    |-- asm.split.uid
    |-- asm.toerase.err
    |-- asm.toerase.out
    |-- asm.toerase.uid
    |-- asm.totalInputBP
    |-- corrected.log
    |-- runCA-logs
    |-- runCorrection.sh
    `-- runPartition.sh
So my questions are as follows:
- Why am I not getting an asm.asm (the assembly, I assume) or an asm.qc (assembly statistics) file?
- If the assembly failed, where in the logs can I find an indication of why it failed?
- The lambda example included a parameter called -partitions. What is this parameter? I couldn't find an explanation for it, and I didn't include it in my script.
- The raw data that we received all had the suffix .subreads.fastq. Is there a post-processing step that needs to be run before I run the assembly?
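On the question of where to look in the logs, a rough sketch of one way to narrow the search follows. It lists the non-empty .err files and then greps every text file under the output directory for a few failure keywords. The function name `scan_logs` and the keyword list are assumptions rather than anything the assembler documents, and parameter names that merely contain "error" will show up as false positives.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of wgs-assembler): summarise possible
# failure indicators under an assembly output directory.
scan_logs() {
    local dir="${1:-.}"
    echo "== non-empty .err files =="
    # -size +0c matches files larger than zero bytes.
    find "$dir" -type f -name '*.err' -size +0c | sort
    echo "== suspicious lines =="
    # -r recurse, -I skip binary files (e.g. the stores), -n line numbers,
    # -i case-insensitive, -E extended regex. The keyword list is a guess.
    grep -rIniE 'error|fail|killed|out of memory|segmentation|bad_alloc' "$dir" | sort
}
```

Running `scan_logs celera_output_3/[NAME]` prints matches as `path:line number:text`, which makes it easier to open the relevant file in runCA-logs at the right place.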
There was a _utgcnsfix file, but no _utgcns file. The contents of this file (1446144349_sipsey-compute-1-12.local_20669_utgcnsfix) were as follows:
I browsed through the other files in this directory, and I didn't see any obvious error messages. All of the other files looked like this. The contents of the runCA-logs directory look like this:
Does any of this look unusual? In the meantime, I'll queue up another run with increased memory.