Estimate insert size for PacBio long reads

I am trying to use PacBio long reads to do gap filling with the GapFiller program. What is the correct way of specifying the insert size in the library file? In other words, how do I estimate the insert size for PacBio (SMRT) long reads?

Thanks in advance.

assembly next-gen gapfilling
You should probably use PBJelly (https://sourceforge.net/p/pb-jelly/wiki/Home/). Here are some instructions for running PBJelly on PacBio Sequel subreads and then correcting indels with a variant caller instead of Pilon:

# goes along with http://seqanswers.com/forums/showthread.php?p=220925#post220925
#
# assumes you have PBJelly, blasr, tabix, bcftools, samtools installed
# below I am using a machine with 70 cores on a single node, adjust to the number of cores on your machine
# The scripts below are obviously not designed for use with a cluster, but can be modified
#
#########################
# STEP 1 Combine the FASTQ files and remove the originals to save space
#########################
## first combine files and delete the originals to save space
WorkDir=/genetics/elbers/pacbio
cd ${WorkDir}
cat ?.subreads.fastq > pacbio-genome-pbjelly.fastq
rm ?.subreads.fastq
## second add fake quality values (replacing the uniform Q0 = "!" qualities with Q29 = ">" in Sanger encoding) so pbjelly will work
cat pacbio-genome-pbjelly.fastq | tr "!" ">" > pacbio-genome-pbjelly-fqual.fastq
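The tr trick above leans on the Sanger (phred+33) convention: a base's quality score is its ASCII code minus 33. A quick sanity check in the shell (nothing here is specific to pbjelly):

```shell
# Sanger/phred+33: quality score = ASCII code of the character minus 33
ascii() { printf '%d' "'$1"; }    # print the ASCII code of a single character

echo $(( $(ascii '!') - 33 ))     # '!' is ASCII 33, so Q0
echo $(( $(ascii '>') - 33 ))     # '>' is ASCII 62, so Q29
```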
#########################
# STEP 2 Create pbjelly-config-assemble.xml
#########################
## use nano to create the file then paste everything between the lines starting with #### and ending with ####
nano pbjelly-config-assemble.xml
####pbjelly-config-assemble.xml####
<jellyProtocol>
<reference>/genetics/pacbio/genome.fasta</reference>
<outputDir>/genetics/pacbio/</outputDir>
<cluster>
<command notes="For single node, multi-core machines" >${CMD} ${JOBNAME} 2> ${STDERR} 1> ${STDOUT} &amp;</command>
<nJobs>70</nJobs>
</cluster>
<blasr>-minMatch 8 -minPctIdentity 70 -bestn 1 -nCandidates 10 -maxScore -500 -nproc 1 -noSplitSubreads</blasr>
<input baseDir="/genetics/pacbio/">
<job>pacbio-genome-pbjelly-fqual.fastq</job>
</input>
</jellyProtocol>
####pbjelly-config-assemble.xml####
#########################
# STEP 3 Create fake qualities for the reference whose gaps will be filled
#########################
/opt/PBSuite_15.8.24/bin/fakeQuals.py genome.fasta genome.qual
#########################
# STEP 4 Summarize the assembly and reads
#########################
## first Summarize the assembly
/opt/PBSuite_15.8.24/bin/summarizeAssembly.py genome.fasta > genome.fasta.summary
## second Summarize the reads
/opt/PBSuite_15.8.24/bin/readSummary.py pbjelly-config-assemble.xml > pacbio-genome-pbjelly-fqual.fastq.summary
#########################
# STEP 5 Setup pbjelly
#########################
## note - this step takes about 1 hour
## note2 - pbjelly steps are submitted in the background, so even though a step reports
## that it is finished, it might not be; check whether pbjelly processes are still running,
## for example with "ps aux|grep -v "root"|less -S", looking for PBJellySuite,
## blasr, Setup.py, Extraction.py, Support.py, Assemble.py, or Out.py
/opt/PBSuite_15.8.24/bin/Jelly.py setup pbjelly-config-assemble.xml
#########################
# STEP 6 Map FASTQ reads to genome.fasta reference
#########################
/opt/PBSuite_15.8.24/bin/Jelly.py mapping pbjelly-config-assemble.xml
#########################
# STEP 7 Summarize mapping results
#########################
/opt/PBSuite_15.8.24/bin/Jelly.py support pbjelly-config-assemble.xml -x "--minMapq=50"
#########################
# STEP 8 Extract reads that appear to bridge gaps
#########################
/opt/PBSuite_15.8.24/bin/Jelly.py extraction pbjelly-config-assemble.xml
#########################
# STEP 9 Assemble reads to fill in gaps
#########################
/opt/PBSuite_15.8.24/bin/Jelly.py assembly pbjelly-config-assemble.xml
#########################
# STEP 10 Produce output with gaps filled in
#########################
/opt/PBSuite_15.8.24/bin/Jelly.py output pbjelly-config-assemble.xml
#########################
# STEP 11 Convert het (ambiguity) bases to hom and convert lowercase to uppercase
#########################
## see https://github.com/lh3/seqtk for a link to download seqtk
/opt/seqtk/seqtk randbase jelly.out.fasta |/opt/seqtk/seqtk seq -U > jelly.out.upper.fasta
#########################
# STEP 12 Run a variant caller or Pilon (twice)
# to correct indels in the PacBio-filled gaps of jelly.out.fasta using Illumina reads
#########################
## personally, whether you use Pilon or a variant caller (such as callvariants.sh from BBTools),
## I found that mapping with BBMap's vslow=t option gave better indel correction for simulated data;
## for real data, using the steps below outperformed Pilon in terms of RNA-Seq mapping rates
##
## presumably you can use your 10x reads to correct indels?
## the steps below call this paired-end read file 10x-illumina-quality-controlled-and-trimmed-reads.fq.gz
#########################
# STEP 13 Error correct with BBMap once
#########################
cd /genetics/user
# download latest version of BBTools
wget https://sourceforge.net/projects/bbmap/files/BBMap_38.25.tar.gz
tar xzf BBMap_38.25.tar.gz
cd bbmap
# compile jni components
cd jni
# add export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-amd64" to ~/.bashrc, or wherever your JVM is installed
# then source ~/.bashrc
make -f makefile.linux
cd ${WorkDir}
# map reads with bbmap 38.25 in vslow mode using jni
/genetics/user/bbmap/bbmap.sh -Xmx350g -eoom usejni=t pigz=t threads=75 vslow=t ref=jelly.out.upper.fasta in=10x-illumina-quality-controlled-and-trimmed-reads.fq.gz \
out=10x-illumina-quality-controlled-and-trimmed-reads-mapped-to-jelly.out.upper.fasta-using-bbmap.bam overwrite=t 2> bbmap.log
# sort and index the BAM with samtools (callvariants.sh below expects the sorted BAM)
samtools sort -@ 75 -o 10x-illumina-quality-controlled-and-trimmed-reads-mapped-to-jelly.out.upper.fasta-using-bbmap.sorted.bam \
10x-illumina-quality-controlled-and-trimmed-reads-mapped-to-jelly.out.upper.fasta-using-bbmap.bam
samtools index 10x-illumina-quality-controlled-and-trimmed-reads-mapped-to-jelly.out.upper.fasta-using-bbmap.sorted.bam
# callvariants 38.25
/genetics/user/bbmap/callvariants.sh -Xmx350g threads=75 ref=jelly.out.upper.fasta in=10x-illumina-quality-controlled-and-trimmed-reads-mapped-to-jelly.out.upper.fasta-using-bbmap.sorted.bam \
ploidy=2 vcf=10x-illumina-quality-controlled-and-trimmed-reads-mapped-to-jelly.out.upper.fasta-using-bbmap.vcf minscore=20.0 overwrite=t 2> callvariants.log
# use bgzip and tabix to compress then index the vcf file
bgzip -f 10x-illumina-quality-controlled-and-trimmed-reads-mapped-to-jelly.out.upper.fasta-using-bbmap.vcf
tabix -p vcf 10x-illumina-quality-controlled-and-trimmed-reads-mapped-to-jelly.out.upper.fasta-using-bbmap.vcf.gz
# use bcftools to get the consensus (indel-corrected) sequence
bcftools consensus -f jelly.out.upper.fasta 10x-illumina-quality-controlled-and-trimmed-reads-mapped-to-jelly.out.upper.fasta-using-bbmap.vcf.gz -o jelly.out.upper.bbmap-1.fasta 2> bcftools.log
I also have another gist if you want instructions for running Pilon:
#! /bin/bash
set -e
# installing fasta-splitter.pl
## wget http://kirill-kryukov.com/study/tools/fasta-splitter/files/fasta-splitter-0.2.6.zip
## unzip fasta-splitter-0.2.6.zip
# assumes initial genome to be error-corrected by pilon is called
## genome.pilon-0.fasta
# set variables
# RAM in gigabytes
RAM=350
interleavedReadsPath=/genetics/pacbio/pilon-vs-racon/chr20.pbjelly.fastq.gz
workDir=/genetics/pacbio/pilon-vs-racon/pilon
fastaSplitterPath=/genetics/pacbio/fasta-splitter.pl
samtoolsPath=samtools
bwaPath=bwa
pilonPath=/opt/pilon/pilon-1.22.jar
seqtkPath=/opt/seqtk/seqtk
# number of cores on computer
numThreads=75
# start of script
x=0
while [ $x -lt 2 ]
do
y=$(($x+1))
### a get the contig lengths with ${samtoolsPath} faidx
cd ${workDir}
${samtoolsPath} faidx genome.pilon-${x}.fasta
### b split genome into 39 parts with fasta-splitter.pl
perl ${fastaSplitterPath} --n-parts 39 genome.pilon-${x}.fasta --out-dir pilon-${y}/
cd ${workDir}/pilon-${y}
### c create a sequence from 01,02,...,39
seq -w 1 39 > samples
### d get the contig names
while read i;do grep ">" genome.pilon-${x}.part-${i}.fasta|perl -pe "s/>//g"|cut -d " " -f 1 > targetlist${i}; done < samples
### e get the contig lengths
cut -f 1-2 ../genome.pilon-${x}.fasta.fai > pilon.chromosome.lengths.txt
### f copy the sample list for the target-splitting loop below
cp samples samples-to-split
### g write script to make target lists for pilon
### The script below breaks up the task into ${numThreads} pieces
while read i
do
lines="$(wc -l targetlist${i}|cut -d " " -f 1)"
while read line;
do
chrlength="$(grep -P "^$line\t" pilon.chromosome.lengths.txt | cut -f 2)"
numberofsplits="$(perl -w -e "use POSIX; print ceil(${numThreads}/$lines)")"
numberofsplits2=$(($numberofsplits - 1))
splitlength="$(perl -w -e "use POSIX; print ceil($chrlength/$numberofsplits)")"
splits=1
while [[ $splits -lt $numberofsplits ]]
do
if [[ $splits -eq 1 ]]
then
splitlength3="$(($splitlength + 1))"
stopbase=$(($splitlength * ($splits + 1)))
echo "${line}:1-${splitlength}" >> targetlist-${i}
echo "${line}:${splitlength3}-${stopbase}" >> targetlist-${i}
((splits++))
elif [[ $splits -eq $numberofsplits2 ]]
then
startbase=$((($splitlength * ($splits)) + 1))
echo "${line}:${startbase}-${chrlength}" >> targetlist-${i}
((splits++))
else
stopbase=$(($splitlength * ($splits + 1)))
startbase=$((($splitlength * $splits) + 1))
echo "${line}:${startbase}-${stopbase}" >> targetlist-${i}
((splits++))
fi
done
done < targetlist${i}
done < samples-to-split
### h rename targetlist-01 to targetlist01
while read i;do
mv -f targetlist-${i} targetlist${i}
done < samples-to-split
### i make a list of the targetlists
while read i;do
ls targetlist${i} >> targetlists
done < samples
### j bwa index, bwa mem, sam2bam, samtools sort, samtools index
cd ${workDir}
${bwaPath} index -a bwtsw genome.pilon-${x}.fasta > bwa-index.log 2>&1
${bwaPath} mem -p -M -t ${numThreads} genome.pilon-${x}.fasta \
${interleavedReadsPath} 2> bwa.log |\
${samtoolsPath} view -@${numThreads} -F 4 -Sb - |${samtoolsPath} sort -@${numThreads} - > reads-mapped-to-genome.pilon-${x}.fasta.bam
${samtoolsPath} index -@${numThreads} reads-mapped-to-genome.pilon-${x}.fasta.bam
### k run pilon (takes about 1-2 days)
cd ${workDir}/pilon-${y}
while read i;do
java -Xmx${RAM}g -jar ${pilonPath} \
--genome ../genome.pilon-${x}.fasta \
--frags ../reads-mapped-to-genome.pilon-${x}.fasta.bam \
--output ${i}-pilon \
--targets ${i} \
--diploid \
--changes \
--fix bases --threads ${numThreads} > ${i}-pilon.log 2>&1
done < targetlists
### l combine output files
#### i get rid of extra headers in each output file
while read i;do
${seqtkPath} seq -l0 targetlist${i}-pilon.fasta | awk '!seen[$1]++'|${seqtkPath} seq -l80 > targetlist${i}-pilon.fasta2
done < samples
#### ii combine the output files
cat $(find ./ -name "targetlist*-pilon.fasta2" | sort -V) > pilon.fasta
#### iii change output file name and remove "_pilon" from end of contig/scaffold names
${seqtkPath} randbase pilon.fasta | ${seqtkPath} seq -U > ../genome.pilon-${y}.fasta
perl -pi -e "s/_pilon//g" ../genome.pilon-${y}.fasta
### m increase loop value
((x++))
done
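For reference, the target-window arithmetic in section g of the script is easier to follow with toy numbers. The sketch below uses hypothetical values (one 1000 bp contig, 4 threads, 1 contig in the list) and skips the script's special-casing of the first and last windows, but emits the same chr:start-stop regions that end up in the targetlist files:

```shell
# Toy illustration of the pilon target-window arithmetic (hypothetical values)
chr=contig1; chrlength=1000; numThreads=4; lines=1
# integer ceiling division: ceil(a/b) = (a + b - 1) / b
numberofsplits=$(( (numThreads + lines - 1) / lines ))                # ceil(4/1)    = 4
splitlength=$(( (chrlength + numberofsplits - 1) / numberofsplits ))  # ceil(1000/4) = 250
for (( s = 0; s < numberofsplits; s++ )); do
    start=$(( s * splitlength + 1 ))
    stop=$(( (s + 1) * splitlength ))
    if (( stop > chrlength )); then stop=$chrlength; fi   # clamp the last window
    echo "${chr}:${start}-${stop}"
done
# prints contig1:1-250, contig1:251-500, contig1:501-750, contig1:751-1000
```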
However, I have found that BBTools' variant caller (callvariants.sh) did a better job in terms of BUSCO scores and RNA-Seq mapping rates than running Pilon.
