Estimate insert size for PacBio long reads

I am trying to use PacBio long reads to do gap filling with the GapFiller program. What is the correct way of specifying the insert size in the library file? In other words, how do I estimate the insert size for PacBio (SMRT) long reads?

Thanks in advance.

assembly next-gen gapfilling
You should probably use PBJelly (https://sourceforge.net/p/pb-jelly/wiki/Home/). Here are some instructions for running PBJelly on PacBio Sequel subreads and then correcting indels with a variant caller instead of Pilon:

# goes along with http://seqanswers.com/forums/showthread.php?p=220925#post220925
#
# assumes you have PBJelly, blasr, tabix, bcftools, samtools installed
# below I am using a machine with 70 cores on a single node, adjust to the number of cores on your machine
# The scripts below are obviously not designed for use with a cluster, but can be modified
#
#########################
# STEP 1 Combine the FASTQ files and remove the originals to save space
#########################
## first combine files and delete the originals to save space
WorkDir=/genetics/elbers/pacbio
cd ${WorkDir}
cat ?.subreads.fastq > pacbio-genome-pbjelly.fastq
rm ?.subreads.fastq
## second add fake quality values (replacing the uniform Q0 = "!" qualities with Q29 = ">" in Sanger encoding) so pbjelly will work
cat pacbio-genome-pbjelly.fastq | tr "!" ">" > pacbio-genome-pbjelly-fqual.fastq
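The tr trick above leans on the Sanger (phred+33) convention: a base's quality score is its ASCII code minus 33. A quick sanity check in the shell (nothing here is specific to pbjelly):

```shell
# Sanger/phred+33: quality score = ASCII code of the character minus 33
ascii() { printf '%d' "'$1"; }    # print the ASCII code of a single character

echo $(( $(ascii '!') - 33 ))     # '!' is ASCII 33, so Q0
echo $(( $(ascii '>') - 33 ))     # '>' is ASCII 62, so Q29
```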
#########################
# STEP 2 Create pbjelly-config-assemble.xml
#########################
## use nano to create the file then paste everything between the lines starting with #### and ending with ####
nano pbjelly-config-assemble.xml
####pbjelly-config-assemble.xml####
<jellyProtocol>
<reference>/genetics/pacbio/genome.fasta</reference>
<outputDir>/genetics/pacbio/</outputDir>
<cluster>
<command notes="For single node, multi-core machines" >${CMD} ${JOBNAME} 2> ${STDERR} 1> ${STDOUT} &amp;</command>
<nJobs>70</nJobs>
</cluster>
<blasr>-minMatch 8 -minPctIdentity 70 -bestn 1 -nCandidates 10 -maxScore -500 -nproc 1 -noSplitSubreads</blasr>
<input baseDir="/genetics/pacbio/">
<job>pacbio-genome-pbjelly-fqual.fastq</job>
</input>
</jellyProtocol>
####pbjelly-config-assemble.xml####
#########################
# STEP 3 Create fake qualities for the reference whose gaps will be filled
#########################
/opt/PBSuite_15.8.24/bin/fakeQuals.py genome.fasta genome.qual
#########################
# STEP 4 Summarize the assembly and reads
#########################
## first Summarize the assembly
/opt/PBSuite_15.8.24/bin/summarizeAssembly.py genome.fasta > genome.fasta.summary
## second Summarize the reads
/opt/PBSuite_15.8.24/bin/readSummary.py pbjelly-config-assemble.xml > pacbio-genome-pbjelly-fqual.fastq.summary
#########################
# STEP 5 Setup pbjelly
#########################
## note - this step takes about 1 hour
## note2 - pbjelly steps are submitted in the background, so even though a step reports
## that it is finished, it might not be; check whether pbjelly processes are still running,
## for example with "ps aux|grep -v "root"|less -S", looking for PBJellySuite,
## blasr, Setup.py, Extraction.py, Support.py, Assemble.py, or Out.py
/opt/PBSuite_15.8.24/bin/Jelly.py setup pbjelly-config-assemble.xml
#########################
# STEP 6 Map FASTQ reads to genome.fasta reference
#########################
/opt/PBSuite_15.8.24/bin/Jelly.py mapping pbjelly-config-assemble.xml
#########################
# STEP 7 Summarize mapping results
#########################
/opt/PBSuite_15.8.24/bin/Jelly.py support pbjelly-config-assemble.xml -x "--minMapq=50"
#########################
# STEP 8 Extract reads that appear to bridge gaps
#########################
/opt/PBSuite_15.8.24/bin/Jelly.py extraction pbjelly-config-assemble.xml
#########################
# STEP 9 Assemble reads to fill in gaps
#########################
/opt/PBSuite_15.8.24/bin/Jelly.py assembly pbjelly-config-assemble.xml
#########################
# STEP 10 Produce output with gaps filled in
#########################
/opt/PBSuite_15.8.24/bin/Jelly.py output pbjelly-config-assemble.xml
#########################
# STEP 11 Convert het (ambiguity) bases to hom and convert lowercase to uppercase
#########################
## see https://github.com/lh3/seqtk for a link to download seqtk
/opt/seqtk/seqtk randbase jelly.out.fasta |/opt/seqtk/seqtk seq -U > jelly.out.upper.fasta
#########################
# STEP 12 Run a variant caller or Pilon (twice)
# to correct indels in the PacBio-filled gaps of jelly.out.fasta using Illumina reads
#########################
## personally, whether you use Pilon or a variant caller (such as callvariants.sh from BBTools),
## I found that mapping with BBMap's vslow=t option gave better indel correction for simulated data;
## for real data, using the steps below outperformed Pilon in terms of RNA-Seq mapping rates
##
## presumably you can use your 10x reads to correct indels?
## the steps below call this paired-end read file 10x-illumina-quality-controlled-and-trimmed-reads.fq.gz
#########################
# STEP 13 Error correct with BBMap once
#########################
cd /genetics/user
# download latest version of BBTools
wget https://sourceforge.net/projects/bbmap/files/BBMap_38.25.tar.gz
tar xzf BBMap_38.25.tar.gz
cd bbmap
# compile jni components
cd jni
# add export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-amd64" to ~/.bashrc, or wherever your JVM is installed
# then source ~/.bashrc
make -f makefile.linux
cd ${WorkDir}
# map reads with bbmap 38.25 in vslow mode using jni
/genetics/user/bbmap/bbmap.sh -Xmx350g -eoom usejni=t pigz=t threads=75 vslow=t ref=jelly.out.upper.fasta in=10x-illumina-quality-controlled-and-trimmed-reads.fq.gz \
out=10x-illumina-quality-controlled-and-trimmed-reads-mapped-to-jelly.out.upper.fasta-using-bbmap.bam overwrite=t 2> bbmap.log
# sort and index the BAM with samtools (callvariants.sh below expects the sorted BAM)
samtools sort -@ 75 -o 10x-illumina-quality-controlled-and-trimmed-reads-mapped-to-jelly.out.upper.fasta-using-bbmap.sorted.bam \
10x-illumina-quality-controlled-and-trimmed-reads-mapped-to-jelly.out.upper.fasta-using-bbmap.bam
samtools index 10x-illumina-quality-controlled-and-trimmed-reads-mapped-to-jelly.out.upper.fasta-using-bbmap.sorted.bam
# callvariants 38.25
/genetics/user/bbmap/callvariants.sh -Xmx350g threads=75 ref=jelly.out.upper.fasta in=10x-illumina-quality-controlled-and-trimmed-reads-mapped-to-jelly.out.upper.fasta-using-bbmap.sorted.bam \
ploidy=2 vcf=10x-illumina-quality-controlled-and-trimmed-reads-mapped-to-jelly.out.upper.fasta-using-bbmap.vcf minscore=20.0 overwrite=t 2> callvariants.log
# use bgzip and tabix to compress then index the vcf file
bgzip -f 10x-illumina-quality-controlled-and-trimmed-reads-mapped-to-jelly.out.upper.fasta-using-bbmap.vcf
tabix -p vcf 10x-illumina-quality-controlled-and-trimmed-reads-mapped-to-jelly.out.upper.fasta-using-bbmap.vcf.gz
# use bcftools to get the consensus (indel-corrected) sequence
bcftools consensus -f jelly.out.upper.fasta 10x-illumina-quality-controlled-and-trimmed-reads-mapped-to-jelly.out.upper.fasta-using-bbmap.vcf.gz -o jelly.out.upper.bbmap-1.fasta 2> bcftools.log
I also have another gist if you want instructions for running Pilon:
#! /bin/bash
set -e
# installing fasta-splitter.pl
## wget http://kirill-kryukov.com/study/tools/fasta-splitter/files/fasta-splitter-0.2.6.zip
## unzip fasta-splitter-0.2.6.zip
# assumes initial genome to be error-corrected by pilon is called
## genome.pilon-0.fasta
# set variables
# RAM in gigabytes
RAM=350
interleavedReadsPath=/genetics/pacbio/pilon-vs-racon/chr20.pbjelly.fastq.gz
workDir=/genetics/pacbio/pilon-vs-racon/pilon
fastaSplitterPath=/genetics/pacbio/fasta-splitter.pl
samtoolsPath=samtools
bwaPath=bwa
pilonPath=/opt/pilon/pilon-1.22.jar
seqtkPath=/opt/seqtk/seqtk
# number of cores on computer
numThreads=75
# start of script
x=0
while [ $x -lt 2 ]
do
y=$(($x+1))
### a get the contig lengths with ${samtoolsPath} faidx
cd ${workDir}
${samtoolsPath} faidx genome.pilon-${x}.fasta
### b split genome into 39 parts with fasta-splitter.pl
perl ${fastaSplitterPath} --n-parts 39 genome.pilon-${x}.fasta --out-dir pilon-${y}/
cd ${workDir}/pilon-${y}
### c create a sequence from 01,02,...,39
seq -w 1 39 > samples
### d get the contig names
while read i;do grep ">" genome.pilon-${x}.part-${i}.fasta|perl -pe "s/>//g"|cut -d " " -f 1 > targetlist${i}; done < samples
### e get the contig lengths
cut -f 1-2 ../genome.pilon-${x}.fasta.fai > pilon.chromosome.lengths.txt
### f copy the sample list for the target-splitting loop below
cp samples samples-to-split
### g write script to make target lists for pilon
### The script below breaks up the task into ${numThreads} pieces
while read i
do
lines="$(wc -l targetlist${i}|cut -d " " -f 1)"
while read line;
do
chrlength="$(grep -P "^$line\t" pilon.chromosome.lengths.txt | cut -f 2)"
numberofsplits="$(perl -w -e "use POSIX; print ceil(${numThreads}/$lines)")"
numberofsplits2=$(($numberofsplits - 1))
splitlength="$(perl -w -e "use POSIX; print ceil($chrlength/$numberofsplits)")"
splits=1
while [[ $splits -lt $numberofsplits ]]
do
if [[ $splits -eq 1 ]]
then
splitlength3="$(($splitlength + 1))"
stopbase=$(($splitlength * ($splits + 1)))
echo "${line}:1-${splitlength}" >> targetlist-${i}
echo "${line}:${splitlength3}-${stopbase}" >> targetlist-${i}
((splits++))
elif [[ $splits -eq $numberofsplits2 ]]
then
startbase=$((($splitlength * ($splits)) + 1))
echo "${line}:${startbase}-${chrlength}" >> targetlist-${i}
((splits++))
else
stopbase=$(($splitlength * ($splits + 1)))
startbase=$((($splitlength * $splits) + 1))
echo "${line}:${startbase}-${stopbase}" >> targetlist-${i}
((splits++))
fi
done
done < targetlist${i}
done < samples-to-split
### h rename targetlist-01 to targetlist01
while read i;do
mv -f targetlist-${i} targetlist${i}
done < samples-to-split
### i make a list of the targetlists
while read i;do
ls targetlist${i} >> targetlists
done < samples
### j bwa index, bwa mem, sam2bam, samtools sort, samtools index
cd ${workDir}
${bwaPath} index -a bwtsw genome.pilon-${x}.fasta > bwa-index.log 2>&1
${bwaPath} mem -p -M -t ${numThreads} genome.pilon-${x}.fasta \
${interleavedReadsPath} 2> bwa.log |\
${samtoolsPath} view -@${numThreads} -F 4 -Sb - |${samtoolsPath} sort -@${numThreads} - > reads-mapped-to-genome.pilon-${x}.fasta.bam
${samtoolsPath} index -@${numThreads} reads-mapped-to-genome.pilon-${x}.fasta.bam
### k run pilon (takes about 1-2 days)
cd ${workDir}/pilon-${y}
while read i;do
java -Xmx${RAM}g -jar ${pilonPath} \
--genome ../genome.pilon-${x}.fasta \
--frags ../reads-mapped-to-genome.pilon-${x}.fasta.bam \
--output ${i}-pilon \
--targets ${i} \
--diploid \
--changes \
--fix bases --threads ${numThreads} > ${i}-pilon.log 2>&1
done < targetlists
### l combine output files
#### i get rid of extra headers in each output file
while read i;do
${seqtkPath} seq -l0 targetlist${i}-pilon.fasta | awk '!seen[$1]++'|${seqtkPath} seq -l80 > targetlist${i}-pilon.fasta2
done < samples
#### ii combine the output files
cat $(find ./ -name "targetlist*-pilon.fasta2" | sort -V) > pilon.fasta
#### iii change output file name and remove "_pilon" from end of contig/scaffold names
${seqtkPath} randbase pilon.fasta | ${seqtkPath} seq -U > ../genome.pilon-${y}.fasta
perl -pi -e "s/_pilon//g" ../genome.pilon-${y}.fasta
### m increase loop value
((x++))
done
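For reference, the target-window arithmetic in section g of the script is easier to follow with toy numbers. The sketch below uses hypothetical values (one 1000 bp contig, 4 threads, 1 contig in the list) and skips the script's special-casing of the first and last windows, but emits the same chr:start-stop regions that end up in the targetlist files:

```shell
# Toy illustration of the pilon target-window arithmetic (hypothetical values)
chr=contig1; chrlength=1000; numThreads=4; lines=1
# integer ceiling division: ceil(a/b) = (a + b - 1) / b
numberofsplits=$(( (numThreads + lines - 1) / lines ))                # ceil(4/1)    = 4
splitlength=$(( (chrlength + numberofsplits - 1) / numberofsplits ))  # ceil(1000/4) = 250
for (( s = 0; s < numberofsplits; s++ )); do
    start=$(( s * splitlength + 1 ))
    stop=$(( (s + 1) * splitlength ))
    if (( stop > chrlength )); then stop=$chrlength; fi   # clamp the last window
    echo "${chr}:${start}-${stop}"
done
# prints contig1:1-250, contig1:251-500, contig1:501-750, contig1:751-1000
```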
However, I have found that BBTools' variant caller (callvariants.sh) did a better job in terms of BUSCO scores and RNA-Seq mapping rates than running Pilon.
