Hello everyone, I am currently using the Docker image for PureCN and this is my workflow. Could anyone advise me on whether this workflow is right? The reason I ask this is because the results generated using this workflow are significantly different from the results I get from using Sequenza for 10 paired tumor-germline samples:
#Step 1 - Generating interval files
Rscript $PURECN/IntervalFile.R --in-file $BAIT_COORDINATES \
--fasta $REF_GENOME_b37 --out-file $OUT_REF/baits_hg19_intervals.txt \
--off-target --genome hg19 \
--export $OUT_REF/baits_optimized_hg19.bed \
--mappability $MAPPABILITY_FILE_B37
#Step 2 - Run mutect on 10 germline samples, create genomics db and use CreateSomaticPanelOfNormals
> $PURECN_G/G_sample_map
count=1
while IFS='' read -r line
do
if [ ! -z $GERMLINE_PFX ] && [ ! -f $PURECN_G/${GERMLINE_PFX}.vcf.gz ] && [ $count -le 10 ]; then
count=$((count+1))
echo $GERMLINE_PFX
#For each germline bam files
echo "Running Mutect2 on germline bams.................."
#Run mutect on 10 germline samples
gatk Mutect2 \
-R $REF_GENOME_b37 \
-I $SAVEROOT/$TUMOR_PFX/recal/bqsr_${GERMLINE_PFX}*.bam \
--max-mnp-distance 0 \
--genotype-germline-sites true \
--genotype-pon-sites true \
--interval-padding 75 \
--L $BEDFILE_0B \
-O $PURECN_G/${GERMLINE_PFX}.vcf.gz
#Create a map file of all germline vcfs
echo -e "$GERMLINE_PFX\t$PURECN_G/${GERMLINE_PFX}.vcf.gz" >> $PURECN_G/G_sample_map
fi
done<$LIST_OF_SAMPLES
#Create a Genomics DB from the normal Mutect2 calls
gatk GenomicsDBImport \
-R $REF_GENOME_b37 \
-L $BEDFILE_0B \
--genomicsdb-workspace-path $PURECN_G/pon_db \
--sample-name-map $PURECN_G/G_sample_map
#Combine the normal calls using CreateSomaticPanelOfNormals
gatk CreateSomaticPanelOfNormals \
-R $REF_GENOME_b37 \
--germline-resource $GERMLINE_RESOURCE \
-V gendb://$PURECN_G/pon_db \
-O $PURECN_G/pon.vcf.gz
#Step 3 - Run Mutect on unmatched tumor samples
while IFS='' read -r line
do
if [ ! -z $TUMOR_PFX ] && [ ! -f $PURECN_T/${TUMOR_PFX}.vcf.gz ]; then
#Run Mutect2 for each tumor sample
gatk Mutect2 \
-R $REF_GENOME_b37 \
-I $SAVEROOT/$TUMOR_PFX/recal/bqsr_${TUMOR_PFX}*.bam \
-pon $PURECN_G/pon.vcf.gz \
--germline-resource $GERMLINE_RESOURCE\
--genotype-germline-sites true \
--genotype-pon-sites true \
--interval-padding 75 \
-L $BEDFILE_0B \
-O $PURECN_T/${TUMOR_PFX}.vcf.gz
fi
done<$LIST_OF_SAMPLES
#Step 4 - calculate gc normalized coverages for germlines
count=1
while IFS='' read -r line
do
TUMOR_PFX=$(echo $line| awk -F'|' '{print $1}')
GERMLINE_PFX=$(echo $line| awk -F'|' '{print $2}')
if [ ! -z $GERMLINE_PFX ] && [ $count -le 10 ] && [ ! -f $PURECN_G/${GERMLINE_PFX}_coverage_loess.txt.gz ] ; then
count=$((count+1))
echo $GERMLINE_PFX
Rscript $PURECN/Coverage.R \
--out-dir $PURECN_G \
--bam $SAVEROOT/$TUMOR_PFX/recal/bqsr_${GERMLINE_PFX}*.bam \
--intervals $OUT_REF/baits_hg19_intervals.txt
fi
done<$LIST_OF_SAMPLES
#Step 5 - calculate gc normalized coverages for tumors
while IFS='' read -r line
do
TUMOR_PFX=$(echo $line| awk -F'|' '{print $1}')
if [ ! -z $TUMOR_PFX ] && [ ! -f $PURECN_T/${TUMOR_PFX}_coverage_loess.txt.gz ]; then
echo $TUMOR_PFX
Rscript $PURECN/Coverage.R \
--out-dir $PURECN_T \
--bam $SAVEROOT/$TUMOR_PFX/recal/bqsr_${TUMOR_PFX}*.bam \
--intervals $OUT_REF/baits_hg19_intervals.txt
fi
done<$LIST_OF_SAMPLES
#Step 6 - create normal db
ls -a $PURECN_G/*_loess.txt.gz | cat > $PURECN_G/normal_coverages.list
ulimit -n 5000
#For a Mutect2/GATK4 normal panel GenomicsDB (beta)
Rscript $PURECN/NormalDB.R --out-dir $PURECN_G \
--coverage-files $PURECN_G/normal_coverages.list \
--normal-panel $PURECN_G/pon_db \
--genome hg19
#Step 7 - calculate tumor cellularity and ploidy
while IFS='' read -r line
do
TUMOR_PFX=$(echo $line| awk -F'|' '{print $1}')
if [ ! -z $TUMOR_PFX ] && ! grep $TUMOR_PFX $PURECN_RESULTS/purecn_unpaired_tumor_cellularity_ploidy.txt; then
echo $TUMOR_PFX
#For each tumor bam files
# Production pipeline run (dont need stats file if Mutect2 was run)
Rscript $PURECN/PureCN.R --out $PURECN_T \
--tumor $PURECN_T/bqsr_${TUMOR_PFX}_T_coverage_loess.txt.gz \
--sampleid $TUMOR_PFX \
--vcf $PURECN_T/${TUMOR_PFX}.vcf.gz \
--fun-segmentation PSCBS \
--normaldb $PURECN_G/normalDB_hg19.rds \
--mapping-bias-file $PURECN_G/mapping_bias_hg19.rds \
--intervals $OUT_REF/baits_hg19_intervals.txt \
--genome hg19 \
--model betabin \
--force --post-optimize --seed 123
My assay is targeted capture with an off target read of 30% and but if I specify --off-target in Step 1, in Step 6, I see an error:
INFO [2022-07-12 03:14:00] Removing 15793 intervals with low coverage in normalDB.
FATAL [2022-07-12 03:14:00] No intervals left after coverage checks.
FATAL [2022-07-12 03:14:00]
FATAL [2022-07-12 03:14:00] This is most likely a user error due to invalid input data or
FATAL [2022-07-12 03:14:00] parameters (PureCN 2.2.0).
Error: No intervals left after coverage checks.
This is most likely a user error due to invalid input data or
parameters (PureCN 2.2.0).
Execution halted
If I exclude --off-target in Step 1, the pipeline runs but produces different cellularity and ploidy estimates from that produced by Sequenza.
Hoping to get some idea on whether I am following the right steps for my pipeline, thank you!