Question

PureCN docker workflow

1

Entering edit mode

2.3 years ago

biologist ▴ 20

Hello everyone, I am currently using the Docker image for PureCN and this is my workflow. Could anyone advise me on whether this workflow is right? The reason I ask this is because the results generated using this workflow are significantly different from the results I get from using Sequenza for 10 paired tumor-germline samples:

#Step 1 - Generating interval files

Rscript $PURECN/IntervalFile.R --in-file $BAIT_COORDINATES \
    --fasta $REF_GENOME_b37 --out-file $OUT_REF/baits_hg19_intervals.txt \
    --off-target --genome hg19 \
    --export $OUT_REF/baits_optimized_hg19.bed  \
    --mappability $MAPPABILITY_FILE_B37

#Step 2 - Run mutect on 10 germline samples, create genomics db and use CreateSomaticPanelOfNormals

> $PURECN_G/G_sample_map

        count=1
        while IFS='' read -r line
        do
        if [ ! -z $GERMLINE_PFX ] && [ ! -f $PURECN_G/${GERMLINE_PFX}.vcf.gz ] && [ $count -le 10 ]; then
                count=$((count+1))
                echo $GERMLINE_PFX
                #For each germline bam files
                echo "Running Mutect2 on germline bams.................."

                #Run mutect on 10 germline samples
                gatk Mutect2 \
                -R $REF_GENOME_b37 \
                -I $SAVEROOT/$TUMOR_PFX/recal/bqsr_${GERMLINE_PFX}*.bam \
                --max-mnp-distance 0 \
                --genotype-germline-sites true \
                --genotype-pon-sites true \
                --interval-padding 75 \
                --L $BEDFILE_0B \
                -O $PURECN_G/${GERMLINE_PFX}.vcf.gz

                #Create a map file of all germline vcfs
                echo -e "$GERMLINE_PFX\t$PURECN_G/${GERMLINE_PFX}.vcf.gz" >> $PURECN_G/G_sample_map
        fi

       done<$LIST_OF_SAMPLES

        #Create a Genomics DB from the normal Mutect2 calls 
        gatk GenomicsDBImport \
        -R $REF_GENOME_b37 \
        -L $BEDFILE_0B \
        --genomicsdb-workspace-path $PURECN_G/pon_db \
        --sample-name-map $PURECN_G/G_sample_map

        #Combine the normal calls using CreateSomaticPanelOfNormals
        gatk CreateSomaticPanelOfNormals \
        -R $REF_GENOME_b37 \
        --germline-resource $GERMLINE_RESOURCE \
        -V gendb://$PURECN_G/pon_db \
        -O $PURECN_G/pon.vcf.gz

#Step 3 - Run Mutect on unmatched tumor samples

while IFS='' read -r line
        do
                if [ ! -z $TUMOR_PFX ]  && [ ! -f $PURECN_T/${TUMOR_PFX}.vcf.gz ]; then
                        #Run Mutect2 for each tumor sample
                        gatk Mutect2 \
                        -R $REF_GENOME_b37 \
                        -I $SAVEROOT/$TUMOR_PFX/recal/bqsr_${TUMOR_PFX}*.bam \
                        -pon $PURECN_G/pon.vcf.gz \
                        --germline-resource $GERMLINE_RESOURCE\
                        --genotype-germline-sites true \
                        --genotype-pon-sites true \
                        --interval-padding 75 \
                        -L $BEDFILE_0B \
                        -O $PURECN_T/${TUMOR_PFX}.vcf.gz

                fi

       done<$LIST_OF_SAMPLES

#Step 4 - calculate gc normalized coverages for germlines

count=1
            while IFS='' read -r line
            do
            TUMOR_PFX=$(echo $line| awk -F'|' '{print $1}')
            GERMLINE_PFX=$(echo $line| awk -F'|' '{print $2}')
            if [ ! -z $GERMLINE_PFX ] && [ $count -le 10 ] && [ ! -f $PURECN_G/${GERMLINE_PFX}_coverage_loess.txt.gz ] ; then
                    count=$((count+1))
                    echo $GERMLINE_PFX

                    Rscript $PURECN/Coverage.R \
                    --out-dir $PURECN_G \
                    --bam $SAVEROOT/$TUMOR_PFX/recal/bqsr_${GERMLINE_PFX}*.bam \
                    --intervals $OUT_REF/baits_hg19_intervals.txt
            fi

           done<$LIST_OF_SAMPLES

#Step 5 - calculate gc normalized coverages for tumors

while IFS='' read -r line
        do
        TUMOR_PFX=$(echo $line| awk -F'|' '{print $1}')
        if [ ! -z $TUMOR_PFX ] && [ ! -f $PURECN_T/${TUMOR_PFX}_coverage_loess.txt.gz ]; then
                echo $TUMOR_PFX
                Rscript $PURECN/Coverage.R \
                --out-dir $PURECN_T \
                --bam $SAVEROOT/$TUMOR_PFX/recal/bqsr_${TUMOR_PFX}*.bam \
                --intervals $OUT_REF/baits_hg19_intervals.txt
        fi

        done<$LIST_OF_SAMPLES

#Step 6 - create normal db

ls -a $PURECN_G/*_loess.txt.gz | cat > $PURECN_G/normal_coverages.list
        ulimit -n 5000
        #For a Mutect2/GATK4 normal panel GenomicsDB (beta)
        Rscript $PURECN/NormalDB.R --out-dir $PURECN_G \
        --coverage-files $PURECN_G/normal_coverages.list \
        --normal-panel $PURECN_G/pon_db \
        --genome hg19

#Step 7 - calculate tumor cellularity and ploidy

while IFS='' read -r line
        do
                TUMOR_PFX=$(echo $line| awk -F'|' '{print $1}')

        if [ ! -z $TUMOR_PFX ] && ! grep $TUMOR_PFX $PURECN_RESULTS/purecn_unpaired_tumor_cellularity_ploidy.txt; then
                echo $TUMOR_PFX
                #For each tumor bam files
                # Production pipeline run (dont need stats file if Mutect2 was run)
                Rscript $PURECN/PureCN.R --out $PURECN_T \
                --tumor $PURECN_T/bqsr_${TUMOR_PFX}_T_coverage_loess.txt.gz \
                --sampleid $TUMOR_PFX \
                --vcf $PURECN_T/${TUMOR_PFX}.vcf.gz \
                --fun-segmentation PSCBS \
                --normaldb $PURECN_G/normalDB_hg19.rds \
                --mapping-bias-file $PURECN_G/mapping_bias_hg19.rds \
                --intervals $OUT_REF/baits_hg19_intervals.txt \
                --genome hg19 \
                --model betabin \
                --force --post-optimize --seed 123

My assay is targeted capture with an off target read of 30% and but if I specify --off-target in Step 1, in Step 6, I see an error:

INFO [2022-07-12 03:14:00] Removing 15793 intervals with low coverage in normalDB.
FATAL [2022-07-12 03:14:00] No intervals left after coverage checks.
FATAL [2022-07-12 03:14:00]
FATAL [2022-07-12 03:14:00] This is most likely a user error due to invalid input data or
FATAL [2022-07-12 03:14:00] parameters (PureCN 2.2.0).
Error: No intervals left after coverage checks.
This is most likely a user error due to invalid input data or
parameters (PureCN 2.2.0).
Execution halted

If I exclude --off-target in Step 1, the pipeline runs but produces different cellularity and ploidy estimates from that produced by Sequenza.

Hoping to get some idea on whether I am following the right steps for my pipeline, thank you!

purecn • 692 views

ADD COMMENT • link 2.3 years ago by biologist ▴ 20