Hi all,
I would like to get some input on whether I am running PureCN with CNVkit's outputs correctly. The reason I ask is because when I use 2 different sets of 10 process matched germlines, I am getting wildly different results for paired tumors' cellularity and ploidy results.
Here is what I am doing:
CNVKIT
python $CNVKIT/cnvkit.py batch $CNV_BAMS/_T.bam --normal $CNV_BAMS/_G.bam --targets $BEDFILE_0B --fasta $REF_GENOME_b37 --access $CNVKIT/data/access-5k-mappable.grch37.bed --output-reference $CNV_BAMS/my_reference.cnn --output-dir $CNV_T_RESULTS --diagram --scatter -p 8
python $CNVKIT/cnvkit.py batch $CNV_BAMS/*_G.bam -r $CNV_BAMS/my_reference.cnn -p 8 --scatter --diagram -d $CNV_G_RESULTS
PURECN WORKFLOW
1) Run Mutect2 for first 10 process matched germline sample
$GATK_PATH Mutect2 \ -R $REF_GENOME_b37 \ -I $CNV_BAMS/${GERMLINE_PFX}_G.bam \ --max-mnp-distance 0 \ -O $CNV_G_RESULTS_PURECN/${GERMLINE_PFX}.vcf.gz
2) Create a Genomics DB from the normal Mutect2 calls
$GATK_PATH GenomicsDBImport \ -R $REF_GENOME_b37 \ -L $BEDFILE_0B \ --genomicsdb-workspace-path $CNV_G_RESULTS_PURECN/pon_db \ --sample-name-map $CNV_G_RESULTS_PURECN/G_sample_map
3) Combine the normal calls using CreateSomaticPanelOfNormals
$GATK_PATH CreateSomaticPanelOfNormals \ -R $REF_GENOME_b37 \ --germline-resource $GERMLINE_RESOURCE \ -V gendb://$CNV_G_RESULTS_PURECN/pon_db \ -O $CNV_G_RESULTS_PURECN/pon.vcf.gz
4) For each tumor sample in run:
Create .seg file
python $CNVKIT/cnvkit.py export seg $CNV_T_RESULTS/${TUMOR_PFX}_T.cns --enumerate-chroms -o $CNV_T_RESULTS_PURECN/${TUMOR_PFX}.seg
Run Mutect2
$GATK_PATH Mutect2 \ -R $REF_GENOME_b37 \ -I $CNV_BAMS/${TUMOR_PFX}_T.bam \ -pon $CNV_G_RESULTS_PURECN/pon.vcf.gz \ --germline-resource $GERMLINE_RESOURCE\ --af-of-alleles-not-in-resource 0.0000025 \ --genotype-germline-sites true \ -L $BEDFILE_0B \ -O $CNV_T_RESULTS_PURECN/${TUMOR_PFX}_unmatched.vcf.g
Filter Mutect2 vcf:
$GATK_PATH FilterMutectCalls \ -R $REF_GENOME_b37 \ -V $CNV_T_RESULTS_PURECN/${TUMOR_PFX}_unmatched.vcf.gz \ -O $CNV_T_RESULTS_PURECN/${TUMOR_PFX}_filtered.vcf.gz
Run PureCN by providing the .cnr and .seg files:
$RSCRIPT_PATH $PURECN/PureCN.R \ --out $CNV_T_RESULTS_PURECN/${TUMOR_PFX} \ --sampleid ${TUMOR_PFX}\ --tumor $CNV_T_RESULTS/${TUMOR_PFX}_T.cnr \ --segfile $CNV_T_RESULTS_PURECN/${TUMOR_PFX}.seg \ --vcf $CNV_T_RESULTS_PURECN/${TUMOR_PFX}_filtered.vcf.gz \ --genome hg19 \ --force --postoptimize --seed 123 \ --funsegmentation Hclust
Versions I am using- cnvkit 0.9.5, pureCN 1.21.0, Python 2.7, R3.5.1, Mutect2 from GATK 4.1.9.0 When I use different sets of 10 germline samples from my run to test the tumor cellularity and ploidy, the cellularities I get vary more than 20%. Is there anything incorrect about the workflow?
Thank you
Can you add the PureCN log file? The paired and unpaired should return the same purity and ploidy in vast majority of cases.
Hi Dr.Riester, What I meant was that I had 12 pairs of matched tumor-germline samples in a run and for the first try, I used the first 10 germline samples to form a pooled reference while in the second time, I used the last 10 germline samples to form a pooled reference. The cellularities and ploidies for the tumors in this run were not consistent for both runs and I was wondering if my workflow/commands were incorrect.
So only a difference of 2 normals? Either way, there shouldn’t be a huge variance in purity. I would recommend to generate a mapping bias database using the GenomicsDBImport output. The Docker image has the dependencies installed, see the best practices vignette. But there are likely more issues, I would need the log files for that.