purecn with cnvkit
2
0
Entering edit mode
4.0 years ago
ww22runner ▴ 60

Hello everyone,

when I try the last step of purecn using cnvkit outputs:

                    $RSCRIPT_PATH $PURECN/PureCN.R \
                    --out  $CNV_T_RESULTS/${TUMOR_PFX}  \
                    --sampleid ${TUMOR_PFX}\
                    --tumor $CNV_T_RESULTS/${TUMOR_PFX}_T.cnr \
                    --segfile $CNV_T_RESULTS/${TUMOR_PFX}.seg \
                    --vcf $CNV_T_RESULTS/${TUMOR_PFX}.vcf.gz \
                    --genome hg19 \
                    --force --postoptimize --seed 123

I get this error:

> log2-ratio in tumor coverage data. WARN [2020-11-10 23:04:51] Allosome
> coverage missing, cannot determine sex. WARN [2020-11-10 23:04:51]
> Allosome coverage missing, cannot determine sex. INFO [2020-11-10
> 23:04:51] Using 8016 intervals (8016 on-target, 0 off-target). INFO
> [2020-11-10 23:04:51] No off-target intervals. If this is
> hybrid-capture data, consider adding them. INFO [2020-11-10 23:04:51]
> Loading VCF... INFO [2020-11-10 23:04:51] Found 837 variants in VCF
> file. INFO [2020-11-10 23:04:51] Removing 150 triallelic sites. WARN
> [2020-11-10 23:04:52] vcf.file has no DB info field for membership in
> germline databases. Found and used valid population allele frequency >
> 0.001000 instead. INFO [2020-11-10 23:04:52] 447 (65.1%) variants annotated as likely germline (DB INFO flag). INFO [2020-11-10
> 23:04:53] AR2T.060A is tumor in VCF file. INFO [2020-11-10 23:04:53] 0
> homozygous and 4 heterozygous variants on chrX. INFO [2020-11-10
> 23:04:53] Sex from VCF: F (Fisher's p-value: 0.32, odds-ratio: 0.00).
> INFO [2020-11-10 23:04:53] Detected MuTect2 VCF. INFO [2020-11-10
> 23:04:53] Removing 0 MuTect2 calls due to blacklisted failure reasons.
> INFO [2020-11-10 23:04:54] Initial testing for significant sample
> cross-contamination: unlikely INFO [2020-11-10 23:04:55] Removing 332
> variants with AF < 0.030 or AF >= 0.970 or less than 4 supporting
> reads or depth < 15. Error in which(as.numeric(geno(vcf)$MBQ[,
> tumor.id.in.vcf]) >= min.base.quality) :   (list) object cannot be
> coerced to type 'double' Calls: runAbsoluteCN ... filterVcfMuTect2 ->
> filterVcfBasic -> .filterVcfByBQ -> which Execution halted

I used Mutect2, GATK4 and purecn 1.12.2. Can anyone advice me on what to do? Thank you

cnvkit purecn • 3.8k views
ADD COMMENT
0
Entering edit mode
4.0 years ago

Can you update to the latest version 1.20.0? See https://github.com/lima1/PureCN#installation

ADD COMMENT
0
Entering edit mode

Hi Markus,

Thank you for your reply. I do not have the option of upgrading my current R version which is 3.5.1 and I can only install pureCN 1.12.2 on this version of R. Is it possible to work around this? I am also not clear on whether my workflow is correct. I would like to run PureCN with Mutect2, GATK4 and integrate cnvkit results with PureCN.

Currently what I do is I create a pooled reference by: 1) running each germline (unmatched but process matched) sample in Mutect2 tumor mode, and then run SomaticPanelOFNormals on this.

$GATK Mutect2 \
 -R $REF_GENOME_b37 \
 -I  ${GERMLINE_PFX}_G.bam \
 -tumor $GERMLINE_PFX\
 --germline-resource af-only-gnomad.raw.sites.vcf \
 -O  ${GERMLINE_PFX}_for_pon.vcf.gz

ls  *_pon.vcf.gz >> normals_for_pon_vcf.args

 $GATK CreateSomaticPanelOfNormals \
 -vcfs normals_for_pon_vcf.args \
 -O pon.vcf.gz

2) using bcftools merge to merge all the Mutect2 vcf files for the germlines to create one merged germline.vcf. I then compress using bgzip and index using tabix.

bcftools-1.9/bcftools merge $(ls *_pon.vcf.gz) -Oz -o pon.vcf.gz
bcftools-1.9/bcftools index -t pon.vcf.gz

3) I run this step:

Rscript $PURECN/NormalDB.R --outdir $OUT --normal_panel pon.vcf.gz --assay agilent_v6 --genome hg19 --force

4) I create a mapping bias file using an R script

 Rscript  create_mapping_bias_file.r

contents of create_mapping_bias_file.r:

library("PureCN")
normal.panel.vcf.file <- "pon.vcf.gz"
bias <- calculateMappingBiasVcf(normal.panel.vcf.file, genome = "h19")
saveRDS(bias, "mapping_bias.rds")

5) I run this command for every tumor sample:

cnvkit.py export seg ${TUMOR_PFX}_T.cns --enumerate-chroms -o ${TUMOR_PFX}.seg

5) I run mutect2 on every single tumor sample

$GATK Mutect2 \
 -R $REF_GENOME_b37 \
 -I ${TUMOR_PFX}_T.bam \
 -tumor $TUMOR_PFX\
 --panel-of-normals pon.vcf.gz \
 --germline-resource af-only-gnomad.raw.sites.vcf \
-O ${TUMOR_PFX}.vcf.gz

6) I run pureCN

Rscript $PURECN/PureCN.R \
                        --out  ${TUMOR_PFX}  \
                        --sampleid ${TUMOR_PFX}\
                        --tumor ${TUMOR_PFX}_T.cnr \
                        --segfile ${TUMOR_PFX}.seg \
                        --vcf ${TUMOR_PFX}.vcf.gz \
                        --genome hg19 \
                        --force --postoptimize --seed 123
                        --normaldb mapping_bias_agilent_v6_hg19.rds\

In the last step, somehow I get the error that the

seg file does not match the normal db

and if I remove that, I get the error above ((list) object cannot be

coerced to type 'double' Calls: runAbsoluteCN ... filterVcfMuTect2 -> filterVcfBasic -> .filterVcfByBQ -> which Execution halted).

I have searched extensively for documentation on how to integrate Mutect2 and GATK4 with PureCN while using cnvkit output and I could not get much information and so I thought documenting it here and asking for advice might help others in future. Thank you.

ADD REPLY
0
Entering edit mode

You can install the latest version from GitHub. If BiocManager is not available for your R version, try devtools::install_github. PureCN 1.20.0 is the only version that works with GATK4. I would also recommend updating GATK4.

ADD REPLY
0
Entering edit mode

Hi Markus,

Thank you for your reply. I am now running PureCN 1.21.0 but still see the same error in the last step:

INFO [2020-11-12 17:49:59] Loading PureCN 1.21.0... INFO [2020-11-12 17:49:59] ------------------------------------------------------------ INFO [2020-11-12 17:49:59] PureCN 1.21.0 INFO [2020-11-12 17:49:59] ------------------------------------------------------------ INFO [2020-11-12 17:49:59] Arguments: -normal.coverage.file -tumor.coverage.file Sample1_T.cnr -log.ratio -seg.file Sample1.seg -vcf.file Sample1.vcf.gz -normalDB -genome hg19 -sex ? -args.setPriorVcf 6 -args.setMappingBiasVcf mapping_bias_agilent_v6_hg19.rds -args.segmentation 0.005,NULL -sampleid Sample1 -min.ploidy 1.4 -max.ploidy 6 -max.non.clonal 0.2 -max.homozygous.loss 0.05,1e+07 -log.ratio.calibration 0.1 -model.homozygous FALSE -error 0.001 -interval.file -max.segments 300 -plot.cnv TRUE -vcf.field.prefix PureCN. -DB.info.flag DB -POPAF.info.field POP_AF -model beta -post.optimize TRUE -BPPARAM -log.file Sample1.log -args.filterVcf <data> -fun.segmentation <data> -test.num.copy <data> -test.purity <data> -speedup.heuristics <data> INFO [2020-11-12 17:49:59] Loading coverage files... INFO [2020-11-12 17:50:00] Found log2-ratio in tumor coverage data. WARN [2020-11-12 17:50:00] log2-ratio contains outliers < -8, ignoring them... WARN [2020-11-12 17:50:00] Allosome coverage missing, cannot determine sex. WARN [2020-11-12 17:50:00] Allosome coverage missing, cannot determine sex. INFO [2020-11-12 17:50:00] Removing 1 intervals with missing log.ratio. INFO [2020-11-12 17:50:00] Using 8015 intervals (8015 on-target, 0 off-target). INFO [2020-11-12 17:50:00] No off-target intervals. If this is hybrid-capture data, consider adding them. INFO [2020-11-12 17:50:00] Loading VCF... INFO [2020-11-12 17:50:01] Found 837 variants in VCF file. INFO [2020-11-12 17:50:01] Removing 150 triallelic sites. WARN [2020-11-12 17:50:01] vcf.file has no DB info field for membership in germline databases. Found and used valid population allele frequency > 0.001000 instead. INFO [2020-11-12 17:50:01] 447 (65.1%) variants annotated as likely germline (DB INFO flag). INFO [2020-11-12 17:50:02] Sample1 is tumor in VCF file. INFO [2020-11-12 17:50:02] 0 homozygous and 4 heterozygous variants on chrX. INFO [2020-11-12 17:50:02] Sex from VCF: F (Fisher's p-value: 0.32, odds-ratio: 0.00). INFO [2020-11-12 17:50:02] Detected MuTect2 VCF. INFO [2020-11-12 17:50:02] Removing 0 Mutect2 calls due to blacklisted failure reasons. INFO [2020-11-12 17:50:03] Initial testing for significant sample cross-contamination: unlikely INFO [2020-11-12 17:50:04] Removing 332 variants with AF < 0.030 or AF >= 0.970 or less than 4 supporting reads or depth < 15. Error in .filterVcfByBQ(vcf, tumor.id.in.vcf, min.base.quality) : (list) object cannot be coerced to type 'double' Calls: runAbsoluteCN ... filterVcfMuTect2 -> filterVcfBasic -> .filterVcfByBQ Execution halted

Could you advice on whether I might be making a mistake anywhere in the workflow I am following? Thank you!

ADD REPLY
0
Entering edit mode

Hi, I don't know if I'm correct here.

One suggestion: At step3 you have mentioned above NormalDB.R seems to be internally calling the calculateMappingBiasVcf() to create a mapping bias file. So, I think you don't need to perform the step4 using create_mapping_bias_file.r again.

Also, according to the vignette, --normaldb and --mappingbiasfile are two different argument and needs different inputs. But as per the step6 you have showed above, you seems to be using the wrong argument and input combination like, --normaldb mapping_bias_agilent_v6_hg19.rds, but I guess it supposed to be --mappingbiasfile mapping_bias_agilent_v6_hg19.rds

Actually, I just started working on this tool, not sure if I'm correct. Hope this helps.

ADD REPLY
0
Entering edit mode
4.0 years ago

Hm, it should not crash, but can you try the complete Mutect2 workflow with the latest GATK 4?https://gatk.broadinstitute.org/hc/en-us/articles/360035531132

ADD COMMENT
0
Entering edit mode

Hi Markus,

Thank you, I updated to GATK 4.1.9.0. Here are the steps that I have followed and the error I get:

1) I create a pooled reference by: 1) running each germline (unmatched but process matched) sample in Mutect2 tumor mode, and then run SomaticPanelOFNormals on this. Here I used the link you provided above.

For every germline sample:

$GATK Mutect2 \
 -R $REF_GENOME_b37 \
 -I ${GERMLINE_PFX}_G.bam \
 --max-mnp-distance 0 \
 -O ${GERMLINE_PFX}.vcf.gz

then:

$GATK GenomicsDBImport \
 -R $REF_GENOME_b37 \
-L $BEDFILE \
 --genomicsdb-workspace-path pon_db \
 -V ${GERMLINE_PFX}.vcf.gz \ #sample 1 germline
.
.
.
 -V${GERMLINE_PFX}.vcf.gz \ #sample 9 germline

and

$GATK CreateSomaticPanelOfNormals \
 -R $REF_GENOME_b37 \
 --germline-resource af-only-gnomad.raw.sites.vcf \
 -V gendb://pon_db \
 -O pon.vcf.gz

2) I run:

Rscript $PURECN/NormalDB.R --outdir $OUT --normal_panel pon.vcf.gz --assay agilent_v6 --genome hg19 --force

Here I get the error:

INFO [2020-11-13 12:48:51] Loading PureCN 1.21.0... INFO [2020-11-13 12:49:03] Creating mapping bias database. INFO [2020-11-13 12:49:04] Processing variants 1 to 5000... FATAL [2020-11-13 12:49:04] The normal.panel.vcf.file contains only a single sample.

FATAL [2020-11-13 12:49:04]

FATAL [2020-11-13 12:49:04] This is most likely a user error due to invalid input data or

FATAL [2020-11-13 12:49:04] parameters (PureCN 1.21.0).

Error: The normal.panel.vcf.file contains only a single sample.

This is most likely a user error due to invalid input data or parameters (PureCN 1.21.0). In addition: Warning message: In .vcf_usertag(map, tag, nm, verbose) : ScanVcfParam ‘geno’ fields not found in header: ‘AD’ Execution halted

Would you happen to know why this error occurs? Thank you.

ADD REPLY
0
Entering edit mode

Yes, the pon.vcf.gz unfortunately does not provide detailed enough information. Ideally, install the https://github.com/nalinigans/GenomicsDB-R. There is a shell script on GitHub that installs it (from the same author).

This will make PureCN be able to read the pon_db.

ADD REPLY
0
Entering edit mode

Thank you Markus, may I ask in "remotes::install_github("nalinigans/GenomicsDB-R", ref="<github_branch>", configure.args="--with-genomicsdb=<path_to_genomicsdb>")" what is the path to genomics db? If I only try "remotes::install_github("nalinigans/GenomicsDB-R" in R, I get the error: configure: error: GenomicsDB installation not found. Invoke with either $GENOMICSDB_HOME env set to the installation path or by explicitly specifying the path with the --with-genomicsdb argument

ADD REPLY
0
Entering edit mode

Sorry for the late response. You will also need to install the main GenomicsDB repository. Once this compiled successfully, add that path in configure.args. Follow their install instructions closely, they did a great job there.

I was planning to release a docker file with those added, but might take a few more weeks.

ADD REPLY
0
Entering edit mode

Hi, I basically met all the errors that you produced here. Did you manage it in the end? I am quite struggling with producing right vcf files and installing GenomicsDB-R.

ADD REPLY

Login before adding your answer.

Traffic: 2595 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6