Question

CNVkit 0.9.7: problems with incorrectly filtered bins, batch command outputs

0

Entering edit mode

4.2 years ago

erikt ▴ 30

Hello!

I am having some trouble running CNVkit 0.9.7, installed via conda in a dedicated environment, on a set of 10 control and 16 experimental (tumor) libraries which were prepared using a targeted amplicon panel. The batch command I am using is:

cnvkit.py batch -m amplicon $tumor_dir*.bam --normal $normal_dir*.bam \
-p 0 \
--targets $bed --fasta $hg19 \
--annotate $refFlat \
--scatter --diagram \
--output-reference reference.cnn --output-dir $out_dir

Because mode is 'amplicon' I am omitting the --access flag, but have also tried with --access using the targets bed file and the result is the same.

The first issue I encounter is that 116 bins failed filters, but looking at the log2 and spread they are within limits. Below are the first few examples - the first clearly exceeds the spread threshold, but the rest are within limits.

Targets: 116 (11.5538%) bins failed filters (log2 < -5.0, log2 > 5.0, spread > 1.0)
-                   chr1:36167979-36168099    log2=-4.026  spread=1.154  rmask=nan
"                 chr1:53307856-53307973    log2=-0.132  spread=0.078  rmask=nan
PTGER3              chr1:71477262-71477341    log2=-0.703  spread=0.113  rmask=nan
FUBP1               chr1:78432623-78432745    log2=-0.588  spread=0.146  rmask=nan
"                 chr1:78432750-78432869    log2=0.467  spread=0.105  rmask=nan
"                 chr1:78433305-78433419    log2=0.069  spread=0.095  rmask=nan
"                 chr1:78433832-78433916    log2=-0.269  spread=0.159  rmask=nan

The second issue I encounter is that executing the pipeline does not generate the .cns files or scatter and diagram plots. For simplicity, processing a single sample with the reference generated using the command above, the output in the CLI is:

$ ./run_cnvkit_noref.sh reference.cnn ../bam/tumor_bam/test_case test
CNVkit 0.9.7
Created directory test
Wrote test/reference.target-tmp.bed with 1004 regions
Wrote test/reference.antitarget-tmp.bed with 0 regions
Running 1 samples in 12 processes (that's 12 processes per bam)
Running the CNVkit pipeline on ../bam/tumor_bam/test_case.bam ...
Processing reads in test_case.bam
Time: 38.627 seconds (76320 reads/sec, 26 bins/sec)
Summary: #bins=1004, #reads=2948012, mean=2936.2671, min=26.099290780141843, max=39176.7304964539 
Percent reads in regions: 78.873 (of 3737691 mapped)
Wrote test/test_case.targetcoverage.cnn with 1004 regions
Skip processing test_case.bam with empty regions file test/reference.antitarget-tmp.bed
Wrote test/test_case.antitargetcoverage.cnn with 0 regions
Processing target: test_case
Keeping 888 of 1004 bins
Correcting for GC bias...
Processing antitarget: test_case
/home/erik/anaconda3/envs/cnvkit/lib/python3.8/site-packages/numpy/lib/nanfunctions.py:1111: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
/home/erik/anaconda3/envs/cnvkit/lib/python3.8/site-packages/numpy/lib/nanfunctions.py:1111: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
Wrote test/test_case.cnr with 888 regions
Segmenting test/test_case.cnr ...
Segmenting with method 'cbs', significance threshold 0.0001, in 12 processes

Of course, because this is an amplicon based targeted panel, there are no antitargets, so the test_case.antitargetcoverage.cnn file consists of only a header. The output directory contains only the following:

test_case.antitargetcoverage.cnn
test_case.cnr
test_case.targetcoverage.cnn
reference.antitarget-tmp.bed
reference.target-tmp.bed

No .cns file or PDFs of plots.

I tried running version 0.9.6 in its own environment to see if that resolves either of these issues, but when I run the cnvkit.py command I get an immediate error regarding biopython:

$ cnvkit.py -h
/home/erik/anaconda3/envs/cnvkit_0.96/lib/python3.7/site-packages/skgenome/intersect.py:15: FutureWarning: pandas.core.index is deprecated and will be removed in a future version.  The public classes are available in the top-level namespace.
  from pandas.core.index import Int64Index
Traceback (most recent call last):
  File "/home/erik/anaconda3/envs/cnvkit_0.96/bin/cnvkit.py", line 8, in <module>
    from cnvlib import commands
  File "/home/erik/anaconda3/envs/cnvkit_0.96/lib/python3.7/site-packages/cnvlib/__init__.py", line 5, in <module>
    from .commands import *
  File "/home/erik/anaconda3/envs/cnvkit_0.96/lib/python3.7/site-packages/cnvlib/commands.py", line 32, in <module>
    from . import (access, antitarget, autobin, batch, call, core, coverage,
  File "/home/erik/anaconda3/envs/cnvkit_0.96/lib/python3.7/site-packages/cnvlib/autobin.py", line 11, in <module>
    from . import coverage, samutil
  File "/home/erik/anaconda3/envs/cnvkit_0.96/lib/python3.7/site-packages/cnvlib/coverage.py", line 15, in <module>
    from Bio._py3k import StringIO
ModuleNotFoundError: No module named 'Bio._py3k'

Any assistance appreciated!

-Erik

EDIT

The filtering of bins appears to be related to the use of the --fasta flag, which encodes the GC content of the bins in the .cnn files. Extracting the failed bins from the reference.cnn file and sorting by GC values there is a gap between 0.378788 and 0.619835 suggesting a cutoff around < 0.4 and > 0.6.

cnvkit cnv targeted amplicon panel • 1.4k views

ADD COMMENT • link 4.2 years ago by erikt ▴ 30