Hello!
I am having some trouble running CNVkit 0.9.7, installed via conda in a dedicated environment, on a set of 10 control and 16 experimental (tumor) libraries which were prepared using a targeted amplicon panel. The batch command I am using is:
cnvkit.py batch -m amplicon $tumor_dir*.bam --normal $normal_dir*.bam \
-p 0 \
--targets $bed --fasta $hg19 \
--annotate $refFlat \
--scatter --diagram \
--output-reference reference.cnn --output-dir $out_dir
Because mode is 'amplicon' I am omitting the --access flag, but have also tried with --access using the targets bed file and the result is the same.
The first issue I encounter is that 116 bins failed filters, but looking at the log2 and spread they are within limits. Below are the first few examples - the first clearly exceeds the spread threshold, but the rest are within limits.
Targets: 116 (11.5538%) bins failed filters (log2 < -5.0, log2 > 5.0, spread > 1.0)
- chr1:36167979-36168099 log2=-4.026 spread=1.154 rmask=nan
" chr1:53307856-53307973 log2=-0.132 spread=0.078 rmask=nan
PTGER3 chr1:71477262-71477341 log2=-0.703 spread=0.113 rmask=nan
FUBP1 chr1:78432623-78432745 log2=-0.588 spread=0.146 rmask=nan
" chr1:78432750-78432869 log2=0.467 spread=0.105 rmask=nan
" chr1:78433305-78433419 log2=0.069 spread=0.095 rmask=nan
" chr1:78433832-78433916 log2=-0.269 spread=0.159 rmask=nan
The second issue I encounter is that executing the pipeline does not generate the .cns files or scatter and diagram plots. For simplicity, processing a single sample with the reference generated using the command above, the output in the CLI is:
$ ./run_cnvkit_noref.sh reference.cnn ../bam/tumor_bam/test_case test
CNVkit 0.9.7
Created directory test
Wrote test/reference.target-tmp.bed with 1004 regions
Wrote test/reference.antitarget-tmp.bed with 0 regions
Running 1 samples in 12 processes (that's 12 processes per bam)
Running the CNVkit pipeline on ../bam/tumor_bam/test_case.bam ...
Processing reads in test_case.bam
Time: 38.627 seconds (76320 reads/sec, 26 bins/sec)
Summary: #bins=1004, #reads=2948012, mean=2936.2671, min=26.099290780141843, max=39176.7304964539
Percent reads in regions: 78.873 (of 3737691 mapped)
Wrote test/test_case.targetcoverage.cnn with 1004 regions
Skip processing test_case.bam with empty regions file test/reference.antitarget-tmp.bed
Wrote test/test_case.antitargetcoverage.cnn with 0 regions
Processing target: test_case
Keeping 888 of 1004 bins
Correcting for GC bias...
Processing antitarget: test_case
/home/erik/anaconda3/envs/cnvkit/lib/python3.8/site-packages/numpy/lib/nanfunctions.py:1111: RuntimeWarning: Mean of empty slice
return np.nanmean(a, axis, out=out, keepdims=keepdims)
/home/erik/anaconda3/envs/cnvkit/lib/python3.8/site-packages/numpy/lib/nanfunctions.py:1111: RuntimeWarning: Mean of empty slice
return np.nanmean(a, axis, out=out, keepdims=keepdims)
Wrote test/test_case.cnr with 888 regions
Segmenting test/test_case.cnr ...
Segmenting with method 'cbs', significance threshold 0.0001, in 12 processes
Of course, because this is an amplicon based targeted panel, there are no antitargets, so the test_case.antitargetcoverage.cnn file consists of only a header. The output directory contains only the following:
test_case.antitargetcoverage.cnn
test_case.cnr
test_case.targetcoverage.cnn
reference.antitarget-tmp.bed
reference.target-tmp.bed
No .cns file or PDFs of plots.
I tried running version 0.9.6 in its own environment to see if that resolves either of these issues, but when I run the cnvkit.py command I get an immediate error regarding biopython:
$ cnvkit.py -h
/home/erik/anaconda3/envs/cnvkit_0.96/lib/python3.7/site-packages/skgenome/intersect.py:15: FutureWarning: pandas.core.index is deprecated and will be removed in a future version. The public classes are available in the top-level namespace.
from pandas.core.index import Int64Index
Traceback (most recent call last):
File "/home/erik/anaconda3/envs/cnvkit_0.96/bin/cnvkit.py", line 8, in <module>
from cnvlib import commands
File "/home/erik/anaconda3/envs/cnvkit_0.96/lib/python3.7/site-packages/cnvlib/__init__.py", line 5, in <module>
from .commands import *
File "/home/erik/anaconda3/envs/cnvkit_0.96/lib/python3.7/site-packages/cnvlib/commands.py", line 32, in <module>
from . import (access, antitarget, autobin, batch, call, core, coverage,
File "/home/erik/anaconda3/envs/cnvkit_0.96/lib/python3.7/site-packages/cnvlib/autobin.py", line 11, in <module>
from . import coverage, samutil
File "/home/erik/anaconda3/envs/cnvkit_0.96/lib/python3.7/site-packages/cnvlib/coverage.py", line 15, in <module>
from Bio._py3k import StringIO
ModuleNotFoundError: No module named 'Bio._py3k'
Any assistance appreciated!
-Erik
EDIT
The filtering of bins appears to be related to the use of the --fasta flag, which encodes the GC content of the bins in the .cnn files. Extracting the failed bins from the reference.cnn file and sorting by GC values there is a gap between 0.378788 and 0.619835 suggesting a cutoff around < 0.4 and > 0.6.