Dear all, I am trying to use CNVkit to detect copy number variant in my samples. My sequencing data is derived from peripheral blood so I am looking for germline copy number variants. I don't have tumor sample, only germline.
I followed the suggestions of the documentation and here is the flow of the pipeline that I used to analyse my data:
python3 /usr/local/cluster/bin/cnvkit.py batch germline_sample.bam --n --targets brca_slop.bed --fasta hg19.fa --output-reference germline_sample.cnn --output-dir cnvkit_brca/
python3 /usr/local/cluster/bin/cnvkit.py segmetrics -s germline_sample.cn{s,r} --ci
python3 /usr/local/cluster/bin/cnvkit.py call germline_sample.segmetrics.cns --filter ci -o germline_sample.calls.cns
As test I used a sample that I know has a deletion of the exon 23 in BRCA1 gene. But I did not found any result:
python3 /usr/local/cluster/bin/cnvkit.py breaks germline_sample.cnr germline_sample.calls.cns
Found 0 gene breakpoints
gene chromosome location change probes_left probes_right
python3 /usr/local/cluster/bin/cnvkit.py gainloss germline_sample.cnr -s germline_sample.calls.cns
*WARNING* No chrX found in probes; check the input
Treating sample gremline_sample as male
*WARNING* No chrX found in probes; check the input
*WARNING* No chrX found in probes; check the input
*WARNING* No chrX found in probes; check the input
Found 2 gene-level gains and losses
gene chromosome start end log2 depth weight cn probes
BRCA2 chr13 32890547 32972958 -0.243229 418.202 36.7767 2 269
BRCA1 chr17 41197644 41276164 -0.795891 284.416 22.3257 1 309
Definitely I am missing something. Is my flow correct? Can someone help me in understand what is wrong and how I have to use CNVkit on my data? Thanks for the help.
Stefania
Hi Eric, Many thanks for your answer. I am not using whole exome data but I have shorter sequencing panel. I tried to analyse data as you suggested creating a pooled reference using the other samples as normal samples (in total 12 samples)
Then I used your new script to investigate log2 value, but I obtained 0 significant results. I also tried increasing the threshold with -a argument:
Do you think that it is a problem related with the limited dimension of the sequencing panel? Or, as you already suggested, the deletion is too small to be found by CNVkit. If CNVkit is not the correct tool to use in my case, do you have any suggestion about other tools?
Thank you so much for the help.
Stefania
The small panel size shouldn't be a problem. But this test isn't especially sensitive. A couple questions:
Was this sequencing done by hybrid capture or targeted amplicon sequencing? The small panel size and small number of off-target bins suggests the latter. If so, use the
batch
command option--method amplicon
to reduce the chance of errata in your output files.What do you see if you use an unreasonably high alpha, e.g.
-a 0.9
? The output .cnr file will have an extra field calledztest
, listing the p-values for each bin. If these are all 1.0, then something went wrong -- check brca.reference.cnn to see if thespread
column there is all zeros, because that would mean the pooled reference was not built properly and the observed variance at each bin was not recorded for the Z-test. If you do see decimal p-values in your output .cnr file, you can jump to BRCA1 exon 23 and see the p-value there.Note that this script adjusts p-values for multiple hypothesis tests, so it's possible that the log2 read depth ratio at BRCA1 exon 23 was not significantly below the range of normal samples at that site. You can do some of the math by hand: in brca.reference.cnn, find the bin of interest, then look at the values in the
log2
andspread
columns. If thelog2
value in your test sample was less than the corresponding referencelog2
value minus twice thespread
value, then the Z-test has a chance of picking it up, otherwise the script will probably miss it.One more thought: Since BRCA1 is already detected as hemizygous in this sample, you can use the
-s
option tocnv_ztest.py
to make the p-values relative to the surrounding segment, rather than to the genome-wide neutral copy number value (i.e. diploid). That will actually make the test less sensitive to a deep deletion within BRCA1, but in any case, it's relevant.You should still be able to see the BRCA1 exon 23 in a scatter plot:
If it's not showing up there, that bin may have been filtered out earlier in the pipeline for reasons that may be clear if you post the corresponding row in brca.reference.cnn here.
Hi Eric, sorry for my late answer!
I did again the analysis, this time using the whole sequencing panel and not only the BRCA regions. I also renamed the germline.sample file with the code 1294-15. Here are my commands:
I looked at 1294-15_S3.reference and the spread column in the BRCA1 region, values goes from 0.07 to 0.23. This are the rows in the 1294-15.reference:
The bin of interest is this one:
As you can see the dimension of the exon is 226 bp, probably as you suggested before it is too small to be detected by the method. Otherwise I was wondering if could be a coverage problem even if it is high because it is about 200X. I also uploaded the scatter plot: Following the documentation and your suggestions, the flow of the analysis seems correct to me. What do you think? Is there something that I am missing? At the moment I don't have other known samples to use as test for the analysis pipeline, but if you think that it is correct I will try this pipeline using all the samples we sequenced and see if something came up. Thanks for your support and sorry for the late answer!
It looks like your link to the BRCA1 scatter plot didn't work, could you post the link directly?
In your sample .cnr file, can you post the bin corresponding to exon 23 (matching the reference.cnn bin you posted)? I notice that in the reference, the baseline log2 value is about -.23 and the spread is 0.55, so as a rule of thumb the log2 value of a bin to be flagged by the z-test would need to be below (-.23 - 2*.55) = -1.33. With the multiple-test correction for 1942 targeted bins in your panel, it needs to be far below that. Is it?
As a sanity check, you can also look at the "depth" column relative to your overall on-target coverage (~200x). Hemizygous regions in a germline ought to be on the order of 100x coverage, while homozygous deletions should be 0 or very close to it. Do you see that in your data?
Hi Eric, here is the link of the scatter plot. I hope it now works. https://ibb.co/d1Km1G Here is the bin in the sample.cnr file
As you can see the log2 value is -0.12 and the coverage reported is quite high (439). It seems that the data is not suggesting the deletion, even if we know it is there... I also tried to use the gainloss as suggested in the doc to find the list of trusted genes :
Does this means that actually cnvkit is finding something on BRCA1? If so, is the corresponding bin the one returned by
If yes, it seems to be the result that I expected but why breaks does not find anything? Sorry to bother you but can you please better clarify the columns of gainloss result and how to interpret the scatter plot? I want to apply CNVkit to another set of data in which I do not know anything so I am trying to define a general pipeline of analysis and how to interpret the result of CNVkit to go further with investigation. Is there a sort of "rule" that I can apply that can help me in finding potential interesting bins?
Many many thanks for the help and the support.
Stefania
The
gainloss
command doesn't do special calculations, it just slices up the .cnr and .cns files so that bins are grouped by gene name -- see here. The default threshold is a permissive log2 amplitude of +/- 0.2.I suggest loading your input BAM in IGV and looking at the read pileup at this exon to see whether it suggests a deletion. Are there reads mapped there? Does the pileup have the same shape as other captured exons? Are the reads properly mapped, with reasonably high mapping quality scores?
You can see from the 'depth' column that in the bin chr17:41197644-41197870, your input BAM file had a depth of coverage comparable to other diploid genomic regions. That bin covers an exon and about 40bp of flanking regions on either side. If the deletion you detected through some other experimental method only covers a portion of the exon (maybe 20% of the bin), then the total number of sequencing read bases mapped to that bin could be closer to the genome-wide average despite a small deleted region within it.
Other possibilities:
Hi Eric, I aligned my samples to another reference, 1000G phase 3 (hs37d5), and I also applied BAM recalibration as suggested by GATK. In the end I applied the commands for cnvkit as I described above. The results are not so different from the others obtained before. Here is the new scatter https://ibb.co/czzZMG
And here is the IGV screenshot: https://ibb.co/efgygG
looking at the .cnr file:
the bin of interest is
And that is the exon 22 for some transcripts and the exon 23 for others, as you suggested. The deletion was confirmed with MLPA technique but in the report that I have for this sample it is indicated only the exon number with the deletion but it is not declared respect to which transcript.... The data supporting the deletion is not so strong but I think that I am ready to try cnvkit to unknown samples :) Last doubt on scatter plot: the yellow vertical line is the region of the gene while the orange horizontal line is the segment in the .cns file,is it correct? Again many thanks for the help.
Stefania
Glad to see it worked for you. Yes, you're reading the plot correctly -- the orange line shows the weighted mean of the bin log2 values within a copy number segment, and the yellow span shows the coordinates of BRCA1 as the bins are labeled in the .cnr file (which is not necessarily identical to the gene boundaries in RefSeq, just the first and last .cnr bins labeled "BRCA1").
Does the z-test script give this bin a persuasively small p-value now?
It looks like this is a very focal hemizygous loss, which is generally difficult to detect from read depth alone. If you have SNPs called within this exon, you can also look at allele frequencies and phasing to confirm that one allele was preferentially lost.
Hi Eric, Even in this case the z-test result is 0:
As you said is not an easy deletion to detect and I have not SNPs called in the same region...
Again many thanks!!
Stefania
Hi Eric,
This link https://github.com/etal/cnvkit/blob/master/scripts/cn_ztest.py
Does not work. Would it be possible, please, to share your code with me too?
Thanks
Thanks for catching this, I've updated the link above. It's now: https://github.com/etal/cnvkit/blob/master/scripts/cnv_ztest.py
Hi Eric,
This link isn't working, are you still sharing this with the community? Would you also have a sample.bam that can be used to verify exon deletions for a sanity check?
Many thanks