Dear community members,
we've prepared (yet another) tool for CNV detection, called ClinCNV. It has already been used to analyse around 5 thousand samples sequenced on different platforms, and the results are quite good. We also benchmarked it and found that the tool is at least not worse than its competitors in the germline context and works better in the somatic context (using False Discovery Rate and concordance as metrics). You can check out a short presentation of the tool here (around 60 slides).
The tool uses cohorts of samples and read depth (plus BAF for somatic calling). It has quite a lot of features, such as clustering of samples prior to analysis, IGV visualization, polymorphic region calling, mosaic CNV calling, different options for FDR control, etc. For a quick overview I'd recommend going directly to the docs. Try the test run with the command from here.
One possible limitation: we used ngs-bits for file preparation. However, it is an easy-to-install package, it is fast, and it has many useful features.
Please send me any feedback about the tool.
UPD: the preprint describing the somatic part of ClinCNV is here. Please criticize it. https://www.biorxiv.org/content/10.1101/837971v1
UPD2: ClinCNV's germline CNV detection procedure and results were not published in any form - FIXED, see below
UPD3: Tumor-only calling is implemented. It still requires approx. 20 normal samples sequenced with the same enrichment kit, and it is highly recommended to use it with BAF files and off-target reads. Limitations: less than 50% of the genome affected by CNVs, purity > 30%, no polyploidy. In summary: fine for blood cancers, maybe not good for 50% of solid tumors. It is still an experimental feature - if the results are unsatisfactory, send them to me and we can decide what to improve.
UPD4: Germline CNV calling preprint is on bioRxiv and is citable https://www.biorxiv.org/content/10.1101/2022.06.10.495642v1
May have helped:
https://twitter.com/BiostarBot/status/1185192354438897665
:)
Thanks a lot, Kevin!
Hey man, I was doing the file preparation, and the manual says:
////// Then you need to merge your ".cov" files into one table. To do this, you can use script mergeFilesFromFolder.R script provided with ClinCNV using input_folder and output_folder as variables to keep your absolute paths:
Rscript mergeFilesFromFolder.R -i $input_folder -o $output_folder \\\\
But with --help you can see the following: -o CHARACTER, --out=CHARACTER output file name [default= out.txt]
So -o expects an output file name, not an output folder. The help text is right and the manual is not.
Also, could you make your script merge only the .cov files rather than every file in the folder? If it is not very hard, I think it would be good to allow wildcards: Rscript mergeFilesFromFolder.R -i *.cov -o batch.txt
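For anyone who needs this behaviour before the script changes, here is a minimal Python sketch of the idea: pick up only the .cov files from a folder and merge them into one table. It is not the logic of mergeFilesFromFolder.R. The function names are hypothetical, and the file layout (tab-separated, with chr, start, end and coverage in the first four columns) is an assumption that should be checked against your own files.

```python
import csv
from pathlib import Path


def list_cov_files(folder):
    """Return only the regular files ending in .cov, sorted by name."""
    return sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix == ".cov")


def merge_cov_files(cov_paths, out_path):
    """Merge per-sample coverage files into one tab-separated table.

    Assumed layout (not taken from the ClinCNV docs): chr, start, end, coverage.
    """
    table = {}      # (chr, start, end) -> {sample name: coverage}
    samples = []
    for path in cov_paths:
        sample = Path(path).stem  # sample name = file name without ".cov"
        samples.append(sample)
        with open(path, newline="") as handle:
            for row in csv.reader(handle, delimiter="\t"):
                if len(row) < 4 or row[0].startswith("#"):
                    continue  # skip blank, short, and comment lines
                table.setdefault(tuple(row[:3]), {})[sample] = row[3]
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["chr", "start", "end"] + samples)
        for region, values in table.items():
            # Regions missing in a sample are written as NA.
            writer.writerow(list(region) + [values.get(s, "NA") for s in samples])
```

Calling merge_cov_files(list_cov_files(input_folder), "batch.txt") would then ignore any other files that happen to sit in the same folder.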
Thanks a lot! Will fix it on Monday
I've been trying to get past this exception for nearly an hour. I think I can beat it, but I have to leave now. Maybe you can help me.
I processed files obtained with the TruSight Cardio Sequencing Kit (aligned to GRCh37_latest_genomic.fna). And yes, I think I don't have chrY in my dataset.
simple_command (I made the simplest one for the first run on my data)
Below
.
P.S. Biostars makes a hot mess of this post when I publish it; I don't know how to keep the table view of the data.
Hey, I tidied your code and output via the 101 010 button.
Tidied again
Oh, thanks, I see now how the magic 101 010 button works :) Sorry for the mess, I think these are my first posts on Biostars.
ClinCNV for now does not like small gene panels, mainly due to lack of testing - we simply have not included small panels in our test routine. ClinCNV likes bigger panels since it performs GC and length normalization, and with small panels that is not so easy. I'll work on it on Monday again, but here is what you can try right now: divide your on-target BED file into pieces of, for example, 150 bp with the BedChunk command. How to use the command is described in the off-target reads section. Then recalculate the coverage and run the tool again. As I remember, it solved the problem for our collaborators with the same panel.
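To illustrate what splitting the on-target regions means, here is a small Python sketch that cuts BED intervals into pieces of at most 150 bp. It is only an illustration of the idea and not a reimplementation of BedChunk, which may distribute the chunk boundaries differently. The function names are hypothetical, and for real data the ngs-bits command should be preferred.

```python
def chunk_interval(chrom, start, end, size=150):
    """Split one BED interval (0-based, half-open) into pieces of at most `size` bp."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [(chrom, pos, min(pos + size, end)) for pos in range(start, end, size)]


def chunk_bed_lines(lines, size=150):
    """Yield chunked BED lines; columns beyond the first three are dropped."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or line.startswith(("#", "track", "browser")):
            continue  # skip header, comment, and malformed lines
        start, end = int(fields[1]), int(fields[2])
        for chrom, s, e in chunk_interval(fields[0], start, end, size):
            yield f"{chrom}\t{s}\t{e}"
```

With this simple scheme the last piece of each region can be shorter than 150 bp; a 400 bp region becomes pieces of 150, 150 and 100 bp.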
okay, thanks. I'll try it today
I found a test case that reproduces your error. I will fix it ASAP and write to you once it is fixed.
I had some free time and sent my data to German. I did it a few minutes ago, so it seems I was too late. Sorry :| Anyway, I hope the error can be fixed easily.
Try a git pull now =) and run the same command. It should work.
Thanks for the data, it does work. I've sent the results back to you.
Thank you for the tool... I am going to test it on a set of my data and I was wondering if you could clarify how you run a set of germline samples against a set of normal germline controls?
Hi Duarte! We do not use controls in ClinCNV. You provide some (as many as possible) samples sequenced with the same technology (and preferably in the same lab) and the tool infers CNVs for all the samples included, even if they are just controls. It is possible to run the tool for only one sample: the --normalSample flag then has to be specified with the ID of the sample of interest.
Thanks
I am now testing my samples. I am excited to see how your tool performs on them...
However, I do notice that the threads argument does not seem to do much to improve speed.
I gave it quite a few threads and I can see they are started (in the list of running processes), but they all seem to be dormant except for 1, and the speed at which samples are processed does not seem any faster than on a single thread.
That's correct - it is parallelised only partially. There are 2 time-consuming steps which are parallelised: GC normalization and final calling. In theory, these 2 should work faster with more threads (but more than 8 does not make sense - for germline calling there are only 8 copy-number states). Please let me know how the tool worked, how you like the output, and how you plan to post-process the samples, and I'll try to help you with this.
For the germline... you do not use the TSV files with B-allele frequencies at all?
Not at all. We've benchmarked the tool with and without B-allele frequencies (for germline). Most short CNVs contain no SNVs, so there are no B-allele frequencies for them at all, while long CNVs can be detected using coverage only. So we removed this feature entirely: it added run time and made no difference in benchmarking (only a marginal one, around 1% of precision/recall in WES). However, I discourage running tumor CNV calling without B-allele frequencies - they really change the game there.
thanks
could I offer a few suggestions?
1st) In the output TSV file for each sample, for some reason the length_KB field contains spaces. It seems to be an attempt to pad every length to the same width? I really don't see the point, and it will probably just lead to unwanted problems for people parsing those files who expect to use whitespace as a delimiter.
2nd) It would be cool if you could analyse a set of input files and create a folder with the analysis of the group... this way you could use that data to analyse a single sample of that set without redoing all the initial clustering and read-depth analysis. I know you can set --normalSample to analyse only a single sample of the set... but the initial steps get redone... is that correct?
Thanks for the suggestions! Indeed, we padded the lengths to the same number of characters because doctors asked us to do so (as I remember): they check results in Excel and it was more convenient for them. The columns are tab-separated, so spaces may be stripped from both ends of any cell value.
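A minimal Python sketch of that parsing approach: split on tabs only and strip the padding from each cell. Only the length_KB column name comes from this thread; the other column names and the values in the example are made up for illustration, and the function name is hypothetical.

```python
import csv
import io


def read_cnv_table(handle):
    """Read a tab-separated table, stripping padding spaces from every header and cell."""
    reader = csv.reader(handle, delimiter="\t")
    header = [name.strip() for name in next(reader)]
    return [dict(zip(header, (cell.strip() for cell in row)))
            for row in reader if row]


# Made-up example with space-padded lengths, as described above.
example = (
    "chr\tstart\tend\tlength_KB\n"
    "chr1\t1000\t6000\t      5\n"
    "chr2\t500\t120500\t    120\n"
)
rows = read_cnv_table(io.StringIO(example))
lengths = [int(row["length_KB"]) for row in rows]
```

Splitting on tabs rather than on arbitrary whitespace keeps the padded cells intact until they are stripped explicitly.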
I can implement "analysis of only the listed samples", that's not a problem, maybe in a couple of days. For now you may use --reanalyseCohort F, so ClinCNV will not try to reanalyse samples that you have already analysed (if their folders exist in the output folder).
Thanks... but regarding the second point, I don't think you got the gist of what I was suggesting. I meant saving all the analysis data you compute for a given cohort as a data file, so that when you rerun and say you want to analyse only 1 sample, the initial process of clustering, gender detection, etc. can just be read from a file in the analysis results instead of running everything again. From what I can see in the test I have done, the analysis of 1 sample takes about 3 min on my panel, but the initialisation and clustering probably take twice as long.
If I wanted to invoke the process 20 times for 20 samples, it would run that initial clustering analysis every single time, even though, based on the same input files, the clustering would always yield the same results.
Current method:
1) read input data
2) cluster analysis, gender, coverage, etc.
3) run each sample analysis
4) finish
Current method, second time, indicating a specific sample:
1) read input data
2) cluster analysis, gender, coverage, etc.
3) run just the specified sample or list of samples
4) finish
My suggestion:
1) read input data
2) cluster analysis, gender, coverage, etc. > stored as a file in the results folder
3) run each sample analysis
4) finish
My suggestion, second time, indicating a specific sample or sample list:
1) read input data
2a) check for cluster analysis folder in results > read file
or
2b) do cluster analysis, gender, coverage, etc. > stored as a file in the results folder
3) run just the sample
4) finish
In this case, from that point forward, every time the script is invoked with the same initial inputs and the cluster analysis file is present, the clustering time is saved: the step only involves reading the cluster file already stored in the results folder.
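The suggested flow boils down to a "load if cached, otherwise compute and store" step. A rough Python sketch of that pattern follows; ClinCNV itself is written in R, so this is only an illustration of the idea and not how the tool works. The function name, the cache file name and the stand-in clustering function are all hypothetical.

```python
import pickle
from pathlib import Path


def load_or_compute(cache_path, compute):
    """Return the cached result if the file exists, else compute and store it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Step 2a: the cohort-level analysis was stored by an earlier run.
        with cache_path.open("rb") as handle:
            return pickle.load(handle)
    # Step 2b: run the expensive cohort-level analysis and keep the result.
    result = compute()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as handle:
        pickle.dump(result, handle)
    return result


def run_cohort_analysis():
    """Stand-in for clustering, gender detection, coverage normalization, etc."""
    return {"clusters": {"sampleA": 1, "sampleB": 1, "sampleC": 2}}
```

A call such as load_or_compute("results/cohort_analysis.pkl", run_cohort_analysis) would do the heavy work only on the first run. The cached file would have to be discarded whenever a sample is added to or removed from the cohort, since the cohort-level results change.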
Ah, I see. That's why I did not package this tool as an R package =) You may add "save.image()" to the beginning of the https://github.com/imgag/ClinCNV/blob/master/germline/germlineSolver.R file, then use "load(name_of_saved_image)" and run just the germlineSolver.R script with another opt$normalSample value. The libraries loaded at the beginning of the main script somehow have to be loaded too.
I used this mode for initial tuning of parameters / debugging; in the end, once you have established your parameters, you won't need this intermediate saving of the file. And once you add or remove a sample, you need to recalculate everything anyway.
Hi, my question regarding the somatic CNVs: can we prioritize them according to their functional consequences, in relation to the parameters this tool reports?
I am asking because, once the CNVs are known, how do we use them in a biological context?
Hi Ravinsit06, we use Cancer Genome Interpreter for the annotation. I can upload the scripts that we use; you can also download the database from the CGI website. https://www.cancergenomeinterpreter.org/home