Tool: ClinCNV: CNV detection from short reads
5.2 years ago

Dear community members,

We've prepared (yet another) tool for CNV detection, called ClinCNV. It has already been used to analyse around 5,000 samples sequenced on different platforms, and the results are quite good. We also benchmarked it and found that it is at least not worse than its competitors in the germline context and works better in the somatic context (using False Discovery Rate and concordance as metrics). You can check out a short presentation of the tool here (around 60 slides).

The tool uses cohorts of samples and read depth (plus BAF for somatic calling). It has quite a lot of features, such as clustering of samples prior to analysis, IGV visualization, calling of polymorphic regions, mosaic CNV calling, different options for FDR control, etc. For a quick overview I'd recommend going directly to the docs. Try the test run with the command from here.

A possible limiting factor: we use ngs-bits for file preparation. However, it is an easy-to-install package, it is fast, and it has many useful features.

Please send me any feedback about the tool.

UPD: the preprint for the somatic part of ClinCNV is here. Please criticize it. https://www.biorxiv.org/content/10.1101/837971v1

UPD2: ClinCNV's germline CNV detection procedure and results were not published in any form - FIXED, see below

UPD3: Tumor-only calling is implemented. It still requires approximately 20 normal samples sequenced with the same enrichment kit, and it is highly recommended to use it with BAF files and off-target reads. Limitations: less than 50% of the genome affected by CNVs, purity > 30%, no polyploidies. In summary: fine for blood cancers, maybe not good for 50% of solid tumors. It is still an experimental feature; if the results are unsatisfactory, send them to me and we can decide what to improve.

UPD4: Germline CNV calling preprint is on bioRxiv and is citable https://www.biorxiv.org/content/10.1101/2022.06.10.495642v1

variant-calling cna cnv • 4.2k views
ADD COMMENT

Thanks a lot, Kevin!

ADD REPLY

Hey, I was doing the file preparation, and the manual says:

////// Then you need to merge your ".cov" files into one table. To do this, you can use script mergeFilesFromFolder.R script provided with ClinCNV using input_folder and output_folder as variables to keep your absolute paths:

Rscript mergeFilesFromFolder.R -i $input_folder -o $output_folder \\\\

But with --help you can see the following:

-o CHARACTER, --out=CHARACTER output file name [default= out.txt]

So -o takes an output file name, not an output folder: the --help text is right and the manual is not.

Also, could you make the script merge only the .cov files rather than every file in the folder? If it is not very hard, I think it would be good to allow wildcards: Rscript mergeFilesFromFolder.R -i *.cov -o batch.txt
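Until something like that exists, the merge can be approximated outside ClinCNV. The following is only a rough Python sketch of the idea, not the logic of mergeFilesFromFolder.R. It assumes each per-sample .cov file is tab-separated with one header line followed by chr, start, end and a single coverage column, which is not confirmed in this thread; the function name merge_cov_files and the toy file names are made up for illustration.

```python
import csv
import glob
import os
import tempfile


def merge_cov_files(pattern, out_path):
    """Merge the per-sample coverage files matching `pattern` into one table.

    Assumed layout (not confirmed in this thread): tab-separated, one header
    line, then chr / start / end / coverage on every row.
    """
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise ValueError(f"no files match {pattern!r}")

    regions = None  # (chr, start, end) tuples, taken from the first file
    samples = []    # sample names derived from the file names
    coverage = []   # one list of coverage strings per sample
    for path in paths:
        with open(path, newline="") as handle:
            rows = list(csv.reader(handle, delimiter="\t"))[1:]  # drop header
        coords = [tuple(row[:3]) for row in rows]
        if regions is None:
            regions = coords
        elif coords != regions:
            raise ValueError(f"{path} does not list the same regions as {paths[0]}")
        samples.append(os.path.splitext(os.path.basename(path))[0])
        coverage.append([row[3] for row in rows])

    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["#chr", "start", "end"] + samples)
        for i, region in enumerate(regions):
            writer.writerow(list(region) + [column[i] for column in coverage])
    return samples


# Toy demonstration with made-up numbers: two .cov files and one unrelated file.
with tempfile.TemporaryDirectory() as folder:
    for name, values in [("S1.cov", ["10.5", "20.0"]), ("S2.cov", ["11.0", "19.5"])]:
        with open(os.path.join(folder, name), "w") as handle:
            handle.write("#chr\tstart\tend\tcoverage\n")
            handle.write(f"chr1\t100\t200\t{values[0]}\n")
            handle.write(f"chr1\t300\t400\t{values[1]}\n")
    with open(os.path.join(folder, "notes.txt"), "w") as handle:
        handle.write("not a coverage file\n")

    merged_path = os.path.join(folder, "batch.txt")
    merge_cov_files(os.path.join(folder, "*.cov"), merged_path)
    with open(merged_path) as handle:
        print(handle.read())
```

Because the pattern is expanded with glob inside the script, only files ending in .cov are read and notes.txt is ignored.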

ADD REPLY

Thanks a lot! Will fix it on Monday

ADD REPLY

I've been trying to get past this error for nearly an hour. I think I can beat it, but I have to leave now. Maybe you can help me.

[1] "Percentage of regions remained after GC correction: 0.957518796992481"
Error in gcNormalisedCov[which(!bedFile[, 1] %in% c("chrX", "chrY")),  : 
  subscript out of bounds
Calls: writeOutLevelOfNoiseVersusCoverage -> apply
Execution halted

I processed files obtained with the TruSight Cardio sequencing kit (aligned to GRCh37_latest_genomic.fna). And yes, I think I don't have chrY in my dataset.

#################################

Simple command (I made the simplest possible one for the first run on my data):

Rscript ~/progs/ClinCNV/clinCNV.R --normal $Files/batch_1.cov --bed $Path/gcAnnotated.extended_trusight.bed --out $Files/RES --numberOfThreads 8

Below are the first and last lines of my input files:

 1. head -n 10 my_bed.bed; tail -n 10 my_bed.bed

chr1    2985722 2985960 0.7017
chr1    3102587 3103138 0.6316
chr1    3160549 3160801 0.5556
chr1    3301612 3301950 0.5888
chr1    3303139 3303360 0.3801
chr1    3310955 3311158 0.7094
chr1    3312953 3313257 0.6151
chr1    3319253 3319662 0.6381
chr1    3321201 3321550 0.6476
chr1    3321957 3322312 0.7014
......
chrX    153607743   153608479   0.6644
chrX    153608492   153608827   0.5851
chrX    153609011   153609657   0.6022
chrX    153640079   153640651   0.6958
chrX    153641442   153641693   0.6096
chrX    153641717   153642004   0.6202
chrX    153642336   153642627   0.5258
chrX    153647780   153648185   0.5926
chrX    153648269   153648703   0.6221
chrX    153648895   153649443   0.6150

.

 2. head -n 5 batch_1.cov; tail -n 5 batch_2.cov

X.chr   start   end X100_S3_Srt X102_S5_Srt X104_S7_Srt X106_S4_Srt X107_S9_Srt X108_S10_Srt    X109_S11_Srt    X110_S5_Srt X111_S6_Srt X113_S8_Srt X114_S9_Srt X116_S10_Srt    X117_S11_Srt    X125_S2_Srt X127_S3_Srt X129_S4_Srt X130_S5_Srt X131_S6_Srt X132_S7_Srt X133_S8_Srt X135_S9_Srt X136_S10_Srt    X137_S11_Srt    X139_S12_Srt    X17_S2_Srt  X23_S5_Srt  X32_S4_Srt  X52_S1_Srt  X86_S2_Srt  B_S12_Srt   ry_S12_Srt
chr1    112318597   112319000   77.273  124.1538    120.196 27.2283 137.1762    137.6774    143.9801    26.3077 44.1663 28.0819 79.0943 37.2357 47.5509 108.5236    87.34   147.5881    79.5186 70.3871 100.9355    30.3772 129.4888    153.3052    90.6998 126.866 115.6725    68.5782 120.9082    114.8635    46.7395 52.9504 82.603
chr1    112319546   112319995   83.412  116.5367    115.8998    35.1849 107.0111    105.4454    127.0022    26.92247.9599   30.7461 100.5323    42.6303 49.9844 56.6414 66.5234 89.5278 55.8842 56.098  70.5702 23.2138 84.902  121.3163    61.9198 79.0757 119.4053    61.0913 138.4454    97.4232 52.4365 53  71.0045
chr1    112320956   112321214   57.124  111.7713    79.3837 22.593  88.5155 96.3605 114.5349    21.2364 38.155  12.2519 66.7907 23.5426 40.8295 71.0116 65.1822 64.062  43.4612 47.1124 78.155  24.1434 70.2442 85.6822 58.8837 54.2326 81.1705 46.4264 93.1008 65.2326 41.0659 32.8915 61.0543
chr1    112322745   112323036   109.0997    122.4158    125.433 48.9588 119.6632    116.6186    145.1684    36.7938 55.9072 38.3162 110.3196    48.0309 70.3643 85.457  65.7182 102.7148    57.1581 50.2749 83.9897 17.1031 79.8247 106.9244    85.1512 98.2027 98.6529 62.2027 109.9966    134.9485    74.5223 71.677  71.866
....
chrX    32867743    32868037    44.5816 61.1599 54.1429 30.1361 68.0204 122.5442    67.8163 9.6599  27.7211 7.1769  37.2517 35.2313 36.2755 43.0306 30.9966 31.5612 27.5816 50.4014 38.5986 26.8776 25.6769 57.3061 69.7891 30.051  52.8299 52.3639 41.0646 32.381  34.53421.2483   74.5102
chrX    33038154    33038417    35.711  107.3992    104.365 32.7452 108.3954    168.0798    78.057  8.4335  20.7376 8.045646.6882   31.5247 42.384  102.7224    47.8669 86.5247 62.7224 104.9696    97.1711 45.1977 80.4259 164.0875    176.8327    35.1255 56.5323 81.365  79.0951 32.4563 24.6388 13.3612 109.8175
chrX    33146162    33146382    65.8364 114.0682    107.6   48.7227 136.7   190.2455    110.3227    19.6545 35.5818 26.9364 89.5682 73.5955 81.7864 74.1    31.6636 48.15   42.4227 63.0727 78.4136 36.5591 52.2864 130.5909    144.1955    49.9455 94.1136 148.0182    102.3182    83.4455 48.0136 38.7682 126.65
chrX    33229297    33229529    43.3793 77.8534 79.5259 32.6121 84.7672 128.0345    79.2629 15.7974 23.2328 12.1983 36.9353 26.4569 44.2241 57.9828 35.8448 40.0517 33.5043 79.9526 58.3534 18.8578 51.2198 97.6681 105.6595    30.0259 65.0948 78.3103 48.8966 41.3922 27.4397 15.8922 96.7543
chrX    33357274    33357482    43.4183 98.2356 69.0529 34.0288 83.2548 155.5817    81.6442 8.5481  36.6731 15.9567 47.4519 36.0337 61.3413 54.2452 39.9567 58.2837 34.7596 72.8221 62.0721 32.8798 56.0385 117.0337    123.8413    51.4038 74.3077 78.7692 77.7548 55.899  23.7404 24.7452 96.7163

P.S. Biostars makes a hot mess of this post when publishing it; I don't know how to preserve the table layout of the data.

ADD REPLY

Hey, I tidied your code and output via the 101 010 button.

ADD REPLY

Tidied again

ADD REPLY

Oh, thanks, I see now how the magic 101 010 button works :) Sorry for the mess, I think these are my first posts on Biostars.

ADD REPLY

For now ClinCNV does not like small gene panels, mainly due to lack of testing: we simply have not included small panels in our test routine. ClinCNV prefers bigger panels because it performs GC and length normalization, which is harder on small panels. I'll work on it on Monday, but what you can try right now is to split your on-target BED file into pieces of, for example, 150 bp with the BedChunk command. How to use the command is described in the off-target reads section. Then recalculate the coverage and run it again. As I remember, this solved the problem for our collaborators with the same panel.
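To make the splitting idea concrete, here is a small Python sketch that cuts every BED interval into near-equal pieces of at most 150 bp. It only illustrates the concept and is not how BedChunk itself is implemented; the function names chunk_region and chunk_bed_lines are hypothetical. The fourth (GC) column is deliberately not carried over, since GC content would have to be recomputed per piece.

```python
def chunk_region(chrom, start, end, max_len=150):
    """Split one BED interval into near-equal pieces no longer than max_len."""
    length = end - start
    n_pieces = max(1, -(-length // max_len))  # ceiling division
    base, extra = divmod(length, n_pieces)
    pieces = []
    pos = start
    for i in range(n_pieces):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        pieces.append((chrom, pos, pos + size))
        pos += size
    return pieces


def chunk_bed_lines(lines, max_len=150):
    """Chunk every interval of a BED file given as an iterable of text lines."""
    out = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments and header lines
        out.extend(chunk_region(fields[0], int(fields[1]), int(fields[2]), max_len))
    return out


# Two regions taken from the BED excerpt posted earlier in this thread.
example = ["chr1\t2985722\t2985960\t0.7017", "chr1\t3102587\t3103138\t0.6316"]
for chrom, start, end in chunk_bed_lines(example):
    print(chrom, start, end, sep="\t")
```

Regions already shorter than the limit pass through unchanged, so the sketch can be applied to a whole panel at once.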

ADD REPLY

okay, thanks. I'll try it today

ADD REPLY

I found a test case that reproduces your error. I'll fix it ASAP and write to you once it is fixed.

ADD REPLY

I had some free time and sent my data to German. I did it a few minutes ago, so it seems I was late. Sorry :| But anyway, I hope the error can be fixed easily.

ADD REPLY

Try a git pull now =) and run the same command. It should work.

ADD REPLY

Thanks for the data. It does work; I've sent the results back to you.

ADD REPLY

Thank you for the tool... I am going to test it on a set of my data and I was wondering if you could clarify how you run a set of germline samples against a set of normal germline controls?

ADD REPLY

Hi Duarte! We do not use controls in ClinCNV. You provide some (as many as possible) samples sequenced with the same technology (and preferably in the same lab), and the tool infers CNVs for all the included samples, even if they are just controls. It is also possible to run the tool for only one sample: the --normalSample flag then has to be specified with the ID of the sample of interest.

ADD REPLY

Thanks

I am now testing my samples. I am excited to see how your tool performs on them...

However, I do notice that the threads argument does not seem to do much to improve speed.

I gave it quite a few threads and I can see they are started (in the list of running processes), but they all seem to be dormant except for 1, and the speed at which samples are processed does not seem any faster than on a single thread.

ADD REPLY

That's correct - it is only partially parallelised. There are 2 time-consuming steps which are parallelised: GC normalization and final calling. In theory, these 2 should run faster with more threads (but more than 8 does not make sense - for germline calling there are only 8 copy-number states). Please let me know how the tool worked, how you like the output and how you plan to post-process the samples, and I'll try to help you with this.

ADD REPLY

For germline... you do not use the TSV files with B-allele frequencies at all?

ADD REPLY

Not at all. We've benchmarked the tool with and without B-allele frequencies (for germline). Most short CNVs contain no SNVs, so there are no B-allele frequencies for them at all, while long CNVs can be detected using coverage only. So we removed this feature entirely: it added run time and made no difference in benchmarking (only a marginal one, about 1% of precision/recall in WES). However, I discourage running tumor CNV calling without B-allele frequencies - they are a real game changer there.

ADD REPLY

thanks

could I offer a few suggestions?

1st) In the output TSV file for each sample, for some reason the length_KB field contains spaces. It seems to be an effort to give every length the same width? I really don't see the point, and it will probably just lead to unwanted problems for people parsing those files who expect whitespace as a delimiter.

2nd) It would be cool if you could analyse a set of input files and create a folder with the analysis of the group. This way you could use that data to analyse a single sample of that set without redoing all the initial clustering and read-depth analysis. I know you can set --normalSample to analyse only a single sample of the set, but the initial steps get redone. Is that correct?

ADD REPLY

Thanks for the suggestions! Indeed, we pad the length with spaces to a fixed width because doctors asked us to do so (as I remember): they check the results in Excel and it was more convenient for them. The columns are tab-separated, so the spaces can be stripped from both ends of any cell value.
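For anyone parsing the files programmatically, a minimal Python sketch of that approach could look like the following. It splits on tabs only and strips the padding from each cell. It assumes the first line of the file is the header row, and the example rows are made up; only the length_KB column name comes from this thread. The function name read_padded_tsv is hypothetical.

```python
import csv
import io


def read_padded_tsv(handle):
    """Read a tab-separated table, stripping padding spaces from every cell."""
    reader = csv.reader(handle, delimiter="\t")
    header = [name.strip().lstrip("#") for name in next(reader)]
    return [dict(zip(header, (cell.strip() for cell in row))) for row in reader]


# Made-up rows; only the length_KB column name comes from this thread.
example = io.StringIO(
    "#chr\tstart\tend\tlength_KB\n"
    "chr1\t1000\t2500\t   1.5\n"
    "chrX\t5000\t125000\t 120.0\n"
)
for row in read_padded_tsv(example):
    print(row["chr"], row["length_KB"], float(row["length_KB"]))
```

Splitting on the tab character rather than on arbitrary whitespace is what keeps the padded cells intact as single fields.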

I can implement "analysis of only the listed samples", that's not a problem, maybe in a couple of days. For now you may use --reanalyseCohort F, so ClinCNV will not try to reanalyse samples that you have already analysed (if their folders exist in the output folder).

ADD REPLY

Thanks... but regarding the second point, I don't think you got the gist of what I was suggesting. I meant saving all the analysis you do for a given cohort as a data file, so that when you rerun and say you want to analyse only 1 sample, the initial process of clustering, gender detection, etc. can just be read from a file in the analysis results instead of running everything again. From what I can see in the tests I have done, the analysis of 1 sample takes about 3 min on my panel, but the initialisation and clustering probably take twice as long.

If I wanted to invoke the process 20 times for 20 samples, it would run that initial clustering analysis every single time, even though the same input files would always yield the same clustering results.

Current method, first run:

1) read input data

2) cluster analysis, gender, coverage, etc.

3) run each sample analysis

4) finish

Current method, second run indicating a specific sample:

1) read input data

2) cluster analysis, gender, coverage, etc.

3) run just the specified sample or list of samples

4) finish

My suggestion, first run:

1) read input data

2) cluster analysis, gender, coverage, etc., stored as a file in the results folder

3) run each sample analysis

4) finish

My suggestion, second run indicating a specific sample or sample list:

1) read input data

2a) check for the cluster analysis file in the results folder and read it

or

2b) if it is missing, do cluster analysis, gender, coverage, etc. and store the result as a file in the results folder

3) run just the sample

4) finish

From then on, every time the script is invoked with the same initial inputs and the cluster analysis file is present, that time is saved, because the step only involves reading the cluster file already present in the results folder.
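The suggested flow is essentially a cache on disk keyed by the inputs. Here is a generic Python sketch of that pattern, not ClinCNV code: cohort_step, inputs_digest and fake_clustering are hypothetical names, and the "clustering" is a stand-in that just counts lines. Hashing the input files means the cached result is reused only while the inputs are unchanged.

```python
import hashlib
import json
import os
import tempfile


def inputs_digest(paths):
    """Hash names and contents of the input files, so any change is noticed."""
    digest = hashlib.sha256()
    for path in sorted(paths):
        digest.update(os.path.basename(path).encode())
        with open(path, "rb") as handle:
            digest.update(handle.read())
    return digest.hexdigest()


def cohort_step(input_paths, results_dir, compute):
    """Return (result, was_cached) for the cohort-level step.

    `compute` stands in for clustering, gender detection, etc. and must
    return something JSON-serialisable.
    """
    key = inputs_digest(input_paths)[:16]
    cache_path = os.path.join(results_dir, f"cohort_{key}.json")
    if os.path.exists(cache_path):  # step 2a: read the stored analysis
        with open(cache_path) as handle:
            return json.load(handle), True
    result = compute(input_paths)   # step 2b: compute and store it
    os.makedirs(results_dir, exist_ok=True)
    with open(cache_path, "w") as handle:
        json.dump(result, handle)
    return result, False


# Toy stand-in for the expensive step: count the lines of each input file.
def fake_clustering(paths):
    counts = {}
    for path in paths:
        with open(path) as handle:
            counts[os.path.basename(path)] = sum(1 for _ in handle)
    return counts


with tempfile.TemporaryDirectory() as folder:
    cov = os.path.join(folder, "batch.cov")
    with open(cov, "w") as handle:
        handle.write("chr1\t100\t200\t10.5\n")
    results = os.path.join(folder, "results")
    print(cohort_step([cov], results, fake_clustering))  # first run: computed
    print(cohort_step([cov], results, fake_clustering))  # second run: read from file
```

The second element of the returned tuple tells whether the stored file was used, which makes it easy to log when the expensive step was skipped.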

ADD REPLY

Ah, I see. That's why I did not package this tool as an R package =) You may add "save.image()" to the beginning of the https://github.com/imgag/ClinCNV/blob/master/germline/germlineSolver.R file, then use "load(name_of_saved_image)" and run just the germlineSolver.R script with another opt$normalSample value. The libraries from the beginning of the main script have to be loaded too, one way or another.

I used this mode for initial tuning of parameters and debugging. In the end, once you have established your parameters, you won't need this intermediate saved file. And once you add or remove a sample, you need to recalculate everything anyway.

ADD REPLY

Hi, my question about somatic CNVs: can we prioritize them by their functional consequences, in relation to the parameters this tool reports for them?

I am asking because, once the CNVs are known, how do we use them in a biological context?

ADD REPLY

Hi Ravinsit06, we use Cancer Genome Interpreter (CGI) for the annotation. I can upload the scripts that we use; you can also download the database from the CGI website. https://www.cancergenomeinterpreter.org/home

ADD REPLY
