GermlineCNVCaller: interval scattering for WES data
4.3 years ago
Z-F ▴ 20

Hi everyone,

I am running the GATK gCNV pipeline on 200 WES samples, starting with the cohort ploidy model.

Following this tutorial, I am now trying to call copy numbers with GermlineCNVCaller. Part 4.2 gives some guidance on how to provide interval lists for scattering.

I am confused about how to work this out for exome data.

The tutorial says that "The current recommendation is to provide at least ~10–50Mbp genomic coverage per scatter". The filtered interval list I provided for the analysis is based on the BED file from the enrichment kit and covers 33 Mb of the genome. Does this mean there is no need to split the intervals into subsets for exome data? I actually ran the command without subsetting the intervals; after 3 days, "denoising (warm-up) epoch 1" has finished and the command is still running. See the log below:

10:31:01.539 INFO gcnvkernel.tasks.inference_task_base - (denoising (warm-up) epoch 1) ELBO: -4.797 +/- 0.000, SNR: 13.2, T: 1.00: 100%|#########9| 4997/5000 [45:59:43<01:39, 33.14s/it]
10:31:34.540 INFO gcnvkernel.tasks.inference_task_base - (denoising (warm-up) epoch 1) ELBO: -4.797 +/- 0.000, SNR: 13.1, T: 1.00: 100%|#########9| 4998/5000 [46:00:16<01:06, 33.14s/it]
10:32:06.026 INFO gcnvkernel.tasks.inference_task_base - (denoising (warm-up) epoch 1) ELBO: -4.797 +/- 0.000, SNR: 13.1, T: 1.00: 100%|#########9| 4999/5000 [46:00:47<00:33, 33.14s/it]
10:32:42.093 INFO gcnvkernel.tasks.inference_task_base - (denoising (warm-up) epoch 1) ELBO: -4.797 +/- 0.000, SNR: 13.0, T: 1.00: 100%|##########| 5000/5000 [46:01:23<00:00, 33.14s/it]
10:32:42.094 WARNING gcnvkernel.tasks.inference_task_base - Inference task completed successfully but convergence not achieved.
10:32:42.095 INFO gcnvkernel.tasks.task_cohort_denoising_calling - Instantiating the denoising model (main)...
10:36:23.022 INFO gcnvkernel.tasks.task_cohort_denoising_calling - Instantiating the sampler...
10:36:23.023 INFO gcnvkernel.tasks.task_cohort_denoising_calling - Instantiating the copy number caller...
10:45:54.298 INFO gcnvkernel.models.fancy_model - Global model variables: {'psi_t_log__', 'log_mean_bias_t', 'W_tu', 'ard_u_log__'}
10:45:54.299 INFO gcnvkernel.models.fancy_model - Sample-specific model variables: {'z_sg', 'psi_s_log__', 'read_depth_s_log__', 'z_su'}
10:45:54.300 INFO gcnvkernel.tasks.inference_task_base - Instantiating the convergence tracker...
10:45:54.300 INFO gcnvkernel.tasks.inference_task_base - Setting up DA-ADVI...
10:52:55.026 INFO gcnvkernel.tasks.task_cohort_denoising_calling - A warm-up task was provided -- copying mean-field parameter values, temperature, and optimizer moments from the warm-up task...
10:52:55.160 INFO gcnvkernel.tasks.inference_task_base - (denoising (main)) starting...: 0it [00:00, ?it/s]
10:52:55.163 INFO gcnvkernel.tasks.inference_task_base - (sampling epoch 1): 0%| | 0/10 [00:00
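
As a rough sanity check on the log above, the progress bar itself shows where the time went: the warm-up epoch ran 5000 iterations at about 33.14 s per iteration. A minimal sketch of that arithmetic follows; the function name is illustrative and not part of GATK.

```python
def epoch_hours(iterations, seconds_per_iteration):
    """Wall-clock hours for one epoch at a constant per-iteration cost."""
    return iterations * seconds_per_iteration / 3600.0

# Values read from the tqdm progress bar in the log: 5000/5000 at 33.14s/it.
warmup_hours = epoch_hours(5000, 33.14)
print(f"warm-up epoch: about {warmup_hours:.1f} hours")
```

This agrees with the elapsed time of roughly 46 hours reported in the progress bar, so the slow part is the per-iteration cost rather than a stall.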

If we do need to subset the exome intervals, could you please explain how? The example in the tutorial uses 1 kb bins, but for exome data a bin length of 0 and a padding of 250 are used, and I cannot figure out how to choose the scatter content so that each scatter has at least 10–50 Mbp of genomic coverage.
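
One way to approach the question without fixed-size bins is to sum the actual interval lengths and derive the shard count from the total. The sketch below is only an illustration under stated assumptions: the intervals are BED-style half-open `(chrom, start, end)` tuples, the function names are hypothetical, and "intervals per shard" is assumed to correspond to what the tutorial calls scatter content when intervals are subdivided by count. The toy interval list is made up so that it sums to 33 Mbp; it is not real kit data.

```python
import math

def covered_bp(intervals):
    """Total bases covered by BED-style half-open (chrom, start, end) intervals."""
    return sum(end - start for _, start, end in intervals)

def scatter_plan(intervals, min_bp_per_shard=10_000_000):
    """Return (total_bp, n_shards, intervals_per_shard).

    n_shards is rounded down so that every shard covers at least
    min_bp_per_shard on average; it is never less than 1.
    """
    total = covered_bp(intervals)
    n_shards = max(1, total // min_bp_per_shard)
    intervals_per_shard = math.ceil(len(intervals) / n_shards)
    return total, n_shards, intervals_per_shard

# Toy data (assumption): 200,000 padded targets of 165 bp each = 33 Mbp.
toy = [("chr1", i * 1000, i * 1000 + 165) for i in range(200_000)]

plan_10mb = scatter_plan(toy, min_bp_per_shard=10_000_000)
plan_50mb = scatter_plan(toy, min_bp_per_shard=50_000_000)
print(plan_10mb)
print(plan_50mb)
```

With a 33 Mbp target set, the lower end of the 10–50 Mbp recommendation allows at most three shards, and the upper end leaves everything in a single shard, which is the unscattered run described above.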

Thanks in advance.

CNV GATK GermlineCNVCaller WES • 1.5k views

I have only used the GATK caller in tumor-normal mode. There, off-target reads are beneficial, but for germline exome sequencing I would say they just add complexity without a significant increase in sensitivity (they can only detect large CNVs). 1 kb is surely too small; maybe try 25 kb instead (it depends on the number of off-target reads you have, I guess).

In the meantime you can try ClinCNV. I can provide more assistance with that caller since I developed it.

It is not that I consider the GATK caller worse, but 3 days is surely too long, even for analyzing thousands of samples. Something went wrong, and I am not sure what.


Thanks for the reply.

To be clearer, I did specify an interval list in my command, and it seems the CNVs will be called over the exome interval file I provided (33 Mbp).

As far as I understand, the GATK tutorial says this step needs a lot of memory and that, to avoid such long run times, it is better to scatter the intervals into regions covering at least 10 Mbp to 50 Mbp. It then gives a somewhat confusing explanation of how to calculate the number of scatters from the bin size chosen when preparing the interval file.

However, GATK recommends binning only for WGS intervals; for exome data we only padded the intervals and disabled binning when preprocessing the interval file.

That is why I cannot figure out how to subset the interval file for my exome samples.
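
Because exome targets vary in length, one option is to split by accumulated base pairs rather than by a bin-derived count. The following is a minimal sketch under that assumption, not the tutorial's method: `split_by_bp` is a hypothetical helper that walks the intervals in order, closes a shard once it reaches the minimum coverage, and folds any short remainder into the last shard.

```python
def split_by_bp(intervals, min_bp_per_shard):
    """Greedily group ordered (chrom, start, end) intervals into shards
    that each cover at least min_bp_per_shard bases."""
    shards, current, current_bp = [], [], 0
    for chrom, start, end in intervals:
        current.append((chrom, start, end))
        current_bp += end - start
        if current_bp >= min_bp_per_shard:
            shards.append(current)
            current, current_bp = [], 0
    if current:
        if shards:
            # Remainder is below the minimum: merge it into the last shard.
            shards[-1].extend(current)
        else:
            # Total coverage is below the minimum: keep a single shard.
            shards.append(current)
    return shards

# Toy example (assumption): ten 100 bp targets, at least 300 bp per shard.
toy = [("chr1", i * 1000, i * 1000 + 100) for i in range(10)]
shards = split_by_bp(toy, 300)
print([len(s) for s in shards])
```

Each shard could then be written out as its own interval list; how those files are passed to the caller is left to the tutorial's instructions.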

By the way, thanks for suggesting ClinCNV. I have already started looking at the program and will try it.


I am not sure I fully understand what GATK asks for, but my understanding is that binning in WES either 1) tries to make use of off-target coverage, or 2) is not needed. I would not bin WES data at all (unless one wants to detect intra-exonic variants, or the task is specifically to use off-target reads to detect large non-coding CNVs). I also do not understand what scattering actually does and why it is required for cohort mode. Your BED file of targeted regions gives a coverage structure very different from what is expected for WGS, and I honestly don't know how such sparse, uneven coverage would work with scattering. Is it possible to skip this step and check the results?


Yes, I actually skipped the scattering step.

As far as I understand, GATK says that scattering only helps with speed and reduces the memory required.

I did not scatter the intervals and simply used the exome interval file I had prepared according to their recommendations.

It has now been running for 3 days (cohort mode, 200 WES samples), using about 120 GB of memory on average on the server, with a load average of 40 (sometimes rising to 80).

There is no error; the command is still running and progressing through its steps, and I am waiting to see how long it takes and whether it finishes properly.

However, I am still concerned that skipping this step was a mistake and that the whole process might fail at the end (I don't know for sure, which is why I asked).

By the way, GATK also says no binning for WES.


I guess it is used to "smooth" the coverage of each bin using the neighboring bins. So, if the goal is to find single-exon CNVs, this step surely has to be skipped. I wonder whether the results will look good without that step. Please let us know if the calls look fine to you in the end =)

That is quite an enormous amount of time and memory, I would say, but if the results are exceptionally good, why not.
