Hello,
I am working with whole-genome sequencing CRAM files and want to run the GATK Best Practices workflow. Before that, I want to slice each CRAM into smaller chunks (50 kb regions with 1 kb padding) without losing any reads. The goal is to parallelize my analysis and speed it up. Once I have created the gVCF files, I am going to merge them.
How can I do the slicing?
What have you tried? I ask because you have bed as a tag and are thus aware of BED-format-based tools.
I have tried to create a bed file for the hg38 reference genome with 50 kb region length and 1 kb padding. I used the following script to create the bed file, but I am not sure whether the output is correct. The bed file I have created looks like this: But I am not sure if it is the right way to do this. Then, I want to use this bed file to create CRAM chunks using samtools.
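For the samtools step, here is a minimal sketch of how one BED interval could be turned into a samtools view call. BED coordinates are 0-based and half-open, while samtools region strings are 1-based and inclusive, so the start needs +1. The file names (sample.cram, hg38.fa, the output name) and both helper functions are hypothetical, and the CRAM is assumed to be indexed. The command is only built and printed here, not run.

```python
def bed_to_region(chrom, start, end):
    """Convert a 0-based, half-open BED interval to a 1-based, inclusive region string."""
    return f"{chrom}:{start + 1}-{end}"


def samtools_view_cmd(cram, reference, chrom, start, end, out):
    """Build (but do not run) a samtools view command that writes one CRAM chunk."""
    region = bed_to_region(chrom, start, end)
    # -C: write CRAM, -T: reference FASTA, -o: output file
    return ["samtools", "view", "-C", "-T", reference, "-o", out, cram, region]


# Hypothetical file names, for illustration only.
cmd = samtools_view_cmd(
    "sample.cram", "hg38.fa", "chr1", 49_000, 101_000,
    "sample.chr1_49001_101000.cram",
)
print(" ".join(cmd))
```

samtools view returns every read that overlaps the requested region, so a read spanning a chunk boundary ends up in both neighbouring chunks rather than being lost.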
You want bedtools makewindows.
go
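To check a bed file made another way, it may help to see what fixed-size windowing amounts to. This is a rough Python sketch of the logic behind something like bedtools makewindows -g <genome file> -w 50000, not the tool's own code. The make_windows helper and the toy chromosome chrA are made up for illustration.

```python
def make_windows(chrom_sizes, window=50_000):
    """Yield (chrom, start, end) BED intervals (0-based, half-open) tiling each chromosome."""
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, window):
            # The last window is truncated at the chromosome end.
            yield chrom, start, min(start + window, size)


# Toy chromosome of 120 kb (hypothetical), so the last window is shorter.
toy_sizes = {"chrA": 120_000}
for chrom, start, end in make_windows(toy_sizes):
    print(chrom, start, end, sep="\t")
```

On the toy input this gives three windows: 0-50000, 50000-100000 and 100000-120000.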
Your question has already been answered 5 months ago. Why are you asking it again?
I want to add the 1kb padding, but I did not get an answer from the last question. How can I add 1kb padding?
bedtools slop
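For reference, a call such as bedtools slop -i windows.bed -g <genome file> -b 1000 extends each interval by 1 kb on both sides and clips it at the chromosome ends. Here is a small Python sketch of that arithmetic, assuming the intervals are 0-based, half-open BED windows. The slop function, the file name windows.bed and the toy chromosome are illustrative only.

```python
def slop(intervals, chrom_sizes, pad=1_000):
    """Extend each (chrom, start, end) interval by `pad` on both sides, clipped to the chromosome."""
    for chrom, start, end in intervals:
        yield chrom, max(0, start - pad), min(chrom_sizes[chrom], end + pad)


# Toy 120 kb chromosome (hypothetical) already cut into 50 kb windows.
sizes = {"chrA": 120_000}
windows = [("chrA", 0, 50_000), ("chrA", 50_000, 100_000), ("chrA", 100_000, 120_000)]
padded = list(slop(windows, sizes))
for chrom, start, end in padded:
    print(chrom, start, end, sep="\t")
```

Neighbouring padded windows now overlap by 2 kb, so variants called in the padding will show up twice unless the calls are trimmed back to the unpadded windows before merging.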
How is it different from your previous question? How to Split 3000 WGS CRAM files into 1Mbp length chunks
Do it per chromosome, not per 50 kb bin. 50 kb bins create thousands of files, and that is a big I/O burden.
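To put a rough number on that, here is a quick back-of-the-envelope count in Python. It uses the GRCh38 length of chr1 (248,956,422 bp). The n_windows helper is just for illustration.

```python
def n_windows(chrom_length, window=50_000):
    """Number of fixed-size windows needed to tile a chromosome (ceiling division)."""
    return -(-chrom_length // window)


CHR1_GRCH38 = 248_956_422  # chr1 length in GRCh38/hg38
print(n_windows(CHR1_GRCH38))  # 4980 windows of 50 kb for chr1 alone
```

That is close to 5,000 chunk files for chr1 of a single sample, compared with one file when splitting per chromosome.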
I am dealing with WGS data, where the computation time grows across multiple GATK steps, even when run per chromosome. That is why I want very small regions to parallelize the computation.
What infrastructure do you have available? Do you have an HPC that even allows you to run thousands of processes on hundreds of nodes in parallel?
Yes, I have a cluster that allows me to run many processes on hundreds of nodes in parallel.
I kind of doubt that the scheduler gives them to you all at once. As I said, my recommendation is per-chromosome splitting (if at all) and then just let it run. Use resources to run samples in parallel, not to split a single sample into thousands of chunks. The overhead of merging all that in the end is big and to some extent error-prone, unless you have a bullet-proof pipeline, which (with respect) I doubt, given that you ask here for help and use R even for creating the intervals (no offense, I know it must sound like it, but it really isn't).