How to slice a CRAM file into the 50kb regions padded with 1kb?
1
0
Entering edit mode
13 months ago
Sd • 0

Hello,

I am working on whole genome sequencing CRAM files and I want to perform GATK best practice. Before that, I want to slice each CRAM into smaller chunks, 50kb regions with 1kb padding, and avoid losing the reads. I want this to paralelize my analysis and increase the speed. After that I created gVCF files I am going to merge them.

How can I do slicing?

GATK BED CRAM WGS • 2.0k views
ADD COMMENT
0
Entering edit mode

What have you tried? I ask because you have bed as a tag and are thus aware of bed-format based tools.

ADD REPLY
0
Entering edit mode

I have tried to create a bed file for the hg38 reference genome with 50kb region length and 1kb padding. I used the following script to creat the bed file but I am not sure if the output is correct or not.

region_size <- 50000  # 50kb
padding <- 1000  # 1kb

# Initialize an empty list to store BED entries
bed_entries_list <- list()

# Generate BED entries for each chromosome
for (i in 1:nrow(chromosomes)) {
  chromosome <- chromosomes[i, 1]
  size <- as.numeric(chromosomes[i, 2])
  num_regions <- ceiling(size / region_size)

  # Generate BED entries for each region with padding
  for (j in 1:num_regions) {
    start <- (j - 1) * region_size + 1  # 1-based start position
    end <- min(j * region_size + padding, size)  # 1-based end position with padding, capped at chromosome size
    bed_entry <- c(chromosome, start, end)
    bed_entries_list[[length(bed_entries_list) + 1]] <- bed_entry
  }
}

bed_entries_matrix <- do.call(rbind, bed_entries_list)

The bed file I have created looks like this: But I am not sure if it is right way to do this. Then, I want to use this bed file to create CRAM chunks using samtools

chr1    1   51000
chr1    50001   101000
chr1    100001  151000
chr1    150001  201000
chr1    200001  251000
chr1    250001  301000
chr1    300001  351000
ADD REPLY
0
Entering edit mode

But I am not sure if it is right way to do this.

you want bedtools makewindows

Then, I want to use this bed file to create CRAM chunks using samtools

go

ADD REPLY
0
Entering edit mode

Your question has already been answered 5 months ago. Why are you asking it again?

ADD REPLY
0
Entering edit mode

I want to add the 1kb padding, but I did not get an answer from the last question. How can I add 1kb padding?

ADD REPLY
0
Entering edit mode

bedtools slop

ADD REPLY
0
Entering edit mode

how is it different from your previous question ? How to Split 3000 WGS CRAM files into 1Mbp length chunks

ADD REPLY
0
Entering edit mode

Do it per chromosome, not per 50kb bin. 50kb creates thousands of file, that's a big IO burden.

ADD REPLY
0
Entering edit mode

I am dealing with WGS data which computation time increases in multiple steps of GATK even per chromosome. Thus, I want very small regions to parallelize the computation.

ADD REPLY
0
Entering edit mode

What infrastructure do you have available? Do you have a HPC that even allows to run thousands of processes and hundreds of nodes in parallel?

ADD REPLY
0
Entering edit mode

Yes, I have a cluster allows to run many processes and hundreds of nodes in parallel.

ADD REPLY
0
Entering edit mode

hundreds of nodes in parallel.

I kind of doubt that the scheduler gives them to you all at once. As I said, my recommendation is a per-chromosome splitting (if at all) and then just let it run. Use resources to run samples in parallel, not to split a single sample into thousands of chunks. The overhead to merge that all in the end is big and to some extend error-prone unless you have a bullet-proof pipeline which (with respect) I doubt given that you ask here for help and use R for even creating the intervals (no offense, I know it must sound like it, but it really isn't).

ADD REPLY
0
Entering edit mode
13 months ago
Sd • 0

I noticed something. I already have a cram for chr22 reads ( input_chr22.cram ). Then I sliced this cram file and the output was empty in the chr22:1-10510058 .

$ samtools view -b -o output.cram input_chr22.cram chr22:1-10510058
$ samtools view output.cram 

The output was emply; however when added one coordinate:

$ samtools view -b -o output.cram input_chr22.cram chr22:1-10510059
$ samtools view output.cram 

A00393:49:H2JWTDSXX:3:2120:27661:13448  99      chr22   10510059        0       151M    =       10510230        322     GCAACTAATATGTATTTTCAAAGCATTATAAATACAGAGTGCTAAGTTACTTCACTGTGAAATGTAGTCATATAAAGAACATAATAATTATACTGGATTATTTTTAAATGGGCTGTCTAACATTATATTAAAATGTTTCATCAGTAATTCA       FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF,FFFF:FFFFFFFF,FFFF:FFFFFFFFFFF:        PG:Z:MarkDuplicates     MQ:i:0  AS:i:146        XS:i:146        MD:Z:133G17    NM:i:1  RG:Z:novaseq01_2_180401_A00393_0049_BH2JWTDSXX.s_3_300_200.001

Since, it is a WGS file, why there is not any read in the 1-10510058 region?

ADD COMMENT
2
Entering edit mode

chr22 is telomeric...

ADD REPLY

Login before adding your answer.

Traffic: 1048 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6