I need to split enormous BAM files into smaller pieces in order to parallelize haplotype calling. Other posts widely recommend splitting by chromosome. However, my files are so large (whole genomes at high coverage) that calling haplotypes on a single chromosome can still take more than a week.
I'd therefore like to split these further, e.g. split the chr1 BAM file into 8 smaller 'chunks'. However, it's essential that all the reads covering a particular region of a chromosome end up in the same 'chunk'; otherwise it would be impossible to accurately call the haplotype of that region. Can anyone suggest how I might split a BAM file and ensure this?
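To make the idea of 'chunks' concrete, here is a rough sketch of the kind of split I have in mind: cutting one chromosome into contiguous, non-overlapping coordinate windows. The function names `chunk_intervals` and `region_strings` are made up for illustration, and the chr1 length assumes the GRCh38 assembly. This only computes the windows; it does not by itself decide what happens to reads that span a window boundary.

```python
def chunk_intervals(chrom_length, n_chunks):
    """Split 1..chrom_length into n_chunks contiguous, non-overlapping
    1-based inclusive (start, end) intervals of near-equal size."""
    if n_chunks < 1 or chrom_length < n_chunks:
        raise ValueError("need 1 <= n_chunks <= chrom_length")
    base, extra = divmod(chrom_length, n_chunks)
    intervals = []
    start = 1
    for i in range(n_chunks):
        # The first `extra` chunks absorb the remainder, one base each.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        intervals.append((start, end))
        start = end + 1
    return intervals


def region_strings(chrom, intervals):
    """Format intervals as 'chrom:start-end' region strings."""
    return [f"{chrom}:{start}-{end}" for start, end in intervals]


if __name__ == "__main__":
    CHR1_LENGTH = 248_956_422  # assumption: chr1 length in GRCh38
    for region in region_strings("chr1", chunk_intervals(CHR1_LENGTH, 8)):
        print(region)
```

Each region string could be passed to `samtools view -b in.bam REGION` on an indexed BAM. Note that a region query returns every alignment overlapping the region, so a read spanning a boundary would appear in both neighbouring chunks, which is exactly the situation I'm unsure how to handle.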
I've tried using the 'split' function in alntools, which is intended to split bams "such that all the alignment of a same read appear only in a single chunk." However, the project is no longer actively maintained and there are dependency conflicts that mean it no longer works. Any suggestions would be much appreciated.
Do you mean you want to make sure that the chunks' reads do not overlap (i.e. each alignment occurs in only one chunk)? Are there any other criteria?