I have some large BAM files (more than 200 GB each) that need to be split for downstream analysis. My plan was to split each BAM by chromosome, until I realized that my reference genome contains hundreds of unplaced scaffolds.
The reference genome has 747 scaffolds in total, 11 of which are chromosomes, so there are 736 unplaced scaffolds. If I split only by chromosome, I would lose a lot of information.
Given that, how can I put all reads aligned to unplaced scaffolds into a single BAM and split the remaining reads by chromosome?
P.S. I would prefer to use samtools. I used to do the splitting with bamtools, but the files it generated were somehow missing the EOF marker.
Thanks.
Please see the answer here: samtools: splitting a bam file putting all scaffolds together
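For readers who cannot follow the link, here is a minimal sketch of one way to do this with samtools alone. It is not necessarily what the linked answer does. It assumes the BAM is coordinate-sorted and indexed (`samtools index`), and that a file listing the 11 chromosome names, one per line, already exists; the names `split_bam`, `chroms.txt`, and the output prefix are hypothetical.

```shell
#!/bin/sh
# Sketch: one BAM per chromosome, plus a single BAM for all unplaced scaffolds.
# Assumptions: the input BAM is coordinate-sorted and indexed, and the
# chromosome list (hypothetical chroms.txt) has one reference name per line.

split_bam() {
    in_bam=$1    # input BAM, e.g. sample.bam
    chroms=$2    # file listing chromosome names, e.g. chroms.txt
    prefix=$3    # output prefix, e.g. sample

    # One output file per chromosome, using the index for random access.
    while read -r chr; do
        samtools view -b -o "${prefix}.${chr}.bam" "$in_bam" "$chr"
    done < "$chroms"

    # Every reference name that is not a listed chromosome.
    # idxstats also reports '*' for unmapped reads, so that line is dropped too.
    scaffolds=$(samtools idxstats "$in_bam" | cut -f1 |
                grep -v -x -F -e '*' -f "$chroms")

    # With no region arguments samtools view would copy the whole file,
    # so stop here if nothing is left over.
    [ -n "$scaffolds" ] || return 0

    # All unplaced scaffolds go into one BAM.
    # $scaffolds is deliberately unquoted: each name becomes one region argument.
    samtools view -b -o "${prefix}.unplaced.bam" "$in_bam" $scaffolds
}

# Usage: split_bam sample.bam chroms.txt sample
```

Reads that are entirely unmapped are not covered by any region; if they matter, they would have to be extracted separately.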
Thank you! The accepted answer in that post works!
As for the other solution, I'm curious: how do you generate the BED file from the header?
Sorry, my questions are rather rudimentary.
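On the BED question: each `@SQ` line in a BAM header carries the sequence name (`SN:`) and length (`LN:`), so a BED interval covering the whole sequence is just `name`, `0`, `length`. The sketch below is one possible way to build that with awk; the function name `header_to_bed` and the file names in the usage comments are hypothetical, and it may differ from what the other answer had in mind.

```shell
#!/bin/sh
# Sketch: turn SAM/BAM header text (on stdin) into a BED file with one
# full-length interval per reference sequence.

header_to_bed() {
    awk -F '\t' '
        $1 == "@SQ" {
            name = ""; len = ""
            # Tags can appear in any order, so scan every field.
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^SN:/) name = substr($i, 4)
                if ($i ~ /^LN:/) len  = substr($i, 4)
            }
            # BED is 0-based and half-open: the whole sequence is 0..LN.
            if (name != "" && len != "") print name "\t0\t" len
        }'
}

# Usage (assumes samtools is installed; file names are placeholders):
#   samtools view -H sample.bam | header_to_bed > all_contigs.bed
#   # keep only sequences that are NOT in the chromosome list
#   awk 'NR == FNR { chr[$1]; next } !($1 in chr)' chroms.txt all_contigs.bed > unplaced.bed
#   samtools view -b -L unplaced.bed -o sample.unplaced.bam sample.bam
```

Note that `-L` on its own filters while streaming through the whole file, which can be slow on a 200 GB BAM. Recent samtools versions also have `-M` to make `view` use the index together with `-L`; check `samtools view --help` for your version.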