I have some large BAM files (~100 GB) from which I would like to extract FASTQ sequences. The most straightforward way is to run a tool like bam2fastq on them directly. To speed this up, however, I tried splitting the large BAM files (sorted by coordinate, not by read name) chromosome by chromosome, and then extracting from each per-chromosome BAM file.
Then I realized that this will miss read pairs whose two ends map to different chromosomes. So is there a better way to split the BAM files first and then extract from each part, in order to improve speed? Thanks.
Edit: Is there a BAM-splitting tool that can split a BAM evenly into, say, 10 parts?
To better explain my question:
If I split my BAM file by chromosome, say by extracting the chr1 alignments from test.bam into a file named test_1.bam:
$ /share/bin/samtools-0.1.16/samtools view test_1.bam |wc -l
131168
Then, if I try to extract FASTQ from test_1.bam:
$ bam2fastq -o test_%#_sequence.txt test_1.bam -f
This looks like paired data from lane 1.
Output will be in test_1_1_sequence.txt and test_1_2_sequence.txt
131168 sequences in the BAM file
131168 sequences exported
WARNING: 3348 reads could not be matched to a mate and were not exported
Obviously, 3348 reads were NOT extracted. This is because test_1.bam only includes reads mapping to chr1, and for those unmatched reads it is highly likely that only one end maps to chr1 while the mate sits elsewhere. bam2fastq only exports reads whose mate is present in the same file, excluding those orphans.
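A minimal sketch for checking this explanation, assuming the alignments are fed in as SAM text lines. The function name `count_cross_chrom_mates` is hypothetical; it counts paired records whose RNEXT field (column 7) names a reference other than the read's own RNAME (column 3). It does not account for other possible causes of orphans, so the result should only be expected to land in the same ballpark as the warning.

```python
def count_cross_chrom_mates(sam_lines):
    """Count paired SAM records whose mate is placed on a different reference."""
    count = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname, rnext = int(fields[1]), fields[2], fields[6]
        is_paired = flag & 0x1  # FLAG bit 0x1: read is part of a pair
        # RNEXT is "=" when the mate is on the same reference, "*" when unknown
        if is_paired and rnext not in ("=", "*") and rnext != rname:
            count += 1
    return count
```

It could be called as `count_cross_chrom_mates(sys.stdin)` in a small script that receives the output of `samtools view test_1.bam` through a pipe.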
So I would say that splitting the BAM by read name / number of reads seems to be the only way out. I may need to sort the BAM by read name first.
Thanks. But how can I quickly split a BAM that is already sorted by read name?
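A rough sketch of the kind of split meant here, assuming the name-sorted alignments are streamed as SAM text (for example from `samtools view -h`). The function names `assign_parts` and `split_to_files` and the output naming `<prefix>_<i>.sam` are hypothetical. Read names are dealt out round-robin, so all records sharing a name (both mates) land in the same part and the parts hold roughly equal numbers of read names.

```python
def assign_parts(sam_lines, n_parts):
    """Yield (part_index, line) pairs for name-sorted SAM text.

    Header lines are yielded with part_index None so the caller can
    copy them into every part. Consecutive records with the same read
    name always get the same part index.
    """
    part = -1
    prev_name = None
    for line in sam_lines:
        if line.startswith("@"):  # header line
            yield None, line
            continue
        name = line.split("\t", 1)[0]  # QNAME is the first column
        if name != prev_name:
            part = (part + 1) % n_parts  # move on to the next part
            prev_name = name
        yield part, line


def split_to_files(sam_lines, n_parts, prefix):
    """Write each part to <prefix>_<i>.sam (hypothetical naming scheme)."""
    outs = [open(f"{prefix}_{i}.sam", "w") for i in range(n_parts)]
    try:
        for part, line in assign_parts(sam_lines, n_parts):
            targets = outs if part is None else [outs[part]]
            for handle in targets:
                handle.write(line)
    finally:
        for handle in outs:
            handle.close()
```

This only works if the input really is grouped by read name; on a coordinate-sorted file the mates would be scattered across parts. The outputs are SAM text, so they would need converting back to BAM (for example with `samtools view -bS`) if the downstream tool requires BAM input.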