Question

Extract Fastq From Large Bam File

1

Entering edit mode

12.6 years ago

michealsmith ▴ 800

I have some large bam files (~100GB), from which I would like to extract fastq sequences. Of course the most straightforward way is to directly apply softwares like bam2fastq. However, in order to speed up, I tried to split the large bam files (sorted according to coordinate but not read name) chromosome by chromosome, and would extract based on each bam file.

Then I realized, this will miss those read paires which map to different chromosomes. So I'm just confused any optimized way to split bam files first and then extract for each of bam, so that to improve speed? thx

Edit: Is there any bam split tools, which can evenly split bam into, say 10 parts?

To better explain my question: If I split my bam file by chromosome, say extract chr1 alignments from test.bam, which is named: test_1.bam

$ /share/bin/samtools-0.1.16/samtools view test_1.bam |wc -l
131168

Then if I tried to extract fastq from test_1.bam

$ bam2fastq -o test_%#_sequence.txt test_1.bam -f 
This looks like paired data from lane 1.
Output will be in test_1_1_sequence.txt and test_1_2_sequence.txt
131168 sequences in the BAM file
131168 sequences exported
WARNING: 3348 reads could not be matched to a mate and were not exported

Obviously, 3348 reads are NOT extracted, this is because test_1.bam only include any read mapping to chr1; and for those unmapped reads, it's highly likely that only one end maps to chr1. And bam2fastq will only extract paired-end reads, exluding those orphans.

So I would say, splitting BAM based on name/# of reads seem the only way out. I may need to sort bam according to read name.

fastq • 7.0k views

ADD COMMENT • link 12.6 years ago by michealsmith ▴ 800

score 1 · Answer 1 · 2012-04-20

1

Entering edit mode

12.6 years ago

swbarnes2 14k

Here's something I found on seqanswers about spliting a bam file by number of entries. If you sort your .bam by read name, which Picard can do, then you can get the fastq with pairs next to each other.

http://seqanswers.com/forums/showthread.php?t=16615

ADD COMMENT • link 12.6 years ago by swbarnes2 14k

0

Entering edit mode

but I know, thanks. But how can I quickly split my bam already sorted by read name?

ADD REPLY • link 12.6 years ago by michealsmith ▴ 800

score 0 · Answer 2 · 2012-04-20

0

Entering edit mode

12.6 years ago

Sukhi Singh 11k

I think better would be to split it according to chromosomes. Check this tool, its a python script. It indexes, sorts and splits your bamfile in to chromosomes.

Also, check this question, they talk here about splitting bam files. After splitting, them you can use bam2fastq, isn't it!!!!

Cheers

ADD COMMENT • link 12.6 years ago by Sukhi Singh 11k

0

Entering edit mode

could you please edit the first link - seem to be wrong

ADD REPLY • link 12.6 years ago by Istvan Albert 101k

0

Entering edit mode

Done. Thanks!!!

ADD REPLY • link 12.6 years ago by Sukhi Singh 11k

0

Entering edit mode

But if split according to chromosome, those read pairs mapped to two different chromosomes will be distributed into different files. Then how can I reconstruct the paired-end reads?

ADD REPLY • link 12.6 years ago by michealsmith ▴ 800

0

Entering edit mode

Hi Gerry, Sorry I have no idea, how this will happen. According to Illumina Paired end note, the paired end is reading both the forward and reverse template strands of each cluster during one paired-end read. So, why they will map to different chromosomes, unless they are non-unique (MAPQ <1). Also, If you say they will be in different files, why you care, wont you be merging the fastq files back. Check this ques, its about identifying the mate pairs in different files.

ADD REPLY • link 12.6 years ago by Sukhi Singh 11k

score 0 · Answer 3 · 2012-04-20

Hi Sukhdeep, look at below:

HWI-ST195_0162:8:26:12608:92688#NANCGN  145 18  85  23  10M1I89M    5   64409   0   CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTTAACCCTAACCCTTAACCCTAACCCTAACCCTAACC    ########################BAE?:AEEEBE@A?C;=>CDD@B>ABC;<7;>F?D<DBC??@>BDD;BEFGGBFGEGHHHHHGIGFIEGGGFGEHH    XT:A:U  NM:i:1  SM:i:23 AM:i:0  X0:i:1  X1:i:2  XM:i:0  XO:i:1  XG:i:1  MD:Z:99

For example, for the read-pair above, one end map to chr18, while the other map to chr5 on the reference genome. And this is not unusual to see, due to either translocation or segmental duplication.

I'll better explain my problems by editting the post soon.