Working With Large *.bam Files
14.3 years ago
User 4133 ▴ 150

Hello everyone,

This is my first time using BioStar to ask a question, and I would like to congratulate everyone who contributes.

Now, my question: I'm working with a very large *.bam file (about 65 GB) of paired-end reads from an entire transcriptome. My final goal is to find translocations, so as a preliminary step I'm looking for read pairs whose two mates map to different chromosomes.

Because of the file size, I can't process the .bam file as if it were a plain text file. How can I overcome this problem? Has anyone used the 'pysam' module for Python? Any other ideas?

Thanks

14.3 years ago

Each record in the SAM/BAM file contains the reference sequence name and the mate reference sequence name. You can stream through the file looking for records where these two names are different. This will identify pairs that have their ends mapped to different chromosomes. No need for any data transformations at all; the process is the same whether your SAM/BAM file is 1 GB or 100 GB.

e.g.

samtools view myfile.bam | perl -ne '@f=split; print if $f[6] ne "=" && $f[4] >= 20'

This prints any SAM record whose mapping quality (MAPQ, column 5, which is $f[4] in Perl's zero-based indexing) is >= 20 and whose mate reference name (RNEXT, column 7, $f[6]) is not "=", i.e. the mate is not recorded as mapping to the same reference sequence as the read itself. This is just a simplified example; you might want to look at the alignment flags field too.

Be aware that the mate reference name may appear as * if the mate-pair fields have not been set in your BAM file. This will depend on how the BAM file was made.

Also, it's probably a bad idea to convert a BAM file that size to SAM because of disk I/O overheads; operate on a stream instead.
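For anyone who prefers Python, the same streaming idea can be sketched with the standard library alone. This is only an illustration, not the one-liner above: the function name `interchromosomal_records`, the sample records, and the choice to skip records whose mate reference is `*` or whose flags mark the read or its mate as unmapped are all assumptions made for the example.

```python
def interchromosomal_records(lines, min_mapq=20):
    """Yield SAM records (as field lists) whose mate maps to another reference."""
    for line in lines:
        if line.startswith("@"):          # header line, if headers were included
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # a SAM record has 11 mandatory columns
            continue
        flag = int(fields[1])             # FLAG, column 2
        rname = fields[2]                 # RNAME, column 3
        mapq = int(fields[4])             # MAPQ, column 5
        rnext = fields[6]                 # RNEXT, column 7
        if not flag & 0x1:                # 0x1: read is part of a pair
            continue
        if flag & 0x4 or flag & 0x8:      # 0x4 read unmapped, 0x8 mate unmapped
            continue
        # Assumption: "*" (mate information not set) is treated as uninformative.
        if rnext in ("=", "*") or rnext == rname:
            continue
        if mapq >= min_mapq:
            yield fields


# Made-up records for illustration only; real input would be streamed.
SAMPLE = [
    "r1\t97\tchr1\t100\t60\t4M\tchr5\t2000\t0\tACGT\tIIII\n",  # mate on chr5
    "r2\t99\tchr1\t300\t60\t4M\t=\t500\t204\tACGT\tIIII\n",    # same chromosome
    "r3\t97\tchr2\t100\t5\t4M\tchr7\t900\t0\tACGT\tIIII\n",    # MAPQ too low
    "r4\t73\tchr3\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n",        # mate unmapped
]

for record in interchromosomal_records(SAMPLE):
    print("\t".join(record))
```

To run this on real data, one would pass `sys.stdin` instead of `SAMPLE` and pipe `samtools view myfile.bam` into the script, so that only one record is held in memory at a time.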

14.3 years ago

BAM is a binary file, so you can't use it as a .txt file.

If you use SAM, can you just build a file listing every read-pair/chromosome combination and then run sort/uniq on that file?

If memory is a problem, I would build a database of [pair-id, chrom] entries with a SQL engine or, even better, a key/value engine (Berkeley DB, etc.).
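A minimal sketch of the database idea, assuming SQLite (via Python's built-in `sqlite3` module) stands in for the SQL engine. The table name `hits`, the function name `chromosomes_per_pair`, and the sample records are hypothetical. Passing a file path as `db_path` keeps the table on disk rather than in memory.

```python
import sqlite3


def pair_chrom_rows(sam_lines):
    """Yield (read name, reference name) for each mapped SAM record."""
    for line in sam_lines:
        if line.startswith("@"):              # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 11 and fields[2] != "*":
            yield fields[0], fields[2]        # QNAME, RNAME


def chromosomes_per_pair(sam_lines, db_path=":memory:"):
    """Return {pair_id: [chromosomes]} for pairs seen on more than one chromosome."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS hits (pair_id TEXT, chrom TEXT)")
    con.executemany("INSERT INTO hits VALUES (?, ?)", pair_chrom_rows(sam_lines))
    con.commit()
    rows = con.execute(
        """
        SELECT pair_id, GROUP_CONCAT(DISTINCT chrom)
        FROM hits
        GROUP BY pair_id
        HAVING COUNT(DISTINCT chrom) > 1
        """
    ).fetchall()
    con.close()
    # Assumes reference names contain no commas.
    return {pair_id: sorted(chroms.split(",")) for pair_id, chroms in rows}


# Made-up records for illustration only: both mates of a pair share one QNAME.
SAMPLE = [
    "r1\t97\tchr1\t100\t60\t4M\tchr5\t2000\t0\tACGT\tIIII\n",
    "r1\t145\tchr5\t2000\t60\t4M\tchr1\t100\t0\tACGT\tIIII\n",
    "r2\t99\tchr1\t300\t60\t4M\t=\t500\t204\tACGT\tIIII\n",
    "r2\t147\tchr1\t500\t60\t4M\t=\t300\t-204\tACGT\tIIII\n",
]

print(chromosomes_per_pair(SAMPLE))
```

The grouping is done by the database engine, so the script itself never needs to hold the full list of pairs; for a 65 GB input an index on `pair_id` would probably be worth adding.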

14.3 years ago
User 4133 ▴ 150

Thank you. In my question I omitted that the conversion from BAM to SAM has already been done; my problem is just memory. I will try this. Thanks.

14.3 years ago
User 4133 ▴ 150

Thank you, Keith James. I have another question: how can you compute the mapping score from the binary code? Can you sum over all alignment positions? And, in that case, what is the best score in your opinion?


Hi, ilwollo. Could you open a new question for this? If you want to 'Add Another Answer' to your own question, it should be to offer a solution. If you just want to reply to an answer, please use the 'add comment' link instead.
