Split A Bam File Into Several Files Containing All The Alignments For X Number Of Reads.
3
1
Entering edit mode
11.6 years ago
gaelgarcia05 ▴ 280

Hi everyone!

I am struggling with annotating a very big .bam file that was mapped using TopHat. The run was a large number of reads : ~200M. The problem is that when I now try to Annotate each read using a GFF file (with BEDTools Intersect Bed), the BED file that is made is huge : It is over 1.7TB ! I have tried running it on a very large server at the institution, but it still runs out of disk space. The IT dept increased $TMPDIR local disk space to 1.5TB so I could run everything on $TMPDIR, but it is still not enough.

What I think I should do is split this .BAM file into several files, maybe 15, so that each set of reads gets Annotated separately on a different node. That way, I would not run out of disk space. And when all the files are annotated, I can do execute groupBy on each, and them simply sum the number of reads that each feature on the GFF got throughout all the files.

However, there is a slight complication to this: After the annotation using IntersectBed, my script counts the number of times a read mapped (all the different features it mapped to) and assigns divides each read by the number of times it mapped. I.e, if a read mapped to 2 regions, each instance of the read is worth 1/2, such that it would only contribute 1/2 a read to each of the features it mapped to.

Because of this, I need to have all the alignments from the .BAM file that belong to each read, contained in one single file. That is to say, I can't simply split the BAM file into 15 files, because without luck, I could end up with a 2 BAM files that have the alignments of a single read split between them, leading to the division not being correct.

How can I instruct UNIX to count a certain number of unique reads on the BAM file, output all the alignments to a new file, and continue with the rest of the BAM file, such that all reads have their n alignments contained in one single file (but shared with other reads)?

Thank you!

bam sam rna-seq unix bwa tophat bedtools • 8.6k views
ADD COMMENT
0
Entering edit mode

you need to annotate each read ? what kind of annotation to you put in each read ?

ADD REPLY
0
Entering edit mode

Pierre,

I run the mapped .bam file through IntersectBed, which assigns a feature from a GFF file to each mapped read, i.e., to what gene (based on that gene's coordinates), that read corresponds, based on its coordinates given by TopHat.

ADD REPLY
0
Entering edit mode

ok, why do you need this ? I mean, do you run any downstream analysis 'by-gene' ? Can't you do this on the whole BAM ? e.g: to get mean depth per gene,I would feed the gatk with the bam and a bed.

ADD REPLY
0
Entering edit mode

The reason I can't do this on the whole BAM is because the .BED file is too big for the available disk space, which is ~1.5TB. The BED file gets to 1.7TB and stalls or crashes.. :(

ADD REPLY
2
Entering edit mode
11.6 years ago
sjneph ▴ 690

You might consider converting the bam file to starch format, which is a highly compressed version of BED. The starch file will be smaller than the original BAM file, and there is no loss of information. Once in starch format, you can use other BEDOPS utilities to annotate things the way you want. It will take some time to convert such a large BAM file to starch format, but once done, everything else will be fast. Currently, unmapped reads are discarded, but it's pretty straightforward to modify the bam2starch script to keep them if you really want them.

GFF can also be converted to BED or starch format with no loss of information. See gff2bed/gff2starch. See BEDOPS for all of these utilities. If you go this route, use bedmap or closest-features to annotate what you want.

ADD COMMENT
1
Entering edit mode
11.6 years ago

I tried understanding what exactly you want but couldn't get it. So far what I have understood is that you want all the alignments that belong to a particular read in one bam file. If this is correct, then the sloppy way of doing it would be sort your bam file using queryname. After than you can split your bam file by making sure that all alignments end up in the same bam file.

ADD COMMENT
0
Entering edit mode

Thanks, ashutoshmits:

I'll try to make myself more clear.

I need to split the huge BAM file into say 10 files, but the only constraint is that if one read's alignment is already in file a, all of its other alignments have to be in that file as well.

I don't discard multiple alignments, because I work with transposons and repeat sequences, so it's important to have all of a reads' alignments in the same sub-file, so I can appropiately count the number of annotated alignments and divide each one by the total number for that read.

ADD REPLY
1
Entering edit mode

Hi Carmeyeii89, If you think you will be using this bam file in future then you should focus on reducing the size of the bam file first. I am not sure if this is the case with you but you can remove all the unmapped reads (where both the reads in a pair remain unmapped). Now, coming back to your question. If I would have been in your place, I would use "NH:" tag in the BAM file to split all the aligned reads into two bam files. One with the uniquely mapped alignments and the other with the multiple mapping positions. This would reduce the file size. Next , you can further split the uniquely mapped file using chromosomes (~ 20 bam files) . You wont have to worry about the splitting as these are uniquely mapped. In the end, you can do combine the results.

ADD REPLY
0
Entering edit mode

That is very clever, ashutomits! I think I will do this, and I am also working on reducing the size of the GFF file to get rid of the unnecessary entries.

ADD REPLY
1
Entering edit mode
11.6 years ago

I am adding it as a separate answer because I want to maintain the readability.

I am copying and pasting this content from another post http://comments.gmane.org/gmane.science.biology.galaxy.user/2367 :

Tophat/bowtie don’t report mapping quality values that are as meaningful as BWA, but there is some information in the mapping quality values tophat reports. Tophat yields 4 distinct values for its mapping quality values (you can do a “unique” count on the mapping quality field of any SAM file from tophat to verify this):

255 = unique mapping

3 = maps to 2 locations in the target

2 = maps to 3 locations

1 = maps to 4-9 locations

0 = maps to 10 or more

Except for the 255 case, the simple rule that was encoded by the authors is the usual phred quality scale: MapQ = -10 log10(P)

Where P = probability that this mapping is NOT the correct one. The authors ignore the number of mismatches in this calculation and simply assume that if it maps to 2 locations then P = 0.5, 3 locations implies P = 2/3, 4 locations => P = 3/4 etc.

As you can clearly see, then MapQ = -10 log10(0.5) = 3; -10 log10(2/3) = 1.76 (rounds to 2); -10 log10(3/4) = 1.25 (rounds to 1), etc.

So, what you can do now is that instead of looking for NH: tag , you can do something like

samtools view -bq 255 Whooping_1.7Tb.bam > Hopefully_1Tb_unique.bam

and you should figure out some way to get a bam with reads that are multiply mapped. Let me know if you any questions.

ADD COMMENT

Login before adding your answer.

Traffic: 1830 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6