Hi!
I am trying to shrink a large number of BAM files (around 60, each ~200 GB) because of disk space limitations, by removing base qualities, tags, and other unwanted information. For copy number analysis, base qualities, tags, and duplicates are of no use to me. The only thing I care about is the mapping quality (MAPQ), since I filter out low-quality reads!
Currently, I am using the bamUtils squeeze command. I don't know yet how well this tool shrinks a BAM file! The squeeze subcommand can replace QNAME with an integer, remove duplicates, and remove the OQ tag (i.e. the original base qualities), but it cannot remove the QUAL field. For the QUAL field, however, the tool provides a binning option (to reduce the number of distinct quality scores).
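To illustrate why binning helps: fewer distinct quality characters means the QUAL column compresses better. The following is only a minimal Python sketch of the idea. The bin edges and representative values in `BINS` are made up for illustration and are not the ones bamUtils uses, and the function names are hypothetical.

```python
# Hypothetical bins: (inclusive upper Phred bound, representative Phred value).
# These edges are an assumption for illustration, not bamUtils' actual scheme.
BINS = [(9, 6), (19, 15), (29, 25), (93, 37)]


def bin_phred(q: int) -> int:
    """Map one Phred score to the representative value of its bin."""
    for upper, representative in BINS:
        if q <= upper:
            return representative
    return BINS[-1][1]  # anything above the last edge falls in the top bin


def bin_qual_string(qual: str, offset: int = 33) -> str:
    """Bin every character of a Phred+33 encoded QUAL string."""
    if qual == "*":  # '*' means no qualities are stored
        return qual
    return "".join(chr(bin_phred(ord(c) - offset) + offset) for c in qual)
```

With these assumed bins, any QUAL string ends up using at most four distinct characters, which is what makes the compressed file smaller.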
Previously, I used cgat bam2bam method=strip-quality, which deletes only the QUAL field. This tool is slow (~12 hours for a 160 GB BAM file) and didn't free much space: the modified BAM file was only 3 GB smaller than the 160 GB original.
I was wondering whether deleting everything that comes after SEQ in a SAM/BAM record (i.e. QUAL and all the optional tags) would work? If so, what would be the fastest way to do that? Or is there an existing tool that I was not able to find?
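To make the idea concrete, here is a rough Python sketch of what "delete everything after SEQ" could look like on SAM text. It keeps the first ten mandatory columns, writes `*` in the QUAL column (which, as far as I understand the SAM format, means "qualities not stored"), and drops all optional tags. The function names are hypothetical, and this is untested on real data, so treat it as a sketch rather than a vetted tool.

```python
def strip_sam_line(line: str) -> str:
    """Return a SAM line with QUAL set to '*' and all optional tags removed."""
    if line.startswith("@"):
        return line  # header lines pass through unchanged
    fields = line.rstrip("\n").split("\t")
    # Mandatory columns: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT,
    # TLEN, SEQ, QUAL. Everything after the 11th column is an optional tag.
    core = fields[:10]
    core.append("*")  # placeholder QUAL: no base qualities stored
    return "\t".join(core) + "\n"


def filter_stream(src, dst) -> None:
    """Copy SAM text from src to dst, stripping QUAL and tags on the way."""
    for line in src:
        dst.write(strip_sam_line(line))
```

Calling `filter_stream(sys.stdin, sys.stdout)` from a small script (say, a hypothetical strip_sam.py) would let it sit in a pipe such as `samtools view -h in.bam | python strip_sam.py | samtools view -b -o out.bam -`. A pure-Python filter like this will probably be slow on 200 GB files, so it is more a proof of concept than the fastest way.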
Thanks in advance for sharing your ideas!
EDIT1: My question might have been misleading, since I said "I only care about MAPQ". I also care about FLAG and SEQ, since later in the pipeline I will call variants; but only FLAG, MAPQ and SEQ are needed there, nothing else.
EDIT2: I now have the result from bamUtils squeeze, and I find it satisfactory. The BAM file is ~4-fold smaller when one:
- removes the OQ tag,
- removes the duplicates, and
- bins the base quality scores
It took less than 4 hours (3:51) to squeeze a 160 GB file down to 43.5 GB.
While it's possible to delete data from a BAM file, this question makes me a little uneasy, because it sounds like you're trying to fit a round peg into a square hole. Why not keep the raw BAM data on DVDs or something, and then just extract the information you need/want into a different file format?
For example, for CNV you might just want to know how many reads map to each position on the genome. In that case, why not just make a BED/BigWig file? If you need the MAPQ, you could make a BED file of the running MAPQ average/max/min. Two BED files of that sort would most likely be less than 100 MB each :)
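As a rough illustration of that idea, the Python sketch below summarises reads into fixed-size windows and reports, per window, the read count and the mean/min/max MAPQ as BED-like rows. The 1000 bp window, the function names, and the choice to skip unmapped reads are all my assumptions; adapt them to whatever your CNV tool expects.

```python
def reads_from_sam(lines):
    """Yield (chrom, pos, mapq) for each mapped read in SAM text lines."""
    for line in lines:
        if line.startswith("@"):
            continue  # skip header lines
        f = line.split("\t")
        flag, chrom, pos, mapq = int(f[1]), f[2], int(f[3]), int(f[4])
        if flag & 0x4:  # FLAG bit 0x4: read is unmapped
            continue
        yield chrom, pos, mapq


def window_stats(reads, window=1000):
    """Summarise reads per fixed window.

    reads: iterable of (chrom, pos, mapq), with 1-based pos as in SAM.
    Returns rows (chrom, start, end, n_reads, mean_mapq, min_mapq, max_mapq)
    using 0-based, half-open BED coordinates.
    """
    acc = {}  # (chrom, start) -> [count, mapq_sum, mapq_min, mapq_max]
    for chrom, pos, mapq in reads:
        start = ((pos - 1) // window) * window
        key = (chrom, start)
        if key not in acc:
            acc[key] = [1, mapq, mapq, mapq]
        else:
            s = acc[key]
            s[0] += 1
            s[1] += mapq
            s[2] = min(s[2], mapq)
            s[3] = max(s[3], mapq)
    # Chromosomes come out in plain string order (chr10 before chr2).
    return [(chrom, start, start + window, n, total / n, lo, hi)
            for (chrom, start), (n, total, lo, hi) in sorted(acc.items())]
```

Only running totals are kept per window, so memory use depends on the number of windows rather than the number of reads. Feeding it `reads_from_sam(sys.stdin)` behind `samtools view in.bam` would be one way to use it.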
Just a suggestion - I don't know your downstream steps, so maybe none of this is relevant.
That is a plausible thought, and I guess in the end I will do something like this: download a batch of BAM files (as many as the disk space permits), do the downstream analysis, delete the BAM files, and start a new batch!
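For the batching itself, a simple greedy grouping by size might be enough. This is just a sketch: the function name, the file names, and the 400 GB budget in the example are made up, and the sizes would have to come from wherever the files are hosted.

```python
def make_batches(sizes_gb, budget_gb):
    """Group files into consecutive batches that each fit within budget_gb.

    sizes_gb: dict mapping file name -> size in GB (insertion order is kept).
    """
    batches, current, used = [], [], 0.0
    for name, size in sizes_gb.items():
        if size > budget_gb:
            raise ValueError(f"{name} ({size} GB) exceeds the disk budget")
        if current and used + size > budget_gb:
            batches.append(current)  # close the batch that is full
            current, used = [], 0.0
        current.append(name)
        used += size
    if current:
        batches.append(current)
    return batches


# Example with made-up names and sizes:
example = make_batches(
    {"a.bam": 200, "b.bam": 160, "c.bam": 200, "d.bam": 100}, budget_gb=400
)
```

Each batch would then be downloaded, analysed, and deleted before the next one starts. Remember to leave headroom in the budget for the intermediate and output files.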