Size of Output BAM file bigger than the SUM Size of Input files after merging with samtools
5.0 years ago
Giggle • 0

Hi, I'm new to bioinformatics and have a basic question about samtools.

I have 2 BAM files from the same species: File1.bam is 286.0 MB and File2.bam is 674.8 MB.

samtools merge File_1p2.bam -c File1.bam File2.bam

After running this, File_1p2.bam is 983.0 MB, which is bigger than the SUM of File1 and File2 (960.8 MB).

WHY??

Thank you in advance~

ChIP-Seq samtools • 1.9k views

The size of a BAM file doesn't necessarily track the number of reads it contains. Use a command such as samtools flagstat to check whether the number of reads in the merged file equals the sum of those in File1 and File2.
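
For what it's worth, here is a rough Python sketch of that check. It assumes samtools is on your PATH and that the first line of the flagstat output has the usual "N + M in total (QC-passed reads + QC-failed reads)" form. The helper names and the sample line are made up for illustration.

```python
import subprocess


def total_from_flagstat(text):
    """Return QC-passed + QC-failed record count from flagstat's first line."""
    parts = text.splitlines()[0].split()
    # Expected shape: "<passed> + <failed> in total (...)"
    return int(parts[0]) + int(parts[2])


def flagstat_total(bam_path):
    """Run `samtools flagstat` on a file (assumes samtools is installed)."""
    result = subprocess.run(
        ["samtools", "flagstat", bam_path],
        capture_output=True, text=True, check=True,
    )
    return total_from_flagstat(result.stdout)


# Made-up flagstat line, only to show the parsing step.
sample = "1000 + 0 in total (QC-passed reads + QC-failed reads)\n"
print(total_from_flagstat(sample))  # 1000

# With real files you would compare, for example:
# flagstat_total("File_1p2.bam") == flagstat_total("File1.bam") + flagstat_total("File2.bam")
```

If the totals match, the merge kept every record and the size difference is purely a compression effect.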


Thanks for this clarification!


Try sorting File_1p2.bam. That should reduce the size.

5.0 years ago
jkbonfield ★ 1.3k

I am assuming both files are chromosome / position sorted first, meaning the output will be too.

Are the two BAM files from different technologies? Compression tools such as gzip (the same algorithm is used exclusively in BAM) benefit from having similar looking data aggregated together. This is why position sorting a BAM file usually gives a smaller file (it aggregates like sequence together, although possibly makes read names less like their neighbour).

If you have very different styles of data, such as different read name patterns, different quality distributions or sequence error models, or very different sets of auxiliary tags, then you may find that the merged file moves the LZ matches ("deduplication" points) further apart and harms compression compared to the two files side by side.
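
A toy illustration of that effect, using Python's zlib (DEFLATE) on synthetic data rather than real BAM records. The two "styles" here are just random text drawn from different alphabets, which is an assumption made purely for the demo. Real data will show a much smaller difference.

```python
import random
import zlib


def make_records(alphabet, n_records, record_len, seed):
    """Build pseudo-random fixed-length records from one alphabet (toy data)."""
    rng = random.Random(seed)
    return [
        "".join(rng.choices(alphabet, k=record_len)).encode("ascii")
        for _ in range(n_records)
    ]


def deflate_size(records):
    """Bytes used after concatenating the records and compressing with zlib."""
    return len(zlib.compress(b"".join(records), 6))


# Hypothetical style A: sequence-like text over a 4-letter alphabet.
style_a = make_records("ACGT", 2000, 100, seed=1)
# Hypothetical style B: quality-like text over 41 printable characters.
style_b = make_records("".join(chr(c) for c in range(33, 74)), 2000, 100, seed=2)

# "Merged" stream: records from the two styles alternate, loosely mimicking
# how a position-sorted merge interleaves the inputs.
interleaved = [rec for pair in zip(style_a, style_b) for rec in pair]

separate_total = deflate_size(style_a) + deflate_size(style_b)
merged_total = deflate_size(interleaved)

print("compressed separately:", separate_total)
print("compressed interleaved:", merged_total)
```

In this sketch the interleaved stream comes out larger than the two separate streams combined, because the compressor has to share one set of statistics across both kinds of data.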

It's also possible that the merge is adding per-record PG and RG tags which weren't previously necessary. You'd need to "samtools view" a few records to see if something has been added.
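
As a rough sketch of that comparison, the snippet below lists which optional tags appear on one record but not another. It assumes you paste in tab-separated SAM lines as printed by "samtools view". The function names and the example records are made up for illustration.

```python
def aux_tag_names(sam_line):
    """Return the two-letter names of the optional tags on one SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 columns are mandatory; the rest are TAG:TYPE:VALUE.
    return {field.split(":", 1)[0] for field in fields[11:]}


def added_tags(before_line, after_line):
    """Tags present on the merged record but absent from the original one."""
    return aux_tag_names(after_line) - aux_tag_names(before_line)


# Made-up records for illustration only; real input would be the same read
# taken from the original file and from the merged file.
before = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
after = before + "\tRG:Z:File1"

print(sorted(added_tags(before, after)))  # ['RG']
```

An empty result for a handful of reads would suggest the merge did not add per-record tags.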

You're only seeing a 2-3% growth, so it's pretty subtle and wouldn't necessarily require much variation between sets to cause this. If you're worried about space, I'd advise switching to CRAM, although I'm rather biased on that point.
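
The 2-3% figure follows directly from the sizes given in the question. A quick check in Python:

```python
# File sizes in MB, as reported in the question.
file1_mb = 286.0
file2_mb = 674.8
merged_mb = 983.0

input_sum_mb = file1_mb + file2_mb
growth_pct = (merged_mb - input_sum_mb) / input_sum_mb * 100

print(f"sum of inputs: {input_sum_mb:.1f} MB")  # 960.8 MB
print(f"growth: {growth_pct:.1f}%")             # 2.3%
```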
