Hi All,
I wanted to create a sorted BAM file from a SAM file.
These are the steps I took:
samtools view -@ 8 -bhS input.sam -o mapped.bam
samtools sort -@ 8 mapped.bam -o sorted.bam
Once I had created mapped.bam and sorted.bam, I looked at the file sizes and saw a discrepancy. I had assumed the two files would be the same size, but they were not.
1.5G mapped.bam
974M sorted.bam
My questions are:
1) What am I doing wrong here?
2) Is there a way to check that the contents of these two files are the same? (I assume the contents should ideally be identical, since the file was only sorted.)
Thank you very much.
Size of BAM file reduces after sorting with samtools
Thanks Pierre Lindenbaum
Does it make any difference if I use Picard to sort BAM files?
Picard is slower, and it doesn't work the same way as samtools when sorting on queryname (!= coordinate).
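For reference, here is a minimal sketch of the two queryname sorts being compared. The function names and file arguments are placeholders, and the Picard line assumes a `picard` wrapper script on the PATH (as installed by some package managers) and the classic `I=`/`O=` argument style; adjust it to your installation. Since the two tools do not order read names the same way, the outputs need not be record-for-record in the same order.

```shell
#!/bin/sh
# Queryname sort with samtools: -n sorts by read name instead of coordinate.
sort_by_name_samtools() {
    samtools sort -n -@ 8 -o "$2" "$1"
}

# Queryname sort with Picard SortSam.
# Assumes a "picard" wrapper script; with a bare jar, use
# "java -jar picard.jar SortSam ..." instead.
sort_by_name_picard() {
    picard SortSam I="$1" O="$2" SORT_ORDER=queryname
}
```

Usage would be something like `sort_by_name_samtools mapped.bam name_sorted.bam`.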
The answer about sizes has already been given so I won't repeat it.
However, in answer to part 2, we locally use Biobambam's bamseqchksum tool to validate that a file operation hasn't lost data in the process, or that it has lost only the bits we know will be lost. For example, it can compute checksums of all the sequences and quality strings irrespective of order, and hence validate that they still exist and haven't been modified.
https://manpages.debian.org/unstable/biobambam2/bamseqchksum.1.en.html
That may look complex, but just do "bamseqchksum < input.bam" and you'll get some stats.
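If biobambam2 isn't available, a rough alternative is to compare the records using samtools and standard Unix tools. This is only a sketch under some assumptions: `records_md5` and `compare_bam_records` are made-up helper names, `md5sum` assumes GNU coreutils, and the header is deliberately excluded because sorting rewrites it. It checks whole records (all fields), not just sequences and qualities.

```shell
#!/bin/sh
# Order-independent checksum of the SAM records read from stdin.
# LC_ALL=C forces a bytewise sort so the result does not depend on locale.
records_md5() {
    LC_ALL=C sort | md5sum | cut -d ' ' -f 1
}

# Hypothetical helper: compare the records of two BAM files, ignoring order.
compare_bam_records() {
    # Bail out early if either file is truncated or unreadable.
    samtools quickcheck "$1" "$2" || return 2
    # "samtools view" without -h prints alignment records only, no header.
    sum_a=$(samtools view "$1" | records_md5)
    sum_b=$(samtools view "$2" | records_md5)
    if [ "$sum_a" = "$sum_b" ]; then
        echo "same records"
    else
        echo "records differ"
        return 1
    fi
}
```

Calling `compare_bam_records mapped.bam sorted.bam` should print "same records" if sorting only reordered the data. For large files the `sort` step needs temporary disk space, so comparing `samtools view -c` counts first is a cheaper sanity check.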