Hi, I used the STAR aligner to map my reads, and the output BAM files were sorted by coordinate. I used the following command to extract unique reads from my BAM files:
samtools view -q 255 input_file.bam > unique_reads.bam
(A MAPQ value of 255 corresponds to unique alignments in STAR; -q filters on MAPQ, not on the SAM flag)
However, the sizes of my new BAM files have increased several-fold. (For example, a BAM file that was originally 500-900 MB has now become 2.5 GB.) This has happened for all the samples.
When I check the number of lines in the BAM files (the old one and the one containing the unique reads), the old file (about 500 MB) has 44 million lines, while the new file (about 2 GB) has 17 million lines. The line counts are as expected.
I have checked in the header of both the bam files that both are sorted by coordinate.
So, could anyone tell me why the file with fewer lines is so much larger?
Agreed. Still, in the most recent samtools versions you would not even need to set any flags, as it recognizes the file format based on the suffix if you use -o instead of redirecting stdout, like
samtools view -q 255 -o unique_reads.bam input_file.bam
With your current command you produced a SAM instead of a BAM file, and without a header, as -h was missing. When using -b, -h is implied.
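If it helps, here is a rough sketch of how one might check whether an output file is really compressed like a BAM or is plain SAM text, and then redo the filtering with explicit BAM output. The helper name bam_or_sam is made up for illustration and is not part of samtools. It relies only on the fact that BAM is BGZF-compressed and so starts with the gzip magic bytes 1f 8b. The samtools step is skipped if samtools or input_file.bam is not present.

```shell
# Hypothetical helper (not part of samtools): report whether a file starts
# with the gzip/BGZF magic bytes (as a real BAM does) or looks like plain text.
bam_or_sam() {
    magic=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    if [ "$magic" = "1f8b" ]; then
        echo "compressed"
    else
        echo "text"
    fi
}

# Redo the filtering only if samtools and the input file are available.
if command -v samtools >/dev/null 2>&1 && [ -f input_file.bam ]; then
    # -b forces BAM output; -o writes to the named file instead of stdout
    samtools view -b -q 255 -o unique_reads.bam input_file.bam
    bam_or_sam unique_reads.bam          # should report "compressed"
    samtools view -c unique_reads.bam    # number of retained alignments
fi
```

Running bam_or_sam on the file produced by the original redirect command should report "text", which would explain why a file with fewer records is several times larger than the compressed input.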