Question

Why Are SAM/BAM Files So Large?

14.2 years ago

Fixee ▴ 60

I'm a complete novice with zero background in bio; I spent the day yesterday trying to answer this question without any luck.

Reading the paper describing the SAM format, it says that the number of bps in an alignment set can exceed 100 billion for deep resequencing of a single human. Given that the human genome has about 3.3 billion bps, I would assume the reference string would be upper-bounded by this number. And assuming that "deep" means coverage of about 10x, we get 33 billion pairs, far below the number we were supposed to exceed. Diploid sequencing doubles this, but we still fall short. Questions:

1. What would cause us to exceed 100 billion bps?
2. Does a deep resequencing of a human require alignment against the 98% of the reference genome that is shared by all humans?
3. At 2 bits per nucleotide, the SAM file should be about 25 GB for 100 billion bps, but these files are often 500+ GB. Why?

To reiterate, I'm a complete novice. If you respond to this question, I would be deeply in your debt if you could use simple terminology.

sam • 20k views

ADD COMMENT • link updated 14.2 years ago by Ketil 4.2k • written 14.2 years ago by Fixee ▴ 60

score 8 · Answer 1 · 2011-06-07

SAM/BAM files contain a lot more than just the read sequence: there is the quality string, which takes 1 byte per read base (in both SAM and BAM files), plus the CIGAR string, the read ID, the flag, and tags. Furthermore, in SAM files the sequence takes 1 byte per base, while in BAM files it is stored at 4 bits per base. These numbers aren't exact, because BAM files are also block-compressed, so the number of bytes per base will be smaller, especially if the BAM file is sorted by reference target and offset.
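To make the per-read accounting concrete, here is a rough sketch of the uncompressed size of one alignment record in each format. The read length, read-name length, and the `other_text_len` allowance for the remaining SAM text columns are illustrative assumptions, not measurements. The 36 bytes of fixed-width fields in a BAM record follow my reading of the SAM/BAM specification, so check it there before relying on the figure.

```python
def sam_bytes_per_read(read_len, name_len, cigar_text_len=4, other_text_len=30):
    """Approximate size in bytes of one uncompressed SAM text line.

    other_text_len is an assumed allowance for flag, reference name,
    positions, mapping quality, tabs and the newline.
    """
    # SEQ and QUAL each cost one ASCII character per base.
    return name_len + 2 * read_len + cigar_text_len + other_text_len


def bam_bytes_per_read(read_len, name_len, n_cigar_ops=1, tag_bytes=0):
    """Approximate size in bytes of one BAM record before block compression."""
    fixed = 36                   # block_size (4 bytes) + 32 bytes of fixed-width fields
    name = name_len + 1          # read name is NUL-terminated
    cigar = 4 * n_cigar_ops      # each CIGAR operation is a 32-bit integer
    seq = (read_len + 1) // 2    # 4 bits per base, two bases per byte
    qual = read_len              # 1 byte per base
    return fixed + name + cigar + seq + qual + tag_bytes


read_len, name_len = 100, 40     # assumed values for illustration
for label, size in [("SAM", sam_bytes_per_read(read_len, name_len)),
                    ("BAM", bam_bytes_per_read(read_len, name_len))]:
    print(f"{label}: {size} bytes/read, {8 * size / read_len:.1f} bits/base")
```

Even before tags are counted, both formats land well above the 2 bits per base assumed in the question, and the quality string alone costs four times that.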

score 3 · Answer 2 · 2011-06-07

14.2 years ago

Ketil 4.2k

To try to answer your questions:

1. 100 Gbp isn't difficult to achieve: an Illumina HiSeq produces something like 100M reads (times two for paired ends), and I think that's just a single lane.
2. No, you're probably not interested in a lot of that for diagnostic purposes. But it's probably cheaper and simpler to sequence it all, rather than to try to PCR out the bits you are interested in.
3. Somebody already pointed to the quality data, which take up a significant (and poorly compressible) chunk of the SAM format. Use 'samtools view' on a BAM file to see the contents in detail (remember to pipe the output to less).

But basically, the reason files are large is that they contain lots of data. Sequencing is cheap, so we get lots of sequences.
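For readers without a BAM file at hand, the kind of line that 'samtools view' prints can be sketched in plain Python. The record below is hypothetical and invented purely for illustration. The eleven column names are the mandatory SAM columns, and `field_sizes` is just a helper name chosen here.

```python
# The 11 mandatory SAM columns, in order; optional tags follow them.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# Hypothetical alignment record (a made-up 10-base read), for illustration only.
EXAMPLE = ("read001\t99\tchr1\t10001\t60\t10M\t=\t10201\t210\t"
           "ACGTACGTAC\tIIIIIIIIII\tNM:i:0")


def field_sizes(line):
    """Return the character count of each column in one SAM alignment line."""
    cols = line.rstrip("\n").split("\t")
    sizes = {name: len(value) for name, value in zip(SAM_FIELDS, cols)}
    # Anything after the 11 mandatory columns is an optional tag.
    sizes["TAGS"] = sum(len(tag) for tag in cols[len(SAM_FIELDS):])
    return sizes


sizes = field_sizes(EXAMPLE)
total = sum(sizes.values())
for name, n in sizes.items():
    print(f"{name:6s} {n:3d} chars ({100 * n / total:.0f}%)")
```

With a toy 10-base read the bookkeeping columns look large; with real reads of around 100 bases, SEQ and QUAL would account for most of each line.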


@Ketil: Illumina HiSeq-2000 produces almost 80 million paired-end reads in a single lane

ADD REPLY • link 14.2 years ago by Gww ★ 2.7k

Yes, but we see a large variation, with fastq files ranging from (2x) 8G to almost 30G, the largest being over 100M reads.

ADD REPLY • link 14.2 years ago by Ketil 4.2k

score 1 · Answer 3 · 2011-06-07

14.2 years ago

2184687-1231-83- ★ 5.1k

In many resequencing standards, "deep" means coverage of about 30-40x. Part of the reason is sequencing errors, but the extra depth is also needed to find low-frequency variants and heterozygous SNPs.
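The arithmetic behind that is short enough to sketch. The 3.3 billion bp genome size is the figure from the question; the function name is just illustrative.

```python
GENOME_BP = 3.3e9  # human genome size used in the question


def total_bases(coverage, genome_bp=GENOME_BP):
    """Total sequenced bases needed to cover the genome `coverage` times."""
    return coverage * genome_bp


for cov in (10, 30, 40):
    print(f"{cov}x coverage -> {total_bases(cov) / 1e9:.0f} Gbp")
```

At 30x the total is already about 99 Gbp, and at 40x it is well past the 100 billion mentioned in the SAM paper.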


score 1 · Answer 4 · 2011-06-07

14.2 years ago

Jeremy Leipzig 23k

The Hsi paper shows a 1000 Genomes BAM file for one human chromosome using 17.48 bits/base (for everything, not just the sequence), and that a somewhat lossy, reference-based compression scheme could bring that down to an amazing 0.74 bits/base.

That's a huge improvement, the main drawbacks presumably being the processing time to make the file and the time penalties to use such a file.
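To put those rates in file-size terms, here is a small sketch that converts bits per base into gigabytes for the 100 billion bases mentioned in the question. Applying the single-chromosome rates to a whole data set is an extrapolation made here for illustration, not a result from the paper.

```python
def size_gb(n_bases, bits_per_base):
    """File size in gigabytes (10**9 bytes) at a given storage cost per base."""
    return n_bases * bits_per_base / 8 / 1e9


N_BASES = 100e9  # the 100 billion bases from the question

rates = [
    ("sequence only, 2 bits/base", 2),
    ("1000 Genomes BAM, 17.48 bits/base", 17.48),
    ("reference-based, 0.74 bits/base", 0.74),
]
for label, bits in rates:
    print(f"{label}: {size_gb(N_BASES, bits):.2f} GB")
```

Under these assumptions the same data would shrink from roughly 218 GB to under 10 GB.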


The 1000g files are huge because they keep two quality strings and a lot of other unnecessary information. If we do it right, it should cost ~10 bits/base in its current form, or <8 bits/base when we merge samples. Ultimately, reference-based compression is the future. More tools will be designed to work directly with such files, just as more and more tools work with SAM/BAM.

ADD REPLY • link 14.2 years ago by lh3 33k