Question

Indexing Compressed Fastq-Like Data

6

Entering edit mode

14.4 years ago

2184687-1231-83- ★ 5.1k

Hi,

I would like to ask what is the best option to have FASTQ-like information indexed and compressed at the same time. I have seen cdbfasta/cdbyank can index fastq files but not if they are compressed, and I've also seen there are fastq2bam conversion tools that can create unaligned bam files, which I suppose can be indexed.

What is the best option that is a good compromise between speed/space taken? Does anybody know of any tool that will allow retrieving indexed sequences by sequence itself as well as by sequence id?

Cheers

fastq • 9.2k views

ADD COMMENT • link updated 18 months ago by Ram 44k • written 14.4 years ago by 2184687-1231-83- ★ 5.1k

Ram · Answer 1 · 2010-08-06

4

Entering edit mode

14.4 years ago

Aaron Statham ★ 1.1k

Doesn't look like BAM will be able to index unaligned reads, quote from the paper Pierre linked to:

A position-sorted BAM file can be indexed.

You could try putting your reads into a database (eg I use sqlite3 for some projects) and index the table by the read ID and sequence

ADD COMMENT • link updated 5.7 years ago by Ram 44k • written 14.4 years ago by Aaron Statham ★ 1.1k

Ram · Answer 2 · 2010-08-06

As mentioned in other answers, BAM only fulfills part of your requirement (compression and random access), but not indexing. However, you can easily roll your own index using the BGZF API and your key-value store of choice e.g. Berkeley DB.

Here's an example using my cl-sam API, but you can substitute e.g. the C API that comes with Samtools or the Java API from Picard (or a Swig wrapper). I'm just writing data to a text file, instead of BDB, but you get the idea...

(defun bamdex (bam-file index-file)
  (with-open-file (index index-file :direction :output)
    (with-bgzf (bgzf bam-file :direction :input)
      (read-bam-meta bgzf)
      (loop
         for offset = (bgzf-tell bgzf)
         for record = (read-alignment bgzf)
         while record
         do (format index "~s ~d~%" (read-name record) offset)))))

Keys and values:

"ENSDART00000000005_480_670_12d" 2269928771

The key is the read name, the number is the virtual offset into the uncompressed data (see the SAM spec). Use with a BGZF seek to reach the record:

(with-bgzf (bgzf "test.bam")
  (bgzf-seek bgzf 2269928771)
  (read-name (read-alignment bgzf)))

gives

"ENSDART00000000005_480_670_12d"

If the reads are very long, you might consider using a sequence checksum as a key instead of the actual sequence.

Ram · Answer 3 · 2010-08-06

4

Entering edit mode

14.4 years ago

Haibao Tang 3.0k

If you care about the actual implementation - there is an excellent post by brentp on how to index files in general. Formats like FASTA, FASTQ and SAM were used as examples.

ADD COMMENT • link updated 5.7 years ago by Ram 44k • written 14.4 years ago by Haibao Tang 3.0k

0

Entering edit mode

This doesn't work on compressed data. Also, it uses TokyoCabinet BDB which I've found doesn't scale to the number of records found in real data (>10^7). That's why I suggested Berkeley DB.

ADD REPLY • link updated 5.7 years ago by Ram 44k • written 14.4 years ago by biobot 0.0.77.a.1099 6.2k

0

Entering edit mode

@Keith, have you found that problem with BDB? I tested with 20million records and saw no slow-down? HDB, on the other hand slows down at 10^6. and tokyo-cabinet can compress the data.

ADD REPLY • link updated 5.7 years ago by Ram 44k • written 14.4 years ago by brentp 24k

0

Entering edit mode

Yes, BDB. See timings here using the C API directly. Each time is for the insertion of 10^6 records. You can see that it takes 3 sec to start, with, but takes 180 sec once there are 10^7 records. Compression helps somewhat at this point, but it's still not good. The time taken increases exponentially. What tuning parameters do you use?

ADD REPLY • link updated 5.7 years ago by Ram 44k • written 14.4 years ago by biobot 0.0.77.a.1099 6.2k

0

Entering edit mode

Would anyone know where that blog post is hosted these days?

ADD REPLY • link updated 5.7 years ago by Ram 44k • written 7.8 years ago by Andreas ★ 2.5k

0

Entering edit mode

web archive link

Look for fileindex on April 10, 2010

ADD REPLY • link updated 5.7 years ago by Ram 44k • written 7.5 years ago by samesense ▴ 40

Ram · Answer 4 · 2010-08-06

3

Entering edit mode

14.4 years ago

Pierre Lindenbaum 164k

I've no answer to that question but I know that the BAM format uses a special version of the zlib (BGZF) that divide the compressed file into some indexed chunk of data:

The advantage of BGZF over conventional gzip is that BGZF allows for seeking without having to scan through the entire file up to the position being sought.

See this article

ADD COMMENT • link updated 5.7 years ago by Ram 44k • written 14.4 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Doesn't look like BAM will be able to index unaligned reads, quote from the paper Pierre linked to:

A position-sorted BAM file can be indexed.

You could possibly try putting your reads into a database (eg I use sqlite3 for some projects) and index the table by the read ID and sequence

ADD REPLY • link updated 5.7 years ago by Ram 44k • written 14.4 years ago by Aaron Statham ★ 1.1k

Ram · Answer 5 · 2010-08-06

2

Entering edit mode

14.4 years ago

Casbon ★ 3.3k

Doesn't screed do something like this?

Docs

Shameless self-promoting link to my influence on this project

ADD COMMENT • link updated 5.7 years ago by Ram 44k • written 14.4 years ago by Casbon ★ 3.3k

0

Entering edit mode

This looks interesting but I don't see compression mentioned by quickly reading the document. Ideally, I would like to have both compression and fast access by seqID (and also sequence if possible).

ADD REPLY • link 14.4 years ago by 2184687-1231-83- ★ 5.1k

Ram · Answer 6 · 2010-08-09

Sorry if this is more an answer to: "What will, probably, be a good solution in the future". But as I saw, this problem is not really solved yet I put this in anyway.

The BioHDF project is possibly on the way to accomplish this, however it is still pretty much a prototype. It relies on the HDF5 library which natively features compression.

This README shows how a projected work-flow could look.

From their site:

New Features (April 2010)

NCList-based indexing for fast and accurate queries (reference)

SAM/BAM import and export

Better large dataset support (with more later in the month)

I am not sure if they are really already supporting indexing of FastQ/X sequences, but I guess in their ISMB 2009 presentation they said they would. The project has potential but is progressing rather slowly.

Ram · Answer 7 · 2011-12-07

1

Entering edit mode

13.1 years ago

Peter 6.0k

How about using BGZF (Blocked GNU Zip Format), the same mechanism as BAM files? This is a variant of GZIP but using 64KB blocks of data. See this blog post.

In principle you could also use BZIP2, which is also block based compression, but its better compression comes at the cost of higher CPU both for compression and decompression (see here).

With either of the above, you'd need a separate index file mapping the keys (e.g. record ID) to the block double offset, as explained in the above blog posts, which links to a proof of principle implementation of this on one of my Biopython branches of github, building on Biopython's existing sequencing indexing infrastructure (e.g. in memory mapping, or using SQLite3).

ADD COMMENT • link updated 5.7 years ago by Ram 44k • written 13.1 years ago by Peter 6.0k

0

Entering edit mode

I see now Pierre had also suggest BGZF

ADD REPLY • link updated 5.7 years ago by Ram 44k • written 13.1 years ago by Peter 6.0k

0

Entering edit mode

Looking back over the other answers, Keith James also suggested BGZF.

ADD REPLY • link updated 5.7 years ago by Ram 44k • written 13.0 years ago by Peter 6.0k