Tool:Seq_File: C Library For Reading Multiple Bioinformatics Sequence File Formats
2
3
Entering edit mode
11.9 years ago

https://github.com/noporpoise/seq_file

The aim is to provide a C library that allows programs to transparently read sequence data from multiple file formats, without having to worry about the format. Pass a file to seqfileopen() and then read sequences using seqnextread() without having to worry about what the file format is.

Currently supports:

  • SAM & BAM
  • FASTA (& gzipped fasta)
  • FASTQ (& gzipped fastq)
  • 'plain' format (one sequence per line [.txt]) (& gzipped plain [.txt.gz])
fasta fastq bam sam • 4.7k views
ADD COMMENT
0
Entering edit mode

Do you have any sense of how it is better than the htslib upon which it is apprently built?

ADD REPLY
5
Entering edit mode
11.9 years ago
Isaac Turner ▴ 50

I'm the author of seq_file. It's quite a new piece of code written for use in the variant caller cortex - I didn't expect to see it pop up here. Thanks Heng for that comparison - it's very useful. seq_file was written in C to provide:

  • file reading with file format auto-detection
  • reading FASTA/FASTQ/SAM/BAM formats and others as they come along
  • unbuffered reading a line at a time for use with interactive programs

Making these trade offs has lowered the reading speed (especially the last one). On top of which optimising for speed hasn't been a priority so far. Heng's htslib is used for SAM/BAM reading. The API supports reading a base/qual pair at a time (keeps memory usage low if a whole genome FASTA file is passed) or reading an entire entry at once. There is also support for writing in different output formats.

In Heng's example I'd change:

f = seq_file_open_filetype(argv[1], SEQ_FASTA);

to:

f = seq_file_open(argv[1]);

to support format auto-detecting and multi-format support. This was also the source of the bug you saw -- that function should have been removed from the API. I'm not sure I understand Heng's comment about zlib - I'd assumed it would link against whichever version of zlib was available so I assume he means he used zlib-1.2.3.

If this is something people would like to use, optimisations could certainly be made and I'd be happy to have input. If people are looking to simply read FASTA/Q quickly in C, I'd point them towards Heng's kseq library linked above.

Isaac

ADD COMMENT
2
Entering edit mode

Your library should work with zlib-1.1.4 through zlib-1.2.7, but you will get better performance when zlib-1.2.4+ is in use. zlib-1.2.3 implements inefficient gzgetc() and gzgets() due to the use of unbuffered I/O. THE major improvement in zlib-1.2.4 is to reimplement gz*() functions. When I implemented my parser in 2008, 1.5 years before the release of zlib-1.2.4, I first wrote a buffer to overcome the inefficient gzgetc(). This buffer guarantees good performance across all zlib versions.

ADD REPLY
4
Entering edit mode
11.9 years ago
lh3 33k

Benchmark on parsing human chromosome fasta (C source code at the end):

Data             Library    RealTime    UserTime    SysTime
-----------------------------------------------------------
humanChr-plain   seq_file   127         88          38
humanChr-gzip    seq_file   246         237         10
humanChr-plain   kseq       11          8           3
humanChr-gzip    kseq       31          29          2
humanChr-plain   wc         21          9           12
humanChr-gzip    zcat|wc    38          39          18

Seq_file seems to be slower than my single-header kseq library. I have not tried on short-read FASTQ. The results may be different. Also, seq_file is linked to zlib-1.2.3. Probably using zlib-1.2.5 wil lead to better performance because in the most widely used zlib-1.2.3, I/O is not buffered.

Source code based on seq_file. Somehow the program is buggy: it always misses the first sequence in the file.

#include <stdio.h>
#include <stdlib.h>
#include "seq_file.h"

int main(int argc, char *argv[])
{
    SeqFile *f;
    StrBuf *buf;
    if (argc == 1) return 1;
    f = seq_file_open_filetype(argv[1], SEQ_FASTA);
    buf = strbuf_new();
    while (seq_next_read(f)) {
        seq_read_all_bases(f, buf);
        printf("%s\t%ld\n", seq_get_read_name(f), strbuf_len(buf));
        strbuf_reset(buf);
    }
    seq_file_close(f);
    strbuf_free(buf);
    return 1;
}

Based on kseq:

#include <zlib.h>
#include <stdio.h>
#include "kseq.h"
KSEQ_INIT(gzFile, gzread)

int main(int argc, char *argv[])
{   
    gzFile fp = gzopen(argv[1], "r");
    kseq_t *ks = kseq_init(fp);
    if (argc == 1) return 1;
    while (kseq_read(ks) >= 0)
        printf("%s\t%ld\n", ks->name.s, ks->seq.l);
    kseq_destroy(ks);
    gzclose(fp);
    return 0;
}
ADD COMMENT

Login before adding your answer.

Traffic: 1665 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6