This Python module allows fast random access to files compressed with bgzip and indexed by tabix. It includes a C extension with code from klib. The bgzip and tabix programs are available here.
Installation
https://github.com/slowkow/pytabix
$ pip install --user pytabix
or
$ wget https://pypi.python.org/packages/source/p/pytabix/pytabix-0.1.tar.gz
$ tar xf pytabix-0.1.tar.gz
$ cd pytabix-0.1
$ python setup.py install --user
Synopsis
Genomics data is often in a table where each row corresponds to a genomic region (start, end) or a position:
chrom pos snp
1 1000760 rs75316104
1 1000894 rs114006445
1 1000910 rs79750022
1 1001177 rs4970401
1 1001256 rs78650406
With tabix, you can quickly retrieve all rows in a genomic region by specifying a query with a sequence name, start, and end:
import tabix
# Open a remote or local file.
url = "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/"
url += "ALL.2of4intersection.20100804.genotypes.vcf.gz"
tb = tabix.open(url)
# These queries are identical. A query returns an iterator over the results.
records = tb.query("1", 1000000, 1250000)
records = tb.queryi(0, 1000000, 1250000)
records = tb.querys("1:1000000-1250000")
# Each record is a list of strings.
for record in records:
print record[:5]
['1', '1000760', 'rs75316104']
['1', '1000760', 'rs75316104']
['1', '1000894', 'rs114006445']
['1', '1000910', 'rs79750022']
['1', '1001177', 'rs4970401']
['1', '1001256', 'rs78650406']
Example
Let's say you have a table of gene coordinates:
$ zcat example.bed.gz | shuf | head -n5 | column -t
chr19 53611131 53636172 55786 ZNF415
chr10 72149121 72150375 221017 CEP57L1P1
chr4 185009858 185139113 133121 ENPP6
chrX 132669772 133119672 2719 GPC3
chr6 134924279 134925376 114182 FAM8A6P
Sort it by chromosome, then by start and end positions. Then, use bgzip to deflate the file into compressed blocks:
$ zcat example.bed.gz | sort -k1V -k2n -k3n | bgzip > example.bed.bgz
The compressed size is usually slightly larger than that obtained with gzip.
Index the file with tabix:
$ tabix -s 1 -b 2 -e 3 example.bed.gz
$ ls
example.bed.gz example.bed.bgz example.bed.bgz.tbi
what's the difference with the python code provided by tabix? https://github.com/samtools/tabix/blob/master/tabix.py
Thanks, Pierre: I did not see that Python code. It is important that tabix is more visible and easy for newcomers to install. The
tabixmodule.c
is the code I packaged, so you can install it with a single command. I also filled in missing documentation.