Tool: pytabix - quickly retrieve rows from a table of positions or regions

11.1 years ago

Kamil ★ 2.3k

This Python module allows fast random access to files compressed with bgzip and indexed by tabix. It includes a C extension with code from klib. The bgzip and tabix programs are available here.

Installation

https://github.com/slowkow/pytabix

$ pip install --user pytabix

or

$ wget https://pypi.python.org/packages/source/p/pytabix/pytabix-0.1.tar.gz
$ tar xf pytabix-0.1.tar.gz
$ cd pytabix-0.1
$ python setup.py install --user

Synopsis

Genomics data is often in a table where each row corresponds to a genomic region (start, end) or a position:

chrom  pos      snp
1      1000760  rs75316104
1      1000894  rs114006445
1      1000910  rs79750022
1      1001177  rs4970401
1      1001256  rs78650406

With tabix, you can quickly retrieve all rows in a genomic region by specifying a query with a sequence name, start, and end:

import tabix

# Open a remote or local file.
url = "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/"
url += "ALL.2of4intersection.20100804.genotypes.vcf.gz"

tb = tabix.open(url)

# These queries are identical. A query returns an iterator over the results.
records = tb.query("1", 1000000, 1250000)
records = tb.queryi(0, 1000000, 1250000)
records = tb.querys("1:1000000-1250000")

# Each record is a list of strings.
for record in records:
    print(record[:3])

['1', '1000760', 'rs75316104']
['1', '1000894', 'rs114006445']
['1', '1000910', 'rs79750022']
['1', '1001177', 'rs4970401']
['1', '1001256', 'rs78650406']
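
Because each record comes back as a list of strings, numeric columns need converting before any arithmetic. Here is a minimal sketch, assuming the three-column layout (chrom, pos, snp) shown above. `SnpRecord` and `parse_snp_records` are hypothetical names, not part of pytabix, and the rows are hard-coded stand-ins for what `tb.query(...)` would yield:

```python
from collections import namedtuple

# Hypothetical container for the three columns shown above (not part of pytabix).
SnpRecord = namedtuple("SnpRecord", ["chrom", "pos", "snp"])


def parse_snp_records(records):
    """Yield SnpRecord tuples from an iterable of string lists."""
    for fields in records:
        chrom, pos, snp = fields[:3]
        # Positions arrive as strings; convert them to integers.
        yield SnpRecord(chrom, int(pos), snp)


# Stand-in rows copied from the example output; in real use, pass the
# iterator returned by tb.query(...) instead.
rows = [
    ["1", "1000760", "rs75316104"],
    ["1", "1000894", "rs114006445"],
    ["1", "1000910", "rs79750022"],
]
snps = list(parse_snp_records(rows))
print(snps[0])
```

Wrapping the parser in `list(...)` also keeps the results around, which helps because a query iterator can only be walked once.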

Example

Let's say you have a table of gene coordinates:

$ zcat example.bed.gz | shuf | head -n5 | column -t
chr19  53611131   53636172   55786   ZNF415
chr10  72149121   72150375   221017  CEP57L1P1
chr4   185009858  185139113  133121  ENPP6
chrX   132669772  133119672  2719    GPC3
chr6   134924279  134925376  114182  FAM8A6P

Sort it by chromosome, then by start and end positions. Then, use bgzip to deflate the file into compressed blocks:

$ zcat example.bed.gz | sort -k1,1V -k2,2n -k3,3n | bgzip > example.bed.bgz
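
If the rows are already in Python, the same ordering can be approximated without the shell. This is only a sketch: `version_key` and `sort_bed_rows` are hypothetical helpers, and the key imitates version sort for simple names such as `chr4` and `chr10`, without claiming to match GNU `sort -V` in every case.

```python
import re


def version_key(name):
    """Split a name into text and integer parts so 'chr4' sorts before 'chr10'."""
    parts = re.split(r"(\d+)", name)
    # With a capturing group, odd positions always hold the digit runs.
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


def sort_bed_rows(rows):
    """Sort rows by chromosome (version order), then by start, then by end."""
    return sorted(rows, key=lambda r: (version_key(r[0]), int(r[1]), int(r[2])))


# The five rows from the example table above.
rows = [
    ["chr19", "53611131", "53636172", "55786", "ZNF415"],
    ["chr10", "72149121", "72150375", "221017", "CEP57L1P1"],
    ["chr4", "185009858", "185139113", "133121", "ENPP6"],
    ["chrX", "132669772", "133119672", "2719", "GPC3"],
    ["chr6", "134924279", "134925376", "114182", "FAM8A6P"],
]
sorted_rows = sort_bed_rows(rows)
for row in sorted_rows:
    print("\t".join(row))
```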

The compressed size is usually slightly larger than that obtained with gzip.

Index the file with tabix:

$ tabix -s 1 -b 2 -e 3 example.bed.bgz
$ ls
example.bed.gz  example.bed.bgz  example.bed.bgz.tbi
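
Once the index exists, the file can be queried from Python in the same way as the remote VCF above. This is a rough sketch rather than tested usage: `parse_bed_row` and `genes_in_region` are hypothetical helpers, the meaning of the fourth and fifth columns (gene ID and symbol) is an assumption based on the sample rows, and the commented-out call assumes pytabix is installed and `example.bed.bgz` with its `.tbi` index is in the working directory.

```python
def parse_bed_row(fields):
    """Convert one record (a list of strings) into a dict with integer coordinates."""
    chrom, start, end, gene_id, symbol = fields[:5]
    return {
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "gene_id": gene_id,  # assumed meaning of column 4
        "symbol": symbol,    # assumed meaning of column 5
    }


def genes_in_region(path, chrom, start, end):
    """Return parsed rows overlapping a region of a tabix-indexed file."""
    # Imported here so parse_bed_row can be used without pytabix installed.
    import tabix

    tb = tabix.open(path)
    return [parse_bed_row(record) for record in tb.query(chrom, start, end)]


# Stand-in row copied from the example table.
row = ["chr19", "53611131", "53636172", "55786", "ZNF415"]
parsed = parse_bed_row(row)
print(parsed)

# With pytabix installed and the indexed file present:
# genes = genes_in_region("example.bed.bgz", "chr19", 53600000, 53700000)
```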

updated 22 months ago by Ram 45k • written 11.1 years ago by Kamil ★ 2.3k

What's the difference from the Python code provided with tabix? https://github.com/samtools/tabix/blob/master/tabix.py

updated 5.4 years ago by Ram 45k • written 11.1 years ago by Pierre Lindenbaum 166k

Thanks, Pierre. I had not seen that Python code. It is important for tabix to be more visible and easy for newcomers to install. I packaged the code in tabixmodule.c so that you can install it with a single command, and I filled in the missing documentation.

updated 5.4 years ago by Ram 45k • written 11.1 years ago by Kamil ★ 2.3k