Force tabix to index on a single column
3.4 years ago
b10hazard ▴ 30

Can I force tabix to index on a single column? I have a really large data set (nearly 10 terabytes) that I want random access to. A bgzipped, sorted, indexed format seems to be a good way to do this, but my data has the lookup key in a single column, and tabix indexes using two columns (or fields). Here is a really simple example of what I mean. My data is already sorted by the first field and stored in a bgzip file in this tab-delimited format...

control1 FFFFFFFFF
exemplar12 FF55FFFFF
exemplar13 FFFFFFFFF
sample14 FFFFFFFFF

The first column is the field I want to index by. It seems like tabix needs at least two columns: the first is a string (chromosome) and the second is an integer (position). I found that I can satisfy this by splitting the first column and representing the second half as an integer, like this...

control 1 FFFFFFFFF
exemplar 12 FF55FFFFF
exemplar 13 FFFFFFFFF
sample 14 FFFFFFFFF

Then it works. However, due to upstream processes (which I cannot elaborate on here), I need to keep the lookup column as a single column. Can tabix be forced to use only one field to index?
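For illustration, the split described above can be sketched in Python. This is a minimal sketch, not part of the original post: the helper names `split_key`, `to_two_column_row`, and `to_region` are hypothetical, and it assumes every key ends in an integer preceded by at least one non-digit character, as in the example rows.

```python
import re

# Assumption: a key is some text ending in a non-digit, followed by an
# integer suffix (e.g. "exemplar12"). Keys without that shape are rejected.
KEY_PATTERN = re.compile(r"^(.*\D)(\d+)$")


def split_key(key):
    """Split a single-column key such as 'exemplar12' into ('exemplar', 12)."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"key has no name followed by an integer: {key!r}")
    return match.group(1), int(match.group(2))


def to_two_column_row(line):
    """Rewrite 'key<TAB>value' as 'name<TAB>number<TAB>value'."""
    key, value = line.rstrip("\n").split("\t", 1)
    name, number = split_key(key)
    return f"{name}\t{number}\t{value}"


def to_region(key):
    """Build a 'name:begin-end' region string covering exactly one key."""
    name, number = split_key(key)
    return f"{name}:{number}-{number}"


if __name__ == "__main__":
    print(to_two_column_row("exemplar12\tFF55FFFFF\n"))
    print(to_region("exemplar12"))
```

If the converted file were bgzipped and indexed with `tabix -s 1 -b 2 -e 2`, a string from `to_region` would be the kind of region argument passed to tabix for a lookup. This only automates the workaround; it does not remove the two-column requirement.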

tabix • 1.3k views
3.4 years ago
JC 13k

You don't need tabix for that; use the sort command.


The data is already sorted alphabetically; I'll update my question to reflect this. Also, it is a really large data set that I want random access to by a lookup key. I want to use tabix to index the already-sorted data by the first field only, not two fields.


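Use sort -n. Note that -n compares only a leading numeric prefix of each line, so keys that begin with letters, as here, tie and fall back to a plain comparison of the whole line.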

example:

$ cat in
control1 FFFFFFFFF
sample14 FFFFFFFFF
exemplar13 FFFFFFFFF
exemplar12 FF55FFFFF
$ sort -n in 
control1 FFFFFFFFF
exemplar12 FF55FFFFF
exemplar13 FFFFFFFFF
sample14 FFFFFFFFF

The data is already sorted. Sorting again does not help.


I see, you need the index. Unfortunately, as you said, tabix expects one column for the chromosome name and a second one for the coordinates.


Yup. Sorry for the confusing post. I updated it to be more descriptive. Yes, I'm looking to index a large set of data so that I have random access to it. Bgzip and tabix seem to be good options for this, but tabix seems to have two fields as a default requirement :(
