Can I force tabix to index on a single column? I have a really large set of data (nearly 10 terabytes) that I want random access to. A bgzipped, sorted, indexed format seems to be a good way to do this, but my data has the lookup key as a single column and tabix indexes using two columns (or fields). Here is a really simple example of what I mean. My data is already sorted by the first field and stored in a bgzip file in this tab-delimited format...
control1 FFFFFFFFF
exemplar12 FF55FFFFF
exemplar13 FFFFFFFFF
sample14 FFFFFFFFF
The first column is the field I want to index by. It seems like tabix needs at least two columns: the first is a string (chromosome) and the second is an integer (position). I found that I can do this by splitting the first column and representing the second half as an integer, like this...
control 1 FFFFFFFFF
exemplar 12 FF55FFFFF
exemplar 13 FFFFFFFFF
sample 14 FFFFFFFFF
Then it works. However, due to upstream processes (which I cannot elaborate on here), I need to keep the lookup column as a single column. Can tabix be forced to use only one field to index?
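For reference, the split described above can be sketched roughly as follows. This is only an illustration, assuming every key ends in a run of digits and that fields are tab-separated; the names split_key and split_line are hypothetical and not part of tabix.

```python
import re

# Assumption: each lookup key is some text followed by trailing digits,
# e.g. "exemplar12". Keys without trailing digits are rejected.
_KEY_RE = re.compile(r"^(.*?)(\d+)$")


def split_key(key):
    """Split 'exemplar12' into ('exemplar', 12)."""
    match = _KEY_RE.match(key)
    if match is None:
        raise ValueError(f"key has no trailing digits: {key!r}")
    # Note: int() drops leading zeros, so 'a007' becomes ('a', 7).
    return match.group(1), int(match.group(2))


def split_line(line):
    """Rewrite one tab-delimited line so the key occupies two columns."""
    key, rest = line.rstrip("\n").split("\t", 1)
    name, number = split_key(key)
    return f"{name}\t{number}\t{rest}"


if __name__ == "__main__":
    for raw in ["control1\tFFFFFFFFF", "exemplar12\tFF55FFFFF"]:
        print(split_line(raw))
```

The rewritten file could then be compressed with bgzip and indexed with something like tabix -s 1 -b 2 -e 2, which treats column 1 as the sequence name and column 2 as both start and end. It still requires the two-column layout, so it does not answer the single-column question.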
The data is already sorted alphabetically; I'll update my question to reflect this. Also, it is a really large set of data that I want random access to by a lookup key. I want to use tabix to index the already-sorted data by the first field only, not two fields.
Use sort -n, which will consider only the numbers.
The data is already sorted. Sorting again does not help.
I see, you need the index. Unfortunately, as you said, tabix expects one column for the chromosome name and a second one for the coordinates.
Yup. Sorry for the confusing post. I updated it to be more descriptive. Yes, I'm looking to index a large set of data so that I have random access to it. Bgzip and tabix seem to be good options for this, but tabix seems to have two fields as a default requirement :(