I think this should work:
#!/usr/bin/env python
import sys

d = {}
previous_stop_idx = 1
for line in sys.stdin:
    elems = line.rstrip().split('\t')
    start_idx = int(elems[3])   # GFF column 4: 1-based start position
    stop_idx = int(elems[4])    # GFF column 5: 1-based, inclusive stop position
    feature = elems[2]          # GFF column 3: feature type
    # fill any gap between the previous feature and this one with 'na'
    for idx in range(previous_stop_idx, start_idx):
        d[idx] = 'na'
    for idx in range(start_idx, stop_idx + 1):  # note: inclusive stop index
        d[idx] = feature
    previous_stop_idx = stop_idx + 1
print(d)
Example:
$ ./create_dict.py < annotations.gff
{1: 'na', 2: 'na', ..., 527: 'exon', 528: 'exon', ..., 611: 'CDS', 612: 'CDS'}
Note: This is very redundant, perhaps too much so. If the dictionary you are making ends up being too large to fit in memory, you could instead store just the start point of each "contiguous block" of a feature category (e.g. na, exon, CDS, etc.). This approach takes out the redundancy of consecutive keys, and so would use a lot less memory. With this structure, you could do a binary search with bisect on the starting points, to get to a key that returns the desired feature value.
A solution of this kind is discussed here: https://stackoverflow.com/a/39358118/19410
To go with the morning coffee, I tried out a bisect-based implementation:
#!/usr/bin/env python
import sys
import bisect

table = {}
idxs = []
previous_stop_idx = 1
na_feature = 'na'
oob_feature = 'out_of_bounds'

'''
Read standard input into the "table" dictionary
and "idxs" list.
'''
for line in sys.stdin:
    elems = line.rstrip().split('\t')
    start_idx = int(elems[3])
    stop_idx = int(elems[4])
    feature = elems[2]
    # if there is a gap before this feature, mark the gap's start as 'na'
    if previous_stop_idx < start_idx:
        table[previous_stop_idx] = na_feature
        idxs.append(previous_stop_idx)
    # mark the start of this feature's block
    table[start_idx] = feature
    idxs.append(start_idx)
    previous_stop_idx = stop_idx + 1

# sentinel marking one position past the last feature
table[previous_stop_idx] = oob_feature
idxs.append(previous_stop_idx)
'''
Query the table for a given index, using a binary search on
the list "idxs".
This should return either 'na' or the desired feature, or
raise a ValueError exception for out-of-bounds conditions.
'''
def query_table_by_index(query_idx):
    #
    # if query_idx is less than 1, then we are out of bounds
    #
    if query_idx < 1:
        raise ValueError("Out of bounds [{}]".format(query_idx))
    #
    # find the position in idxs of the block start at or to the left of query_idx
    #
    r = bisect.bisect(idxs, query_idx) - 1
    #
    # if we land on the final (sentinel) index, then we are out of bounds ("oob_feature")
    #
    if r == len(idxs) - 1:
        raise ValueError("Out of bounds [{}]".format(query_idx))
    #
    # otherwise, return the feature for the block containing query_idx
    #
    k = idxs[r]
    return table[k]
'''
Some sample queries
'''
#
# we expect this to fail
#
try:
    print(query_table_by_index(-1))
except ValueError as e:
    print(e)

assert(query_table_by_index(1) == 'na')
print(query_table_by_index(1))
assert(query_table_by_index(526) == 'na')
print(query_table_by_index(526))
assert(query_table_by_index(527) == 'exon')
print(query_table_by_index(527))
assert(query_table_by_index(528) == 'exon')
print(query_table_by_index(528))
assert(query_table_by_index(609) == 'na')
print(query_table_by_index(609))
assert(query_table_by_index(610) == 'CDS')
print(query_table_by_index(610))

#
# we expect this to fail
#
try:
    query_table_by_index(613)
except ValueError as e:
    print(e)
The output:
% ./create_dict_v2.py < annotations.gff
Out of bounds [-1]
na
na
exon
exon
na
CDS
Out of bounds [613]
This second approach will use much, much less memory than the first approach. If you have a large number of annotations, the second approach will also read them in much more quickly, because Python takes less time to build a smaller dictionary.
If you have lots of queries to do, the difference between the two approaches will be in query speed.
An index lookup with the first approach is O(1), i.e. constant time. That's the advantage of having a dictionary, if you can afford the memory overhead.
An index lookup with the second approach takes O(log n) time for the binary search, where n is roughly twice the number of features (the features along with the na gaps in between). Lookups will be slightly slower, but the underlying data will use less memory.
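If you want to see that trade-off concretely, here is a small sketch (not part of the scripts above; the block size and the names full, starts, and blocks are invented for illustration) that times a per-base dictionary lookup against a bisect lookup over block starts:
#!/usr/bin/env python
import bisect
import timeit

# approach 1: one dictionary key per base position (memory-hungry)
full = {i: 'exon' for i in range(1, 1000001)}
# approach 2: only the start of each contiguous block (here, artificial 1000-base blocks)
starts = list(range(1, 1000001, 1000))
blocks = {s: 'exon' for s in starts}

def lookup_full(i):
    return full[i]                                        # O(1) hash lookup

def lookup_blocks(i):
    return blocks[starts[bisect.bisect(starts, i) - 1]]   # O(log n) binary search

print(timeit.timeit(lambda: lookup_full(654321), number=100000))
print(timeit.timeit(lambda: lookup_blocks(654321), number=100000))
Both lookups are fast in absolute terms; as noted above, the bigger win of the second layout is memory.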
Thank you for your recommendation. However, I am getting "IndexError: list index out of range" on the start_idx = int(elems[3]) line. I don't understand why it's happening.
Run cat -te on your input file. There should be tab characters (rendered as ^I) in between the columns, and it should look something like the example below. If you have spaces in between column values instead of a tab, you need to replace those spaces with a single tab.
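For instance, a well-formed GFF line (the coordinates and names here are invented for illustration) renders under cat -te roughly like this, with ^I for each tab and $ marking the end of each line:
$ cat -te annotations.gff
chr1^Iexample^Iexon^I527^I608^I.^I+^I.^I.$
chr1^Iexample^ICDS^I610^I612^I.^I+^I.^I.$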
Also, I added a bisect-based script above, which should run much faster on real-world input.
One more thought: if you're using GFF, BED, or other similar genomic file formats, this approach works on data for one chromosome or contig (i.e., a single value in the first column). It is easy to modify the table and idxs variables, etc., to handle the multi-chromosome case; the main point here is to demonstrate reading in the data and querying it.
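As a rough sketch of that modification (this is not part of the scripts above; names such as tables, idxs_by_chrom, and previous_stop are mine, and it assumes features are sorted by start and non-overlapping within each chromosome), you could key the lookup structures on the first GFF column:
#!/usr/bin/env python
import sys
import bisect
import collections

# one (table, idxs) pair per chromosome/contig, keyed on GFF column 1
tables = collections.defaultdict(dict)
idxs_by_chrom = collections.defaultdict(list)
previous_stop = collections.defaultdict(lambda: 1)

for line in sys.stdin:
    elems = line.rstrip().split('\t')
    chrom = elems[0]
    start_idx = int(elems[3])
    stop_idx = int(elems[4])
    feature = elems[2]
    # mark any gap before this feature as 'na'
    if previous_stop[chrom] < start_idx:
        tables[chrom][previous_stop[chrom]] = 'na'
        idxs_by_chrom[chrom].append(previous_stop[chrom])
    # mark the start of this feature's block
    tables[chrom][start_idx] = feature
    idxs_by_chrom[chrom].append(start_idx)
    previous_stop[chrom] = stop_idx + 1

def query_table(chrom, query_idx):
    idxs = idxs_by_chrom[chrom]
    # out of bounds if below 1, past the last feature, or on an unknown chromosome
    if query_idx < 1 or not idxs or query_idx >= previous_stop[chrom]:
        raise ValueError("Out of bounds [{}:{}]".format(chrom, query_idx))
    r = bisect.bisect(idxs, query_idx) - 1
    return tables[chrom][idxs[r]]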