Data filtering by creating dictionaries in python
1
0
Entering edit mode
3.9 years ago
mdgn ▴ 10

Hello,

I am trying to do partitioning my plastid alignment using Arabidopsis.gff file as reference. I wanted to first make a position file which had assigned keys and values to each position in the AT reference sequence.

NC_000932.1 RefSeq  exon    527 551 .   +   .   ID=exon-ArthCt098-1;Parent=rna-ArthCt098;Dbxref=GeneID:1466263;gbkey=tRNA;gene=trnT
NC_000932.1 RefSeq  CDS 554 570 .   +   0   ID=cds-NP_051054.1;Parent=gene-ArthCp017;Dbxref=Genbank:NP_051054.1,GeneID:844775;Name=NP_051054.1;gbkey=CDS;gene=psbD
NC_000932.1 RefSeq  CDS 610 612 .   +   0   ID=cds-NP_051055.1;Parent=gene-ArthCp018;Dbxref=Genbank:NP_051055.1,GeneID:844773;Name=NP_051055.1;Note=CP43;gbkey=CDS;gene=psbC

I wanted to define a range of numbers (or count with while loop), and give each location in the sequence a value according to the reference. So the script should search for the intervals in 4th and 5th columns and between that interval assigning each number as a key and 3rd column for respective values.

My desired output is below;

    "1": "na"
    "2": "na"
    "3": "na"
    .
    .
    "526": "na"
    "527": "exon"
    "528": "exon"
    "529": "exon"
    .
    .
    .
    "550": "exon"
    "551": "exon"
    "552": "na"
    "553": "na"
    "554": "exon"
    "555": "exon"
    .
    .
    .
   "570":"exon"
   "571":"na"
   .
   .
   .
    "610": "na"
    "611": "CDS"
    "612": "CDS"
    .
    .
    .
    until the range ends.

I believe a while loop with a count is the correct way, but I am failing to end it correctly and it goes into an endless loop. I tried directly assigning the values as strings but I realize library is what I need because the data is too big to handle. I would be glad if you can show a way to solve this.

python alignment partitioning • 1.0k views
ADD COMMENT
1
Entering edit mode
3.9 years ago

I think this should work:

#!/usr/bin/env python

import sys

d = {}
previous_stop_idx = 1
for line in sys.stdin:
    elems = line.rstrip().split('\t')
    start_idx = int(elems[3])
    stop_idx = int(elems[4])
    feature = elems[2]
    for idx in range(previous_stop_idx, start_idx):
        d[idx] = 'na'
    for idx in range(start_idx, stop_idx + 1): # note: inclusive stop index
        d[idx] = feature
    previous_stop_idx = stop_idx + 1
print(d)

Example:

$ ./create_dict.py < annotations.gff
{1: 'na', 2: 'na', ..., 527: 'exon', 528: 'exon', ..., 611: 'CDS', 612: 'CDS'}

Note: This is very redundant, perhaps too much so. If the dictionary you are making ends up being too large to fit in memory, you could perhaps look at instead specifying the start point of a "contiguous block" of a feature category (e.g. na, exon, CDS, etc.). This approach takes out the redundancy of consecutive keys, and so would use a lot less memory. With this structure, you could do a binary search with bisect on the starting point, to get to a key that returns the desired feature value.

A solution of this kind is discussed here: https://stackoverflow.com/a/39358118/19410

To go with the morning coffee, I tried out a bisect-based implementation:

#!/usr/bin/env python

import sys
import bisect

table = {}
idxs = []
previous_stop_idx = 1
na_feature = 'na'
oob_feature = 'out_of_bounds'

'''
Read standard input into the "table" dictionary
and "idxs" list.
'''
for line in sys.stdin:
    elems = line.rstrip().split('\t')
    start_idx = int(elems[3])
    stop_idx = int(elems[4])
    feature = elems[2]
    if previous_stop_idx < start_idx:
        table[previous_stop_idx] = na_feature
        idxs.append(previous_stop_idx)
    table[start_idx] = feature
    idxs.append(start_idx)
    previous_stop_idx = stop_idx + 1
table[previous_stop_idx] = oob_feature
idxs.append(previous_stop_idx)

'''
Query table for a given index, using a binary search on 
the list "idxs".

This should return either 'na' or the desired feature, or 
raise a ValueError exception for out-of-bounds conditions.
'''
def query_table_by_index(query_idx):
    #
    # if query_idx is less than 1, then we are out of bounds
    #
    if query_idx < 1:
        raise ValueError("Out of bounds [{}]".format(query_idx))
    #
    # get the index of idxs that is to the left of query_idx
    #
    r = bisect.bisect(idxs, query_idx) - 1
    #
    # if we get the end index, then we are out of bounds ("oob_feature")
    #
    if r == len(idxs) - 1:
        raise ValueError("Out of bounds [{}]".format(query_idx))
    #
    # if we are at this point, we can return the desired feature
    #
    k = idxs[r]
    return table[k]

'''
Some sample queries
'''

#
# we expect this to fail
#
try:
    print(query_table_by_index(-1))
except ValueError as e:
    print(e)
    pass

assert(query_table_by_index(1) == 'na')
print(query_table_by_index(1))

assert(query_table_by_index(526) == 'na')
print(query_table_by_index(526))

assert(query_table_by_index(527) == 'exon')
print(query_table_by_index(527))

assert(query_table_by_index(528) == 'exon')
print(query_table_by_index(528))

assert(query_table_by_index(609) == 'na')
print(query_table_by_index(609))

assert(query_table_by_index(610) == 'CDS')
print(query_table_by_index(610))

#
# we expect this to fail
#
try:
    query_table_by_index(613)
except ValueError as e:
    print(e)
    pass

The output:

% ./create_dict_v2.py < annotations.gff
Out of bounds [-1]
na
na
exon
exon
na
CDS
Out of bounds [613]

This second approach will use much, much less memory than the first approach. If you have a large number of annotations, the second approach will also read in those annotations much more quickly, because Python will take less time when building a smaller dictionary.

If you have lots of queries to do, the difference between the two approaches will be in query speed.

Doing an index lookup on the first approach will be O(1) — constant time. That's the advantage of having a dictionary, if you can afford the memory overhead.

Doing an index lookup on the second approach will take O(log n) time to do the binary search, where n is roughly twice the number of features (features along with the na gaps in between). Lookups will be slightly slower, but the underlying data will use less memory.

ADD COMMENT
0
Entering edit mode

Thank you for your recommendation. However I am getting IndexError: list index out of range on start_idx = int(elems[3]) line. I didn't understand why its happening.

ADD REPLY
0
Entering edit mode

Run cat -te on your input file. It should look something like this:

% cat -te annotations.gff
NC_000932.1^IRefSeq^Iexon^I527^I551^I.^I+^I.^IID=exon-ArthCt098-1;Parent=rna-ArthCt098;Dbxref=GeneID:1466263;gbkey=tRNA;gene=trnT$
NC_000932.1^IRefSeq^ICDS^I554^I570^I.^I+^I0^IID=cds-NP_051054.1;Parent=gene-ArthCp017;Dbxref=Genbank:NP_051054.1,GeneID:844775;Name=NP_051054.1;gbkey=CDS;gene=psbD$
NC_000932.1^IRefSeq^ICDS^I610^I612^I.^I+^I0^IID=cds-NP_051055.1;Parent=gene-ArthCp018;Dbxref=Genbank:NP_051055.1,GeneID:844773;Name=NP_051055.1;Note=CP43;gbkey=CDS;gene=psbC$

There should be tab characters (^I) in between columns.

If you have spaces in between column values, instead of a tab, then you need to replace those spaces with one tab.

Also, I added a bisect-based script, which should run much faster on real-world input.

ADD REPLY
0
Entering edit mode

One more thought: If you're using GFF or BED or other similar file formats for genomic data, this approach will work on data for one chromosome or contig (i.e., the first column). It is easy to modify variables table etc. to handle this case. Mainly, the point here is to demonstrate reading in the data and querying it.

ADD REPLY

Login before adding your answer.

Traffic: 2348 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6