Question

Best Way To Get A Unique List Of Items Using Two Dicts...

14.7 years ago

Steve Moss 2.3k

I have two dicts: one with start coordinates and the other with end coordinates, paired by an incrementing integer key, e.g. starts[0] = 1000 and ends[0] = 2000. I also have a third dict that holds the corresponding id, e.g. ids[0] = "EXON1".

I want to get the unique values from each of the two dicts (i.e. every value that appears at least once; values such as 1000 or 2000 may be duplicated) and have used the following to do so (benchmarks I have seen showed it to be the fastest method):

unique_starts = {}.fromkeys(exon_seq_region_starts.values()).keys()
unique_ends = {}.fromkeys(exon_seq_region_ends.values()).keys()

I then need to iterate through the smaller of these lists and, for each unique value, get all the matching values in the larger list (e.g. ends at 2000 may have starts at 900, 950 and 1000). I need to retrieve the smallest value (900 in this example) and its key, so that I can look it up in the third dict of exon ids.

Any ideas on the best way?

exon python map sort • 3.4k views

ADD COMMENT • link updated 14.1 years ago by Alex Reynolds 36k • written 14.7 years ago by Steve Moss 2.3k

@Gawbul Then why don't you create an array of Exon(start, end) objects ordered by start/end, and scan from 5' to 3' for overlapping exons?
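A minimal sketch of that kind of scan, assuming closed coordinates on a single strand and sequence (so two exons overlap when the next start is less than or equal to the running end). The `Exon` tuple and the `group_overlapping` helper are hypothetical names used only for illustration:

```python
from collections import namedtuple

Exon = namedtuple("Exon", "start end exon_id")

def group_overlapping(exons):
    """Sort exons by start, then sweep once and collect overlapping runs."""
    clusters = []
    current = []
    current_end = None
    for exon in sorted(exons, key=lambda e: (e.start, e.end)):
        if current and exon.start <= current_end:
            # Overlaps the running cluster: extend it.
            current.append(exon)
            current_end = max(current_end, exon.end)
        else:
            # No overlap: close the previous cluster and start a new one.
            if current:
                clusters.append(current)
            current = [exon]
            current_end = exon.end
    if current:
        clusters.append(current)
    return clusters

exons = [
    Exon(1000, 2000, "EXON1"),
    Exon(900, 2000, "EXON2"),
    Exon(950, 2000, "EXON3"),
    Exon(2500, 3000, "EXON4"),
]
clusters = group_overlapping(exons)
for cluster in clusters:
    print([e.exon_id for e in cluster])
```

Each cluster is a run of mutually chained overlaps; which exon to keep from each cluster (smallest start, longest span, and so on) is a separate decision.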

ADD REPLY • link 14.7 years ago by Pierre Lindenbaum 165k

Any chance you could tell us what you are trying to achieve, instead of asking how to implement the method you think is best?

ADD REPLY • link 14.7 years ago by Casbon ★ 3.3k

Hi Casbon, I have a list of exon start and end coordinates retrieved from Ensembl, but they aren't unique, in that some sequences overlap. I need to retrieve the exon ids and coordinates for the unique exons.

ADD REPLY • link 14.7 years ago by Steve Moss 2.3k

Ram · Answer 1 · 2010-08-06

I'll take your word for it. However, I do think there is a better way. Check out bedtools.

Let me know if this gives you the desired output.

from collections import namedtuple
from itertools import groupby

Bed = namedtuple("Bed", "seqid start end exonid")

def get_nonredundant_exons(beds):
    """
    >>> beds = []
    >>> beds.append(Bed("chr1", 100, 200, "e1"))
    >>> beds.append(Bed("chr1", 59, 200, "e2"))
    >>> beds.append(Bed("chr1", 59, 300, "e3"))
    >>> print(list(get_nonredundant_exons(beds)))
    [Bed(seqid='chr1', start=59, end=200, exonid='e2'), Bed(seqid='chr1', start=59, end=300, exonid='e3')]
    """
    beds.sort(key=lambda x:(x.seqid, x.end))
    for end, beds_with_same_end in groupby(beds, key=lambda x:(x.seqid, x.end)):
        yield min(beds_with_same_end, key=lambda x: x.start)

score 2 · Answer 2 · 2010-08-06

14.7 years ago

Pierre Lindenbaum 165k

It sounds like a bsearch/equal_range/lower_bound/upper_bound search. See Google Code for an implementation, for example: http://www.sgi.com/tech/stl/stl_tree.h
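In Python, the standard library's bisect module offers the same idea: bisect_left and bisect_right play the roles of lower_bound and upper_bound on a sorted list. A minimal sketch follows, assuming the exons are stored as (end, start, exon_id) tuples; that layout and the `equal_range` helper are illustrative choices, not something from the linked header:

```python
from bisect import bisect_left, bisect_right

def equal_range(sorted_values, target):
    """Return (lo, hi) such that sorted_values[lo:hi] are all equal to target."""
    lo = bisect_left(sorted_values, target)   # lower_bound
    hi = bisect_right(sorted_values, target)  # upper_bound
    return lo, hi

# Sorting tuples orders them by end first, then by start.
records = sorted([
    (2000, 1000, "EXON1"),
    (2000, 900, "EXON2"),
    (2000, 950, "EXON3"),
    (3000, 1500, "EXON4"),
])
end_values = [record[0] for record in records]

lo, hi = equal_range(end_values, 2000)
matches = records[lo:hi]
# Within one end value the records are ordered by start, so the first is the smallest.
smallest = matches[0] if matches else None
print(smallest)
```

Each lookup is logarithmic in the number of records once the list is sorted, which is the main reason to prefer it over scanning the whole list for every unique end.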

ADD COMMENT • link 14.7 years ago by Pierre Lindenbaum 165k

Thanks Pierre! I'll have to have a good read up on this!

ADD REPLY • link 14.7 years ago by Steve Moss 2.3k

Just changed the last URL.

ADD REPLY • link 14.7 years ago by Pierre Lindenbaum 165k

Yeah, I bookmarked them both, thanks :-)

ADD REPLY • link 14.7 years ago by Steve Moss 2.3k

Ram · Answer 3 · 2012-06-17

You might also take a look at the bedmap tool in the BEDOPS suite. bedmap lets you quickly find the elements of a map BED file (say, a list of exons) that overlap a reference BED file (say, a coordinate range that bounds a subset of exons). You can apply operators to the mapped elements, including one that echoes a nicely formatted list of their IDs (such as those that identify exons).