Question

How To Find Clusters Of Small Rna With Interval-Trees, Bx-Python

1

Entering edit mode

12.2 years ago

Click downvote ▴ 720

I am trying to find clusters of small RNA that are located adjacent to each other in the genome. Someone suggested I try bx-python and interval trees to do this. So, I did just that, thinking it would be self explanatory. I mapped thousands of short reads to a genome using Bowtie. Then I put these mappings into an interval tree in bx-python. But I'm not sure how to use/query the interval-tree datastructure to find what I want, namely non-overlapping, immediately adjacent reads.

I am looking for clusters of small RNA. For the reads to be considered in a cluster in the genome, one read has to start in the position x+1 if the other read ends in position x. How can I do this with bx-python and interval trees?

The only function I have found interval trees to have is interval_tree.find(261000,262000) which returns all reads in that interval, like

['r5947', '-', 'chr15', 261406, 261427, 'GGTGCCTACAGGTGCCTACAG', 'IIIIIIIIIIIIIIIIIIIII', '0']
['r6365', '-', 'chr15', 261407, 261426, 'GTGCCTACAGGTGCCTACA', 'IIIIIIIIIIIIIIIIIII', '0']
['r39867', '-', 'chr15', 261407, 261427, 'GTGCCTACAGGTGCCTACAG', 'IIIIIIIIIIIIIIIIIIII', '0']

261406 denotes the start position of read5947 and 261427 the end. So it is possible to check if some reads are adjacent by first querying the interval tree and then checking whether the reads in the result set are adjacent. But is there another, more straightforward way to do it?

bx-python interval-tree • 5.0k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 12.2 years ago by Click downvote ▴ 720

score 3 · Answer 1 · 2012-10-17

3

Entering edit mode

12.2 years ago

Brad Chapman 9.7k

You want to use ClusterTree from bx-python intervals, which handles the clustering for you:

https://bcbio.wordpress.com/2009/04/29/finding-and-displaying-short-reads-clustered-in-the-genome/

The cluster_distance parameter determines how close reads need to be to place them into a single cluster. Hope this helps.

ADD COMMENT • link 12.2 years ago by Brad Chapman 9.7k

0

Entering edit mode

Thanks, all answers appreciated. Looking at it I suspect the guy meant cluster trees and not interval trees. Btw: any more info on the cluster tree datastructure anywhere? Google returns little and I'm not certain it's about the same thing.

ADD REPLY • link 12.2 years ago by Click downvote ▴ 720

0

Entering edit mode

If you just want reads that are adjacent to each other without overlapping, then I don't think a clustertree will be correct actually. A clustertree will give you clusters of reads that overlap by X distance. You'll just get a cluster of all the reads on a reference contig if you set X to 0.

ADD REPLY • link 12.2 years ago by Damian Kao 16k

0

Entering edit mode

@Click: It's a self-balancing binary tree underneath. The code is here: https://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/intervals/cluster.pyx @Damian: The implementation clusters intervals within X distance of each other, not overlaps of X distance so I think does what Click (okay, that name is terrible) wants. X=0 would give you only clusters of reads that overlap.

ADD REPLY • link 12.2 years ago by Brad Chapman 9.7k

0

Entering edit mode

ahh I see. It's within X bp of each other. Good to know.

ADD REPLY • link 12.2 years ago by Damian Kao 16k

0

Entering edit mode

This seems to return reads that begin one position after the other ends; I specified (in a convoluted way, probably) that "For the reads to be considered in a cluster in the genome, one read has to start in the position x+1 if the other read ends in position x". Using "ClusterTree(0, 2)" seems to return reads that start at almost the same place. E.g. it returns clusters like "r5, r42" where r5 starts a 4929 and r42 starts at 4930. I want clusters where r5 ends at 4929 and r42 starts a 4930. Am I using cluster trees wrongly or what?

ADD REPLY • link 12.2 years ago by Click downvote ▴ 720

0

Entering edit mode

Yes, it will also include reads that overlap into the clusters. It's not clear to me what you want to do with those. If you have three reads: A: 1-10, B: 2-15, C: 11-20, you want A,C in a cluster but B as a separate cluster? If so, you'd need to add a second step where you separate out the overlaps in some way. Since you'll have smaller, discrete groups you should be able to implement your custom logic on those.

ADD REPLY • link 12.2 years ago by Brad Chapman 9.7k

0

Entering edit mode

I only want A and C. Will have to think about this. Thanks.

ADD REPLY • link 12.2 years ago by Click downvote ▴ 720

0

Entering edit mode

The script I wrote will do what you want. However, it will not guarantee to give you the biggest clusters. There are going to be cases where a read is adjacent to multiple other clusters. Which cluster that read will belong to will be determine by order of start coordinates in my script.

If you think about it, there are many variations to this adjacent read clustering. Do you want the variation where you maximize the biggest cluster or do you want a variation where the average cluster size is maximized...etc.

ADD REPLY • link 12.2 years ago by Damian Kao 16k

Ram · Answer 2 · 2012-10-17

If you just want adjacent reads, using an interval tree might be an overkill.

For thousands of mappings, you can probably just sort your read coordinates by start coordinate and scan through the coordinates to find your clusters. If you have millions, then I would probably not try this solution.

Let's say your data is formatted as a dictionary where key is a chromosome and value is an array of tuple with 3 elements, read id, start coordinate, end coordinate. And each array of coordinates are sorted by start coordinate.

You can use something like this:

data = {'chr1':[('read0',5,10),('read1',11,20),('read2',60,80)],'chr2':[('read3',10,20),('read4',21,50),('read5',51,230),('read6',232,150)]}

for ref, coords in data.items():
    clustered = []
    clusters = []
    for coord in coords:
        if not coord in clustered:
            used = False
            for cluster in clusters:
                if coord[1] == cluster[-1][-1] + 1:
                    cluster.append(coord)
                    clustered.append(coord)
                    used = True
            if not used:
                clusters.append([coord])

    print ref
    for cluster in clusters:
        print cluster

This script outputs:

[('read3', 10, 20), ('read4', 21, 50), ('read5', 51, 230)]
[('read6', 232, 150)]
chr1
[('read0', 5, 10), ('read1', 11, 20)]
[('read2', 60, 80)]

I am not sure how you want to handle multiple reads mapping to the same coordinates or reads that are adjacent to multiple reads.

Ram · Answer 3 · 2015-11-02

0

Entering edit mode

9.1 years ago

xieshaojun0621 ▴ 210

You should use mergeBed in the BedTools. It's an easy way.

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 9.1 years ago by xieshaojun0621 ▴ 210