I am trying to find clusters of small RNA that are located adjacent to each other in the genome. Someone suggested I try bx-python and interval trees to do this. So, I did just that, thinking it would be self explanatory. I mapped thousands of short reads to a genome using Bowtie. Then I put these mappings into an interval tree in bx-python. But I'm not sure how to use/query the interval-tree datastructure to find what I want, namely non-overlapping, immediately adjacent reads.
I am looking for clusters of small RNA. For the reads to be considered in a cluster in the genome, one read has to start in the position x+1 if the other read ends in position x. How can I do this with bx-python and interval trees?
The only function I have found interval trees to have is interval_tree.find(261000,262000) which returns all reads in that interval, like
['r5947', '-', 'chr15', 261406, 261427, 'GGTGCCTACAGGTGCCTACAG', 'IIIIIIIIIIIIIIIIIIIII', '0']
['r6365', '-', 'chr15', 261407, 261426, 'GTGCCTACAGGTGCCTACA', 'IIIIIIIIIIIIIIIIIII', '0']
['r39867', '-', 'chr15', 261407, 261427, 'GTGCCTACAGGTGCCTACAG', 'IIIIIIIIIIIIIIIIIIII', '0']
261406 denotes the start position of read5947 and 261427 the end. So it is possible to check if some reads are adjacent by first querying the interval tree and then checking whether the reads in the result set are adjacent. But is there another, more straightforward way to do it?
Thanks, all answers appreciated. Looking at it I suspect the guy meant cluster trees and not interval trees. Btw: any more info on the cluster tree datastructure anywhere? Google returns little and I'm not certain it's about the same thing.
If you just want reads that are adjacent to each other without overlapping, then I don't think a clustertree will be correct actually. A clustertree will give you clusters of reads that overlap by X distance. You'll just get a cluster of all the reads on a reference contig if you set X to 0.
@Click: It's a self-balancing binary tree underneath. The code is here: https://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/intervals/cluster.pyx @Damian: The implementation clusters intervals within X distance of each other, not overlaps of X distance so I think does what Click (okay, that name is terrible) wants. X=0 would give you only clusters of reads that overlap.
ahh I see. It's within X bp of each other. Good to know.
This seems to return reads that begin one position after the other ends; I specified (in a convoluted way, probably) that "For the reads to be considered in a cluster in the genome, one read has to start in the position x+1 if the other read ends in position x". Using "ClusterTree(0, 2)" seems to return reads that start at almost the same place. E.g. it returns clusters like "r5, r42" where r5 starts a 4929 and r42 starts at 4930. I want clusters where r5 ends at 4929 and r42 starts a 4930. Am I using cluster trees wrongly or what?
Yes, it will also include reads that overlap into the clusters. It's not clear to me what you want to do with those. If you have three reads: A: 1-10, B: 2-15, C: 11-20, you want A,C in a cluster but B as a separate cluster? If so, you'd need to add a second step where you separate out the overlaps in some way. Since you'll have smaller, discrete groups you should be able to implement your custom logic on those.
I only want A and C. Will have to think about this. Thanks.
The script I wrote will do what you want. However, it will not guarantee to give you the biggest clusters. There are going to be cases where a read is adjacent to multiple other clusters. Which cluster that read will belong to will be determine by order of start coordinates in my script.
If you think about it, there are many variations to this adjacent read clustering. Do you want the variation where you maximize the biggest cluster or do you want a variation where the average cluster size is maximized...etc.