Hi there
This is a bit conceptual, but say I wanted to compare human and mouse, looking at human chr1 and chr2 against mouse chr1 and chr2. This would look like so:
```
                   human
             chr1        chr2
         +-----------+----------+
         | XXX       |          |
    chr1 |    X      |          |
         |     X     |    X     |
mouse    |           |          |
         +-----------+----------+
         |           |  X       |
    chr2 |   X       |          |
         |           |      X   |
         |           |          |
         +-----------+----------+
```
If I have data in something like a BEDPE file, naively, I could go over all the matches in the file and check:
- If it matches human chr1 or human chr2, then see if it also matches mouse chr1 or chr2, and if so, emit it
Or in two steps:
- Query for lines that match human chr1, and if so, see if they match mouse chr1 or chr2
- Query for lines that match human chr2, and if so, see if they match mouse chr1 or chr2
That makes it clearer that a query for a human coordinate also needs the full set of mouse ranges I am interested in.
In this scenario I also don't need to query in the mouse "direction"; going from the human direction is sufficient.
This seems easy enough, but I am trying to consider more efficient options too, maybe ones where I don't have to load the whole file into memory.
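To make the naive version concrete, here is a minimal sketch in Python. It assumes a tab-separated BEDPE with the human chromosome in column 1 and the mouse chromosome in column 4; the function name `filter_pairs` and the example rows are made up for illustration. It reads one line at a time, so it never holds the whole file in memory, but it still has to scan every line.

```python
import io

def filter_pairs(lines, human_chroms, mouse_chroms):
    """Yield BEDPE lines whose human side (column 1) and mouse side
    (column 4) are both in the chromosome sets of interest."""
    for line in lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[0] in human_chroms and fields[3] in mouse_chroms:
            yield line

# Made-up example data standing in for an open BEDPE file handle.
example = io.StringIO(
    "chr1\t100\t200\tchr1\t500\t600\n"
    "chr1\t300\t400\tchr10\t50\t150\n"
    "chr2\t100\t200\tchr2\t700\t800\n"
    "chr3\t100\t200\tchr1\t900\t950\n"
)

kept = list(filter_pairs(example, {"chr1", "chr2"}, {"chr1", "chr2"}))
for line in kept:
    print(line, end="")
```

With a real file, `example` would be replaced by `open(path)`; the cost is then one full pass over the file per query, which is what an index would avoid.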
One idea I had involves tabix indexing. Instead of a single file, I sort the data twice and make two bgzip-compressed, tabix-indexed files (tabix needs bgzip input; -s, -b and -e give the sequence, start and end columns, and -0 marks the coordinates as 0-based):
sort -k1,1 -k2,2n input.bedpe | bgzip > input.human.bedpe.gz
tabix -0 -s1 -b2 -e3 input.human.bedpe.gz
sort -k4,4 -k5,5n input.bedpe | bgzip > input.mouse.bedpe.gz
tabix -0 -s4 -b5 -e6 input.mouse.bedpe.gz
Then to query, I actually do query it in both directions:
tabix input.human.bedpe.gz chr1 chr2 > human_results
tabix input.mouse.bedpe.gz chr1 chr2 > mouse_results
intersect human_results and mouse_results > final
I think this final set of lines would contain my desired output.
This does not seem very efficient, though, because the initial output, before the intersection, includes things like human chr1 matching mouse chr10, which I don't care about. I could also filter while outputting, so it is more like this:
tabix input.human.bedpe.gz chr1 chr2 | filter_for_mouse_regions_of_interest_e.g._mouse_chr1_and_mouse_chr2 > final
This seems like a reasonable query format. It also doesn't seem like I have to query the file in both directions?
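The filter step in that pipeline could be sketched roughly like this in Python. This is only an illustration under assumptions: the mouse side is in columns 4 to 6, coordinates are 0-based half-open as in BEDPE, and the names `MOUSE_REGIONS`, `mouse_side_matches` and `filter_stream` are hypothetical, not part of tabix or any library.

```python
# Hypothetical regions of interest on the mouse side: chromosome mapped to a
# list of (start, end) intervals, or None to accept the whole chromosome.
MOUSE_REGIONS = {"chr1": None, "chr2": [(0, 50_000_000)]}

def mouse_side_matches(fields, regions):
    """Return True if the mouse interval (columns 4-6) hits a region of interest."""
    chrom, start, end = fields[3], int(fields[4]), int(fields[5])
    if chrom not in regions:
        return False
    intervals = regions[chrom]
    if intervals is None:
        return True
    # Half-open interval overlap test.
    return any(start < r_end and end > r_start for r_start, r_end in intervals)

def filter_stream(lines, regions):
    """Yield only the lines whose mouse side matches; works on any iterable of lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 6 and mouse_side_matches(fields, regions):
            yield line

# Made-up lines standing in for what the human-side tabix query would print.
tabix_output = [
    "chr1\t0\t10\tchr1\t5\t15\n",
    "chr1\t0\t10\tchr2\t60000000\t60000100\n",
    "chr2\t0\t10\tchr2\t100\t200\n",
    "chr1\t0\t10\tchr10\t1\t2\n",
]
final = list(filter_stream(tabix_output, MOUSE_REGIONS))
```

In the actual pipeline the iterable would be `sys.stdin`, so each line is checked as tabix emits it and the unwanted human chr1 to mouse chr10 rows are dropped without ever being stored.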
Does this seem like a reasonable system? If I had a proper database system, would there be an even better way to do this? Is there any literature, or are there keywords to look for, on topics like this?
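To make the database question concrete, this is roughly what I imagine, sketched with SQLite from Python. The table name `pairs`, the column names and the rows are all made up for illustration, and I have not checked whether this index is actually the best choice. The idea is that a composite index over both chromosome columns lets one query constrain the human and mouse sides at once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE pairs (
           h_chrom TEXT, h_start INTEGER, h_end INTEGER,
           m_chrom TEXT, m_start INTEGER, m_end INTEGER
       )"""
)
# One index covering both sides, so a query on (human chrom, mouse chrom)
# does not need two passes and an intersection.
conn.execute("CREATE INDEX pairs_by_chroms ON pairs (h_chrom, m_chrom, h_start)")

# Made-up rows standing in for the BEDPE contents.
rows = [
    ("chr1", 100, 200, "chr1", 500, 600),
    ("chr1", 300, 400, "chr10", 50, 150),
    ("chr2", 100, 200, "chr2", 700, 800),
    ("chr3", 100, 200, "chr1", 900, 950),
]
conn.executemany("INSERT INTO pairs VALUES (?, ?, ?, ?, ?, ?)", rows)

hits = conn.execute(
    """SELECT * FROM pairs
       WHERE h_chrom IN (?, ?) AND m_chrom IN (?, ?)
       ORDER BY h_chrom, h_start""",
    ("chr1", "chr2", "chr1", "chr2"),
).fetchall()
for hit in hits:
    print(hit)
```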