I am looking for code logic to extract intergenic sequences based on gene coordinates, and to assign the distal and proximal gene names and strands to each one. I am stuck on the complications caused by overlapping genes. Could you please share code logic that handles the case given below?
Gene Coordinates and Gene Details - Name and Strand
- Start - Stop GeneName Strand
- 10 - 19 Gene_1 +
- 27 - 46 Gene_2 +
- 27 - 89 Gene_3 -
- 110 - 250 Gene_4 +
- 120 - 340 Gene_5 +
- 180 - 350 Gene_6 -
- 260 - 397 Gene_7 -
- 425 - 625 Gene_8 +
- 680 - 2 Gene_9 -
Ideally, this is the output I am expecting:
- IGNo Start - End DistalGeneName ProximalGeneName DistalGeneStrand ProximalGeneStrand
- IG1 3 - 9 Gene_9 - Gene_1 + (Comparison with the last start and stop positions to get the actual IG coordinates)
- IG2 20 - 26 Gene_1 + Gene_3 - (When genes share the same start coordinate, the longer gene is the proximal gene)
- IG3 90 - 109 Gene_3 - Gene_4 +
- IG4 398 - 424 Gene_7 - Gene_8 + (This is the difficulty: how to skip the intermediate overlapping genes)
- IG5 626 - 679 Gene_8 + Gene_9 -
In some cases there can be many overlapping genes, and I am having difficulty handling that in the logic.
If you can share code that resolves this, or explain logic that I can use, it would be awesome and I would be very thankful.
Is this a circular genome? This post might help.
Yes, I am treating it as a circular genome. Sorry, I didn't mention that in the question.
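Here is a minimal sketch of one way to approach this in Python, not a definitive solution. The idea is to split any gene that wraps around the origin (start > stop, like Gene_9) into two linear pieces, sort the pieces by start with the longer one first on ties, and then sweep left to right while remembering the furthest end seen so far and which gene owns it. A gap is reported only when the next start lies beyond that furthest end, which is what skips the intermediate overlapping genes. The names `Gene`, `Intergenic`, `intergenic_regions` and the genome length of 700 are my own assumptions for illustration; the question does not give the genome length. Coordinates are assumed to be 1-based and inclusive.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    start: int
    stop: int
    name: str
    strand: str


class Intergenic(NamedTuple):
    start: int
    end: int
    distal_name: str
    proximal_name: str
    distal_strand: str
    proximal_strand: str


def intergenic_regions(genes: List[Gene], genome_length: int) -> List[Intergenic]:
    """Return the gaps between genes on a circular genome (1-based, inclusive)."""
    # Split genes that cross the origin (start > stop) into two linear pieces.
    pieces = []
    for g in genes:
        if g.start > g.stop:
            pieces.append((g.start, genome_length, g))
            pieces.append((1, g.stop, g))
        else:
            pieces.append((g.start, g.stop, g))
    if not pieces:
        return []

    # Sort by start; on equal starts the longer piece comes first,
    # so the longer gene becomes the proximal gene of the preceding gap.
    pieces.sort(key=lambda p: (p[0], -p[1]))

    gaps = []
    first_start, max_end, distal = pieces[0]
    first_gene = distal
    for start, stop, gene in pieces[1:]:
        # A gap exists only if this piece starts beyond everything seen so far.
        if start > max_end + 1:
            gaps.append(Intergenic(max_end + 1, start - 1,
                                   distal.name, gene.name,
                                   distal.strand, gene.strand))
        # Track the furthest end and its gene (assumption: first one wins on ties).
        if stop > max_end:
            max_end, distal = stop, gene

    # If no gene covers the origin, one more gap runs across it;
    # it is reported with start > end when it actually wraps.
    if first_start > 1 or max_end < genome_length:
        gap_start = max_end + 1 if max_end < genome_length else 1
        gap_end = first_start - 1 if first_start > 1 else genome_length
        gaps.append(Intergenic(gap_start, gap_end,
                               distal.name, first_gene.name,
                               distal.strand, first_gene.strand))
    return gaps


# Gene table from the question; GENOME_LENGTH is a hypothetical value.
GENOME_LENGTH = 700
GENES = [
    Gene(10, 19, "Gene_1", "+"), Gene(27, 46, "Gene_2", "+"),
    Gene(27, 89, "Gene_3", "-"), Gene(110, 250, "Gene_4", "+"),
    Gene(120, 340, "Gene_5", "+"), Gene(180, 350, "Gene_6", "-"),
    Gene(260, 397, "Gene_7", "-"), Gene(425, 625, "Gene_8", "+"),
    Gene(680, 2, "Gene_9", "-"),
]

for number, ig in enumerate(intergenic_regions(GENES, GENOME_LENGTH), start=1):
    print(f"IG{number}", ig.start, "-", ig.end, ig.distal_name,
          ig.distal_strand, ig.proximal_name, ig.proximal_strand)
```

With the table from the question this should reproduce the five expected rows (3 - 9, 20 - 26, 90 - 109, 398 - 424, 626 - 679). Because the wrapping gene is split at the origin, the gap coordinates do not depend on the exact genome length as long as it is at least the largest coordinate. To get the actual sequences, you would then slice the genome string with each gap's coordinates, taking care with 1-based versus 0-based indexing.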