Develop a tool that divides the hg19 human genome (chromosomes 1-22, X, Y, and M) into intervals that can be distributed among different cores or nodes. The goals should be:
- divide the genome such that each core has to deal with a roughly equal number of eligible base pairs (see gap note below)
- keep these intervals non-overlapping
- keep these intervals as close to large contiguous blocks as possible. For 100 cores, you can certainly have 120 intervals (someone has to get MT), but 500 intervals would be too many
Take into account genomic assembly gaps, which consist entirely of N bases. You can include the gaps in the intervals, but they do not add to the burden and therefore should not be counted in the size calculation. Here are some gaps from the UCSC Table Browser -> mapping and sequencing tracks -> gap:
bin  chrom  chromStart  chromEnd   ix    n  size      type             bridge
0    chr1   124535434   142535434  1271  N  18000000  heterochromatin  no
23   chr1   121535434   124535434  1270  N  3000000   centromere       no
76   chr1   3845268     3995268    47    N  150000    contig           no
85   chr1   13219912    13319912   154   N  100000    contig           no
89   chr1   17125658    17175658   196   N  50000     clone            yes
(Here is a copy of that table for 1:22XY)
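To make the gap rule concrete, here is a minimal Python sketch of one possible approach for a single chromosome. It is not a reference solution. The function names (`merge`, `eligible_blocks`, `split_equal`, `eligible_bp`) are made up for illustration, `CHR1_LENGTH` is the hg19 chr1 length taken as an assumption, and `CHR1_GAPS` holds only the five example rows above rather than the full gap table. Coordinates are treated as half-open `[chromStart, chromEnd)`.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)

# Assumption: hg19 chr1 length; only the five gap rows quoted above are used.
CHR1_LENGTH = 249250621
CHR1_GAPS: List[Interval] = [
    (124535434, 142535434),
    (121535434, 124535434),
    (3845268, 3995268),
    (13219912, 13319912),
    (17125658, 17175658),
]


def merge(gaps: List[Interval]) -> List[Interval]:
    """Sort gaps and merge any that touch or overlap."""
    out: List[Interval] = []
    for s, e in sorted(gaps):
        if out and s <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], e))
        else:
            out.append((s, e))
    return out


def eligible_blocks(length: int, gaps: List[Interval]) -> List[Interval]:
    """Return the non-gap stretches of a chromosome of the given length."""
    blocks, pos = [], 0
    for s, e in merge(gaps):
        if s > pos:
            blocks.append((pos, s))
        pos = max(pos, e)
    if pos < length:
        blocks.append((pos, length))
    return blocks


def eligible_bp(interval: Interval, blocks: List[Interval]) -> int:
    """Count the non-gap bases that fall inside an interval."""
    a, b = interval
    return sum(max(0, min(b, e) - max(a, s)) for s, e in blocks)


def split_equal(length: int, gaps: List[Interval], n_parts: int) -> List[Interval]:
    """Cut [0, length) into n_parts contiguous intervals of near-equal eligible bp."""
    blocks = eligible_blocks(length, gaps)
    total = sum(e - s for s, e in blocks)
    cuts, seen, k = [], 0, 1  # seen = eligible bp before the current block
    for s, e in blocks:
        # Place every cut whose cumulative target falls inside this block.
        while k < n_parts and (k * total) // n_parts <= seen + (e - s):
            cuts.append(s + (k * total) // n_parts - seen)
            k += 1
        seen += e - s
    bounds = [0] + cuts + [length]
    return list(zip(bounds[:-1], bounds[1:]))
```

Calling `split_equal(CHR1_LENGTH, CHR1_GAPS, 4)` returns four back-to-back intervals that cover the whole chromosome, gaps included, while only non-gap bases count toward each interval's size. A full solution would still need to handle all chromosomes and decide how many intervals each one gets, since an interval cannot span two chromosomes.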
A carefully considered metric for evaluating the solution is:
Score = (Std dev eligible bp per core) * (Number of intervals) * (Execution time in seconds) * (Lines of Code)
Lowest score wins!
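For anyone who wants to check their own entry, the formula above could be computed roughly like this. It is only a sketch: the function name `score` and its parameters are hypothetical, and it assumes the population standard deviation (`statistics.pstdev`), since the post does not say whether population or sample standard deviation is meant.

```python
import statistics
from typing import Sequence


def score(eligible_per_core: Sequence[int], n_intervals: int,
          seconds: float, lines_of_code: int) -> float:
    """Product of the four factors in the scoring formula (lower is better)."""
    # Assumption: population standard deviation; swap in statistics.stdev
    # if the sample standard deviation is intended.
    spread = statistics.pstdev(eligible_per_core)
    return spread * n_intervals * seconds * lines_of_code
```

Because the factors are multiplied, a perfectly balanced split (standard deviation of 0) scores 0 no matter how large the other three terms are.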
Looking forward to seeing your code!
Do you want the intervals to overlap?
No, I updated the post.