Programming Challenge: Divide The Human Genome Among X Cores, Taking Into Account Gaps
3
4
Entering edit mode
11.7 years ago

Develop a tool that divides the hg19 human genome (1-22XYM) to distribute among different cores or nodes. The goals should be:

  • divide the genome such that each core has to deal with a roughly equal number of eligible base pairs (see gap note below)
  • keep these intervals non-overlapping
  • keep these intervals as close to large contiguous blocks as possible, for 100 cores, you can certainly have 120 intervals (someone has to get MT), but 500 intervals would be too much
  • take account genomic assembly gaps which are all NNN . You can include the gaps in the intervals, but they do not add to the burden and therefore should not be considered in the size calculation. Here some some gaps from UCSC table browser -> mapping and sequencing tracks -> gap

    bin chrom chromStart chromEnd ix n size type bridge

    0 chr1 124535434 142535434 1271 N 18000000 heterochromatin no

    23 chr1 121535434 124535434 1270 N 3000000 centromere no

    76 chr1 3845268 3995268 47 N 150000 contig no

    85 chr1 13219912 13319912 154 N 100000 contig no

    89 chr1 17125658 17175658 196 N 50000 clone yes

(Here is a copy of that table for 1:22XY)

A carefully considered metric for evaluating the solution is:

Score = (Std dev eligible bp per core) * (Number of intervals) * (Execution time in seconds) * (Lines of Code)

Lowest score wins!

Looking forward to seeing your code!

programming • 6.0k views
ADD COMMENT
0
Entering edit mode

do you want the intervals to overlap ?

ADD REPLY
0
Entering edit mode

no, updated post

ADD REPLY
5
Entering edit mode
11.7 years ago

Here is my solution. I put my code on github https://github.com/lindenb/jvarkit#biostar77828 https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/biostar/Biostar77828.java (requires the picard library)

The makefile: I use bedtools to substract the gaps from hg19. I removed the "*hap" chromosomes.

all: tmp.result.txt

tmp.result.txt : dist/biostar77828.jar tmp3.bed
    java -jar dist/biostar77828.jar VERBOSITY=DEBUG N_ITERATIONS=10000000 MIN_CORE=20 MAX_CORE=30 < tmp3.bed > $@

dist/biostar77828.jar: src/main/java/com/github/lindenb/jvarkit/tools/biostar/Biostar77828.java
     ant biostar77828

tmp3.bed: tmp1.bed tmp2.bed
    /commun/data/packages/BEDTools-Version-2.16.2/bin/subtractBed -a tmp1.bed -b tmp2.bed > $@

tmp1.bed:
     curl  "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz" |\
         gunzip -c |\
        grep -v hap |\
        awk -F '    ' '{printf("%s\t0\t%s\n",$$1,$$2);}' |\
         LC_ALL=C sort -t '    ' -k1,1 -k2,2n -k3,3n > $@
tmp2.bed:
     curl  "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz" |\
         gunzip -c |\
        grep -v hap |\
        cut -d '    ' -f2,3,4 |\
         LC_ALL=C sort -t '    ' -k1,1 -k2,2n -k3,3n > $@

I use a strategy using a random generator of solution, looping for N generations and printing the best one:

        Solution best=null;
        for(long generation=0;generation< this.N_ITERATIONS;++generation)
            {
            Solution sol=createSolution();

            if(best==null || sol.compareTo(best)<0)
                {
                best=sol;
                }
            }

Creating a solution: a random number of 'Core' objects is created and filled with a random segment.

    int n_cores=
            MIN_CORE+(MIN_CORE>=MAX_CORE?0:this.random.nextInt(MAX_CORE-MIN_CORE))
    (...)
    Collections.shuffle(segments, this.random);
    (...)
    for(int i=0;i< n_cores && i< segments.size();++i)
        {
        //get last
        Core core=new Core();
        core.segments.add(segments.get(i));
        sol.cores.add(core);
        }

then each BED segment is added in a Core with the smallest variation of the mean number of bases

            for(Core core:sol.cores)
                {
                if(best==null ||
                    (Math.abs((core.length()+seg.size())-mean) < Math.abs((best.length()+seg.size())-mean)))
                    {
                    best=core;
                    }

                }
            best.segments.add(seg);

output:

https://gist.github.com/lindenb/6130880/#file-biostar77828-bed

ADD COMMENT
3
Entering edit mode
11.7 years ago

I haven't calculated scores, but I applied a first-fit algorithm on a regions-minus-gaps file generated with BEDOPS, where the regions are either sorted in descending order by size or shuffled randomly, before binning.

Here are the hg19 gaps I used:

chr1 0 10000
chr1 177417 227417
chr1 267719 317719
chr1 471368 521368
chr1 2634220 2684220
chr1 3845268 3995268
chr1 13052998 13102998
chr1 13219912 13319912
chr1 13557162 13607162
chr1 17125658 17175658
chr1 29878082 30028082
chr1 103863906 103913906
chr1 120697156 120747156
chr1 120936695 121086695
chr1 121485434 121535434
chr1 121535434 124535434
chr1 124535434 142535434
chr1 142731022 142781022
chr1 142967761 143117761
chr1 143292816 143342816
chr1 143544525 143644525
chr1 143771002 143871002
chr1 144095783 144145783
chr1 144224481 144274481
chr1 144401744 144451744
chr1 144622413 144672413
chr1 144710724 144810724
chr1 145833118 145883118
chr1 146164650 146214650
chr1 146253299 146303299
chr1 148026038 148176038
chr1 148361358 148511358
chr1 148684147 148734147
chr1 148954460 149004460
chr1 149459645 149509645
chr1 205922707 206072707
chr1 206332221 206482221
chr1 223747846 223797846
chr1 235192211 235242211
chr1 248908210 249058210
chr1 249240621 249250621
chr10 0 10000
chr10 10000 60000
chr10 17974675 18024675
chr10 38818835 38868835
chr10 39154935 39254935
chr10 39254935 42254935
chr10 42254935 42354935
chr10 42546687 42596687
chr10 46426964 46476964
chr10 47429169 47529169
chr10 47792476 47892476
chr10 48055707 48105707
chr10 49095536 49195536
chr10 51137410 51187410
chr10 51398845 51448845
chr10 125869472 125919472
chr10 128616069 128766069
chr10 133381404 133431404
chr10 133677527 133727527
chr10 135524747 135534747
chr11 0 10000
chr11 10000 60000
chr11 1162759 1212759
chr11 50783853 50833853
chr11 50833853 51040853
chr11 51040853 51090853
chr11 51594205 51644205
chr11 51644205 54644205
chr11 54644205 54694205
chr11 69089801 69139801
chr11 69724695 69774695
chr11 87688378 87738378
chr11 96287584 96437584
chr11 134946516 134996516
chr11 134996516 135006516
chr12 0 10000
chr12 10000 60000
chr12 95739 145739
chr12 7189876 7239876
chr12 34856694 37856694
chr12 109373470 109423470
chr12 122530623 122580623
chr12 132706992 132806992
chr12 133841895 133851895
chr13 0 10000
chr13 10000 16000000
chr13 16000000 19000000
chr13 19000000 19020000
chr13 86760324 86910324
chr13 112353994 112503994
chr13 114325993 114425993
chr13 114639948 114739948
chr13 115109878 115159878
chr13 115159878 115169878
chr14 0 10000
chr14 10000 16000000
chr14 16000000 19000000
chr14 107289540 107339540
chr14 107339540 107349540
chr15 0 10000
chr15 10000 17000000
chr15 17000000 20000000
chr15 20894633 20935075
chr15 21398819 21885000
chr15 22212114 22262114
chr15 22596193 22646193
chr15 23514853 23564853
chr15 29159443 29209443
chr15 82829645 82879645
chr15 84984473 85034473
chr15 102521392 102531392
chr16 0 10000
chr16 10000 60000
chr16 8636921 8686921
chr16 34023150 34173150
chr16 35285801 35335801
chr16 35335801 38335801
chr16 38335801 46335801
chr16 46335801 46385801
chr16 88389383 88439383
chr16 90294753 90344753
chr16 90344753 90354753
chr17 296626 396626
chr17 21566608 21666608
chr17 22263006 25263006
chr17 34675848 34725848
chr17 62410760 62460760
chr17 77546461 77596461
chr17 79709049 79759049
chr17_ctg5_hap1 1256794 1306794
chr17_ctg5_hap1 1588968 1638968
chr18 0 10000
chr18 15410898 15460898
chr18 15460898 18460898
chr18 18460898 18510898
chr18 52059136 52209136
chr18 72283353 72333353
chr18 75721820 75771820
chr18 78017248 78067248
chr18 78067248 78077248
chr19 0 10000
chr19 10000 60000
chr19 7346004 7396004
chr19 8687198 8737198
chr19 20523415 20573415
chr19 24631782 24681782
chr19 24681782 27681782
chr19 27681782 27731782
chr19 59118983 59128983
chr2 0 10000
chr2 3529312 3579312
chr2 5018788 5118788
chr2 16279724 16329724
chr2 21153113 21178113
chr2 87668206 87718206
chr2 89630436 89830436
chr2 90321525 90371525
chr2 90545103 91545103
chr2 91545103 91595103
chr2 92326171 95326171
chr2 110109337 110251337
chr2 149690582 149790582
chr2 234003741 234053741
chr2 239801978 239831978
chr2 240784132 240809132
chr2 243102476 243152476
chr2 243189373 243199373
chr20 0 10000
chr20 10000 60000
chr20 26319569 26369569
chr20 26369569 29369569
chr20 29369569 29419569
chr20 29653908 29803908
chr20 34897085 34947085
chr20 61091437 61141437
chr20 61213369 61263369
chr20 62965520 63015520
chr20 63015520 63025520
chr21 0 10000
chr21 10000 5211193
chr21 5211193 9411193
chr21 9595548 9645548
chr21 9775437 9825437
chr21 10034920 10084920
chr21 10215976 10365976
chr21 10647896 10697896
chr21 11188129 11238129
chr21 11238129 11288129
chr21 11288129 14288129
chr21 14288129 14338129
chr21 42955559 43005559
chr21 44632664 44682664
chr21 48119895 48129895
chr22 0 10000
chr22 10000 13000000
chr22 13000000 16000000
chr22 16000000 16050000
chr22 16697850 16847850
chr22 20509431 20609431
chr22 50364777 50414777
chr22 51244566 51294566
chr22 51294566 51304566
chr3 0 10000
chr3 10000 60000
chr3 66170270 66270270
chr3 90504854 93504854
chr3 194041961 194047251
chr3 197962430 198012430
chr3 198012430 198022430
chr4 0 10000
chr4 1423146 1478646
chr4 8799203 8818203
chr4 9274642 9324642
chr4 31820917 31837417
chr4 32834638 32840638
chr4 40296396 40297096
chr4 49338941 49488941
chr4 49660117 52660117
chr4 59739333 59789333
chr4 75427379 75452279
chr4 191044276 191144276
chr4 191144276 191154276
chr5 0 10000
chr5 17530657 17580657
chr5 46405641 49405641
chr5 91636128 91686128
chr5 138787073 138837073
chr5 155138727 155188727
chr5 180905260 180915260
chr6 0 10000
chr6 10000 60000
chr6 58087659 58137659
chr6 58780166 58830166
chr6 58830166 61830166
chr6 61830166 61880166
chr6 62128589 62178589
chr6 95680543 95830543
chr6 157559467 157609467
chr6 157641300 157691300
chr6 167942073 168042073
chr6 170279972 170329972
chr6 171055067 171105067
chr6 171105067 171115067
chr6_apd_hap1 207807 252060
chr6_apd_hap1 327074 356458
chr6_apd_hap1 448574 536172
chr6_apd_hap1 588761 692953
chr6_apd_hap1 944435 963558
chr6_apd_hap1 970040 1029417
chr6_apd_hap1 1173934 1198528
chr6_apd_hap1 1307951 1317697
chr6_apd_hap1 1341208 1344139
chr6_apd_hap1 1451581 1556172
chr6_apd_hap1 1582549 1598989
chr6_apd_hap1 1736596 1786349
chr6_apd_hap1 1833811 1937682
chr6_apd_hap1 2009922 2164480
chr6_apd_hap1 2169002 2254169
chr6_apd_hap1 2359974 2761234
chr6_apd_hap1 2821679 2858431
chr6_apd_hap1 2894657 3037482
chr6_apd_hap1 3122973 3136480
chr6_apd_hap1 3180323 3355295
chr6_apd_hap1 3370179 3415446
chr6_apd_hap1 3491066 3594101
chr6_apd_hap1 3638667 3642320
chr6_apd_hap1 3696623 3736194
chr6_apd_hap1 3762870 3766192
chr6_apd_hap1 3790366 3970602
chr6_apd_hap1 4220467 4274176
chr6_apd_hap1 4390029 4597885
chr6_dbb_hap3 138055 245960
chr6_dbb_hap3 622457 640380
chr6_dbb_hap3 1337861 1344119
chr6_dbb_hap3 1836690 1837200
chr6_dbb_hap3 2117295 2119813
chr6_dbb_hap3 2606257 2723615
chr6_dbb_hap3 3407000 3481304
chr6_dbb_hap3 3736386 3772614
chr6_dbb_hap3 4206542 4226964
chr6_dbb_hap3 4354126 4376794
chr6_mann_hap4 145852 184243
chr6_mann_hap4 1388812 1438812
chr6_mann_hap4 2047420 2056121
chr6_mann_hap4 2252213 2277026
chr6_mann_hap4 2312291 2324610
chr6_mann_hap4 2517052 2540962
chr6_mann_hap4 2772006 2810953
chr6_mann_hap4 2921096 2929556
chr6_mann_hap4 3062107 3151453
chr6_mann_hap4 3174271 3227203
chr6_mann_hap4 3261361 3374492
chr6_mann_hap4 3412258 3450867
chr6_mann_hap4 3553455 3557741
chr6_mann_hap4 3895092 3926341
chr6_mann_hap4 4236950 4245095
chr6_mann_hap4 4441658 4480941
chr6_mcf_hap5 135141 138043
chr6_mcf_hap5 179429 186757
chr6_mcf_hap5 229133 282732
chr6_mcf_hap5 383988 404415
chr6_mcf_hap5 491590 560421
chr6_mcf_hap5 676599 685588
chr6_mcf_hap5 994250 1009296
chr6_mcf_hap5 1177759 1292220
chr6_mcf_hap5 1562952 1726699
chr6_mcf_hap5 1815769 1838106
chr6_mcf_hap5 2091199 2189158
chr6_mcf_hap5 2241824 2251998
chr6_mcf_hap5 2300850 2305647
chr6_mcf_hap5 2712451 2811903
chr6_mcf_hap5 2947576 2958999
chr6_mcf_hap5 3154498 3188719
chr6_mcf_hap5 3329672 3355524
chr6_mcf_hap5 3602941 3659279
chr6_mcf_hap5 3899294 3929849
chr6_mcf_hap5 4149579 4149626
chr6_mcf_hap5 4284353 4302274
chr6_mcf_hap5 4422172 4594253
chr6_qbl_hap6 521900 681210
chr6_qbl_hap6 2680914 2732064
chr6_qbl_hap6 3023176 3049607
chr6_qbl_hap6 3316246 3369039
chr6_qbl_hap6 3993031 4020006
chr6_ssto_hap7 861800 941242
chr6_ssto_hap7 1727717 1744408
chr6_ssto_hap7 1836632 1838892
chr6_ssto_hap7 1983561 1997824
chr6_ssto_hap7 2052735 2087415
chr6_ssto_hap7 2101474 2113875
chr6_ssto_hap7 2138491 2303018
chr6_ssto_hap7 2411247 2461193
chr6_ssto_hap7 3040184 3054966
chr6_ssto_hap7 3093568 3154699
chr6_ssto_hap7 3191196 3214366
chr6_ssto_hap7 3368931 3445987
chr6_ssto_hap7 4370564 4420564
chr6_ssto_hap7 4580410 4610218
chr6_ssto_hap7 4639806 4661556
chr6_ssto_hap7 4723509 4774367
chr6_ssto_hap7 4783798 4836049
chr7 0 10000
chr7 232484 282484
chr7 50370631 50410631
chr7 58054331 61054331
chr7 61310513 61360513
chr7 61460465 61510465
chr7 61677020 61727020
chr7 61917157 61967157
chr7 74715724 74765724
chr7 100556043 100606043
chr7 130154523 130254523
chr7 139379377 139404377
chr7 142048195 142098195
chr7 142276197 142326197
chr7 143347897 143397897
chr7 154270634 154370634
chr7 159128663 159138663
chr8 0 10000
chr8 7474649 7524649
chr8 12091854 12141854
chr8 43838887 46838887
chr8 48130499 48135599
chr8 86576451 86726451
chr8 142766515 142816515
chr8 145332588 145432588
chr8 146304022 146354022
chr8 146354022 146364022
chr9 0 10000
chr9 39663686 39713686
chr9 39974796 40024796
chr9 40233029 40283029
chr9 40425834 40475834
chr9 40940341 40990341
chr9 41143214 41193214
chr9 41365793 41415793
chr9 42613955 42663955
chr9 43213698 43313698
chr9 43946569 43996569
chr9 44676646 44726646
chr9 44908293 44958293
chr9 45250203 45350203
chr9 45815521 45865521
chr9 46216430 46266430
chr9 46461039 46561039
chr9 47060133 47160133
chr9 47317679 47367679
chr9 47367679 50367679
chr9 50367679 65367679
chr9 65367679 65467679
chr9 65918360 65968360
chr9 66192215 66242215
chr9 66404656 66454656
chr9 66614195 66664195
chr9 66863343 66913343
chr9 67107834 67207834
chr9 67366296 67516296
chr9 67987998 68137998
chr9 68514181 68664181
chr9 68838946 68988946
chr9 69278385 69328385
chr9 70010542 70060542
chr9 70218729 70318729
chr9 70506535 70556535
chr9 70735468 70835468
chr9 92343416 92443416
chr9 92528796 92678796
chr9 133073060 133223060
chr9 137041193 137091193
chr9 139166997 139216997
chr9 141153431 141203431
chr9 141203431 141213431
chrX 0 10000
chrX 10000 60000
chrX 94821 144821
chrX 231384 281384
chrX 1047557 1097557
chrX 1134113 1184113
chrX 1264234 1314234
chrX 2068238 2118238
chrX 7623882 7673882
chrX 10738674 10788674
chrX 37098256 37148256
chrX 49242997 49292997
chrX 49974173 50024173
chrX 52395914 52445914
chrX 58582012 58632012
chrX 58632012 61632012
chrX 61632012 61682012
chrX 76653692 76703692
chrX 113517668 113567668
chrX 115682290 115732290
chrX 120013235 120063235
chrX 143507324 143557324
chrX 148906424 148956424
chrX 149032062 149082062
chrX 152277099 152327099
chrX 155260560 155270560
chrY 0 10000
chrY 44821 94821
chrY 181384 231384
chrY 997557 1047557
chrY 1084113 1134113
chrY 1214234 1264234
chrY 2018238 2068238
chrY 8914955 8964955
chrY 9241322 9291322
chrY 10104553 13104553
chrY 13143954 13193954
chrY 13748578 13798578
chrY 20143885 20193885
chrY 22369679 22419679
chrY 23901428 23951428
chrY 28819361 58819361
chrY 58917656 58967656
chrY 59363566 59373566
view raw hg19.gaps.bed hosted with ❤ by GitHub

Here are the hg19 regions ("extents") I used:

chr1 0 249250621
chr10 0 135534747
chr11 0 135006516
chr12 0 133851895
chr13 0 115169878
chr14 0 107349540
chr15 0 102531392
chr16 0 90354753
chr17 0 81195210
chr18 0 78077248
chr19 0 59128983
chr2 0 243199373
chr20 0 63025520
chr21 0 48129895
chr22 0 51304566
chr3 0 198022430
chr4 0 191154276
chr5 0 180915260
chr6 0 171115067
chr7 0 159138663
chr8 0 146364022
chr9 0 141213431
chrM 0 16571
chrX 0 155270560
chrY 0 59373566

To generate the regions-minus-gaps file, adding size values:

$ bedops --difference hg19.extents.bed hg19.gaps.bed \
    | awk '{ print $0"\t"($3-$2) }' - \
    > hg19.gapped_extents.bed

Here is the regions-minus-gaps file that I get:

chr1 10000 177417 167417
chr1 227417 267719 40302
chr1 317719 471368 153649
chr1 521368 2634220 2112852
chr1 2684220 3845268 1161048
chr1 3995268 13052998 9057730
chr1 13102998 13219912 116914
chr1 13319912 13557162 237250
chr1 13607162 17125658 3518496
chr1 17175658 29878082 12702424
chr1 30028082 103863906 73835824
chr1 103913906 120697156 16783250
chr1 120747156 120936695 189539
chr1 121086695 121485434 398739
chr1 142535434 142731022 195588
chr1 142781022 142967761 186739
chr1 143117761 143292816 175055
chr1 143342816 143544525 201709
chr1 143644525 143771002 126477
chr1 143871002 144095783 224781
chr1 144145783 144224481 78698
chr1 144274481 144401744 127263
chr1 144451744 144622413 170669
chr1 144672413 144710724 38311
chr1 144810724 145833118 1022394
chr1 145883118 146164650 281532
chr1 146214650 146253299 38649
chr1 146303299 148026038 1722739
chr1 148176038 148361358 185320
chr1 148511358 148684147 172789
chr1 148734147 148954460 220313
chr1 149004460 149459645 455185
chr1 149509645 205922707 56413062
chr1 206072707 206332221 259514
chr1 206482221 223747846 17265625
chr1 223797846 235192211 11394365
chr1 235242211 248908210 13665999
chr1 249058210 249240621 182411
chr10 60000 17974675 17914675
chr10 18024675 38818835 20794160
chr10 38868835 39154935 286100
chr10 42354935 42546687 191752
chr10 42596687 46426964 3830277
chr10 46476964 47429169 952205
chr10 47529169 47792476 263307
chr10 47892476 48055707 163231
chr10 48105707 49095536 989829
chr10 49195536 51137410 1941874
chr10 51187410 51398845 211435
chr10 51448845 125869472 74420627
chr10 125919472 128616069 2696597
chr10 128766069 133381404 4615335
chr10 133431404 133677527 246123
chr10 133727527 135524747 1797220
chr11 60000 1162759 1102759
chr11 1212759 50783853 49571094
chr11 51090853 51594205 503352
chr11 54694205 69089801 14395596
chr11 69139801 69724695 584894
chr11 69774695 87688378 17913683
chr11 87738378 96287584 8549206
chr11 96437584 134946516 38508932
chr12 60000 95739 35739
chr12 145739 7189876 7044137
chr12 7239876 34856694 27616818
chr12 37856694 109373470 71516776
chr12 109423470 122530623 13107153
chr12 122580623 132706992 10126369
chr12 132806992 133841895 1034903
chr13 19020000 86760324 67740324
chr13 86910324 112353994 25443670
chr13 112503994 114325993 1821999
chr13 114425993 114639948 213955
chr13 114739948 115109878 369930
chr14 19000000 107289540 88289540
chr15 20000000 20894633 894633
chr15 20935075 21398819 463744
chr15 21885000 22212114 327114
chr15 22262114 22596193 334079
chr15 22646193 23514853 868660
chr15 23564853 29159443 5594590
chr15 29209443 82829645 53620202
chr15 82879645 84984473 2104828
chr15 85034473 102521392 17486919
chr16 60000 8636921 8576921
chr16 8686921 34023150 25336229
chr16 34173150 35285801 1112651
chr16 46385801 88389383 42003582
chr16 88439383 90294753 1855370
chr17 0 296626 296626
chr17 396626 21566608 21169982
chr17 21666608 22263006 596398
chr17 25263006 34675848 9412842
chr17 34725848 62410760 27684912
chr17 62460760 77546461 15085701
chr17 77596461 79709049 2112588
chr17 79759049 81195210 1436161
chr18 10000 15410898 15400898
chr18 18510898 52059136 33548238
chr18 52209136 72283353 20074217
chr18 72333353 75721820 3388467
chr18 75771820 78017248 2245428
chr19 60000 7346004 7286004
chr19 7396004 8687198 1291194
chr19 8737198 20523415 11786217
chr19 20573415 24631782 4058367
chr19 27731782 59118983 31387201
chr2 10000 3529312 3519312
chr2 3579312 5018788 1439476
chr2 5118788 16279724 11160936
chr2 16329724 21153113 4823389
chr2 21178113 87668206 66490093
chr2 87718206 89630436 1912230
chr2 89830436 90321525 491089
chr2 90371525 90545103 173578
chr2 91595103 92326171 731068
chr2 95326171 110109337 14783166
chr2 110251337 149690582 39439245
chr2 149790582 234003741 84213159
chr2 234053741 239801978 5748237
chr2 239831978 240784132 952154
chr2 240809132 243102476 2293344
chr2 243152476 243189373 36897
chr20 60000 26319569 26259569
chr20 29419569 29653908 234339
chr20 29803908 34897085 5093177
chr20 34947085 61091437 26144352
chr20 61141437 61213369 71932
chr20 61263369 62965520 1702151
chr21 9411193 9595548 184355
chr21 9645548 9775437 129889
chr21 9825437 10034920 209483
chr21 10084920 10215976 131056
chr21 10365976 10647896 281920
chr21 10697896 11188129 490233
chr21 14338129 42955559 28617430
chr21 43005559 44632664 1627105
chr21 44682664 48119895 3437231
chr22 16050000 16697850 647850
chr22 16847850 20509431 3661581
chr22 20609431 50364777 29755346
chr22 50414777 51244566 829789
chr3 60000 66170270 66110270
chr3 66270270 90504854 24234584
chr3 93504854 194041961 100537107
chr3 194047251 197962430 3915179
chr4 10000 1423146 1413146
chr4 1478646 8799203 7320557
chr4 8818203 9274642 456439
chr4 9324642 31820917 22496275
chr4 31837417 32834638 997221
chr4 32840638 40296396 7455758
chr4 40297096 49338941 9041845
chr4 49488941 49660117 171176
chr4 52660117 59739333 7079216
chr4 59789333 75427379 15638046
chr4 75452279 191044276 115591997
chr5 10000 17530657 17520657
chr5 17580657 46405641 28824984
chr5 49405641 91636128 42230487
chr5 91686128 138787073 47100945
chr5 138837073 155138727 16301654
chr5 155188727 180905260 25716533
chr6 60000 58087659 58027659
chr6 58137659 58780166 642507
chr6 61880166 62128589 248423
chr6 62178589 95680543 33501954
chr6 95830543 157559467 61728924
chr6 157609467 157641300 31833
chr6 157691300 167942073 10250773
chr6 168042073 170279972 2237899
chr6 170329972 171055067 725095
chr7 10000 232484 222484
chr7 282484 50370631 50088147
chr7 50410631 58054331 7643700
chr7 61054331 61310513 256182
chr7 61360513 61460465 99952
chr7 61510465 61677020 166555
chr7 61727020 61917157 190137
chr7 61967157 74715724 12748567
chr7 74765724 100556043 25790319
chr7 100606043 130154523 29548480
chr7 130254523 139379377 9124854
chr7 139404377 142048195 2643818
chr7 142098195 142276197 178002
chr7 142326197 143347897 1021700
chr7 143397897 154270634 10872737
chr7 154370634 159128663 4758029
chr8 10000 7474649 7464649
chr8 7524649 12091854 4567205
chr8 12141854 43838887 31697033
chr8 46838887 48130499 1291612
chr8 48135599 86576451 38440852
chr8 86726451 142766515 56040064
chr8 142816515 145332588 2516073
chr8 145432588 146304022 871434
chr9 10000 39663686 39653686
chr9 39713686 39974796 261110
chr9 40024796 40233029 208233
chr9 40283029 40425834 142805
chr9 40475834 40940341 464507
chr9 40990341 41143214 152873
chr9 41193214 41365793 172579
chr9 41415793 42613955 1198162
chr9 42663955 43213698 549743
chr9 43313698 43946569 632871
chr9 43996569 44676646 680077
chr9 44726646 44908293 181647
chr9 44958293 45250203 291910
chr9 45350203 45815521 465318
chr9 45865521 46216430 350909
chr9 46266430 46461039 194609
chr9 46561039 47060133 499094
chr9 47160133 47317679 157546
chr9 65467679 65918360 450681
chr9 65968360 66192215 223855
chr9 66242215 66404656 162441
chr9 66454656 66614195 159539
chr9 66664195 66863343 199148
chr9 66913343 67107834 194491
chr9 67207834 67366296 158462
chr9 67516296 67987998 471702
chr9 68137998 68514181 376183
chr9 68664181 68838946 174765
chr9 68988946 69278385 289439
chr9 69328385 70010542 682157
chr9 70060542 70218729 158187
chr9 70318729 70506535 187806
chr9 70556535 70735468 178933
chr9 70835468 92343416 21507948
chr9 92443416 92528796 85380
chr9 92678796 133073060 40394264
chr9 133223060 137041193 3818133
chr9 137091193 139166997 2075804
chr9 139216997 141153431 1936434
chrM 0 16571 16571
chrX 60000 94821 34821
chrX 144821 231384 86563
chrX 281384 1047557 766173
chrX 1097557 1134113 36556
chrX 1184113 1264234 80121
chrX 1314234 2068238 754004
chrX 2118238 7623882 5505644
chrX 7673882 10738674 3064792
chrX 10788674 37098256 26309582
chrX 37148256 49242997 12094741
chrX 49292997 49974173 681176
chrX 50024173 52395914 2371741
chrX 52445914 58582012 6136098
chrX 61682012 76653692 14971680
chrX 76703692 113517668 36813976
chrX 113567668 115682290 2114622
chrX 115732290 120013235 4280945
chrX 120063235 143507324 23444089
chrX 143557324 148906424 5349100
chrX 148956424 149032062 75638
chrX 149082062 152277099 3195037
chrX 152327099 155260560 2933461
chrY 10000 44821 34821
chrY 94821 181384 86563
chrY 231384 997557 766173
chrY 1047557 1084113 36556
chrY 1134113 1214234 80121
chrY 1264234 2018238 754004
chrY 2068238 8914955 6846717
chrY 8964955 9241322 276367
chrY 9291322 10104553 813231
chrY 13104553 13143954 39401
chrY 13193954 13748578 554624
chrY 13798578 20143885 6345307
chrY 20193885 22369679 2175794
chrY 22419679 23901428 1481749
chrY 23951428 28819361 4867933
chrY 58819361 58917656 98295
chrY 58967656 59363566 395910

Here is the source for my first-fit approach:

#!/usr/bin/env python
''' Using first-fit bin packing algorithm implementation from: https://bitbucket.org/kent37/python-tutor-samples/src/f657aeba5328/BinPacking.py '''
class Bin(object):
def __init__(self):
self.elements = []
self.size = 0
self.index = 0
def append(self, element_key, element_value):
self.elements.append(element_key)
self.size += int(element_value)
def __len__(self):
len(self.elements)
def __str__(self):
return 'Bin(size=%d\n\tnumber_of_elements=%d\n\telements=%s)' % (self.size, int(len(self.elements)), self.elements)
def pack(elements, maximum_bin_size):
bins = []
bin_count = 0
for element in elements:
elem_range = element[0]
elem_size = element[1]
for bin in bins:
if bin.size + elem_size <= maximum_bin_size:
bin.append(elem_range, elem_size)
break
else:
bin = Bin()
bin.append(elem_range, elem_size)
bin.index = bin_count
bins.append(bin)
bin_count += 1
return bins
if __name__ == '__main__':
import sys
import numpy
import random
elements = {}
bins = int(sys.argv[2])
fudge = float(sys.argv[3])
with open(sys.argv[1]) as f:
for line in f:
(chr, start, stop, size) = line.split()
key = chr + ":" + start + "-" + stop
value = int(size)
elements[key] = value
sorted_elements = sorted(elements.items(), key=lambda x: x[1], reverse=True)
largest_element_size = sorted_elements[0][1]
maximum_per_bin_size = int(numpy.ceil(largest_element_size / bins / fudge) * fudge)
print '---------------------------------------------------'
packed_bins = pack(sorted_elements, maximum_per_bin_size)
print 'Via sorted elements -- using', len(packed_bins), 'of', bins, 'bins ( max_size:', maximum_per_bin_size,'):'
bin_sizes = numpy.arange(len(packed_bins))
for bin in packed_bins:
print bin
bin_sizes[bin.index] = bin.size
print 'Size std dev --', numpy.std(bin_sizes)
print '---------------------------------------------------'
random_elements = sorted(elements.items(), key=lambda x: random.random())
packed_bins = pack(random_elements, maximum_per_bin_size)
print 'Via randomly shuffled elements -- using', len(packed_bins), 'of', bins, 'bins ( max_size:', maximum_per_bin_size,'):'
bin_sizes = numpy.arange(len(packed_bins))
for bin in packed_bins:
print bin
bin_sizes[bin.index] = bin.size
print 'Size std dev --', numpy.std(bin_sizes)
view raw first_fit.py hosted with ❤ by GitHub

And here is a result from one trial run:

On repeated trials, I seem to get more evenly distributed coverage of regions over the nodes when using a randomly shuffled dataset, as compared with bin packing on the descending-sort dataset.

I also get less "waste" per bin with random shuffling, i.e. a smaller standard deviation of total bases across bins.

This might be a jumping point to adding more statistics and doing more trials, to see how random shuffling performs, on average, against sorted data. And there are numerous ways to clean up my Python script and make things more "Pythonic", I'm certain.

Some thoughts: It should be possible to get a smoother distribution of elements per bin/node — and accordingly a smaller standard deviation of bases over bins — by putting a maximum size on a region that is smaller than the largest element. One would then split elements of size larger than this limit. The smaller the limit, the more disjoint elements, but I think it should be easier to pack regions into a bin more evenly.

Interesting problem. I am interested to see how others tackle it.

ADD COMMENT
0
Entering edit mode

+1 for providing bed files.

ADD REPLY
0
Entering edit mode

thanks for this solution. i was not thinking of using less than X cores, but that is easily remedied.

ADD REPLY
2
Entering edit mode
11.7 years ago

I know this is overkill for this particular challenge but here is how I do this in practice for variant calling parallelization:

  • Identify non-callable regions (no coverage or NNN regions) in each sample.
  • Group regions for all samples in a population called jointly to identify shared non-callable regions.
  • Pick non-callable regions evenly spaced across the genome.
  • Split into regions bounded by these non-callable regions.
  • Parallelize work in those relatively uniform regions.

I make heavy use of pybedtools/bedtools to handle merging all the intervals:

https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/bam/callable.py

ADD COMMENT
0
Entering edit mode

i take it the regions are so small and there are so many of them you simply don't need to worry about load-balancing?

ADD REPLY
0
Entering edit mode

I do load balance by picking regions to split at that are evenly spaced. It's definitely more a heuristic approach: divide the genome size by the number of blocks you want, then only include split positions that are far enough from the last split:

https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/bam/callable.py#L167

Since I'm not evenly spllit anyway because of coverage requirement, this works decently to avoid the very large regions (or lots of tiny regions).

ADD REPLY
1
Entering edit mode

I think Jeremy is asking if the frequency of assembly gaps is high enough that you can always find one close enough to your desired split point.

ADD REPLY
0
Entering edit mode

Ryan -- makes sense. I'm starting with the additional constraint that split points need to satisfy biological criteria (no coverage) so agree that my approach won't match the optimal splitting allowing breakpoints anywhere. It tries to do the best load-balancing it can under those conditions.

ADD REPLY

Login before adding your answer.

Traffic: 1263 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6