Question

Programming Challenge: Divide The Human Genome Among X Cores, Taking Into Account Gaps

4

Entering edit mode

11.4 years ago

Jeremy Leipzig 22k

Develop a tool that divides the hg19 human genome (1-22XYM) to distribute among different cores or nodes. The goals should be:

divide the genome such that each core has to deal with a roughly equal number of eligible base pairs (see gap note below)
keep these intervals non-overlapping
keep these intervals as close to large contiguous blocks as possible, for 100 cores, you can certainly have 120 intervals (someone has to get MT), but 500 intervals would be too much
take account genomic assembly gaps which are all NNN . You can include the gaps in the intervals, but they do not add to the burden and therefore should not be considered in the size calculation. Here some some gaps from UCSC table browser -> mapping and sequencing tracks -> gap

bin chrom chromStart chromEnd ix n size type bridge

0 chr1 124535434 142535434 1271 N 18000000 heterochromatin no

23 chr1 121535434 124535434 1270 N 3000000 centromere no

76 chr1 3845268 3995268 47 N 150000 contig no

85 chr1 13219912 13319912 154 N 100000 contig no

89 chr1 17125658 17175658 196 N 50000 clone yes

(Here is a copy of that table for 1:22XY)

A carefully considered metric for evaluating the solution is:

Score = (Std dev eligible bp per core) * (Number of intervals) * (Execution time in seconds) * (Lines of Code)

Lowest score wins!

Looking forward to seeing your code!

programming • 5.7k views

ADD COMMENT • link updated 5.2 years ago by Biostar 20 • written 11.4 years ago by Jeremy Leipzig 22k

0

Entering edit mode

do you want the intervals to overlap ?

ADD REPLY • link updated 11.4 years ago by Jeremy Leipzig 22k • written 11.4 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

no, updated post

ADD REPLY • link 11.4 years ago by Jeremy Leipzig 22k

score 5 · Answer 1 · 2013-08-01

Here is my solution. I put my code on github https://github.com/lindenb/jvarkit#biostar77828 https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/biostar/Biostar77828.java (requires the picard library)

The makefile: I use bedtools to substract the gaps from hg19. I removed the "*hap" chromosomes.

all: tmp.result.txt

tmp.result.txt : dist/biostar77828.jar tmp3.bed
    java -jar dist/biostar77828.jar VERBOSITY=DEBUG N_ITERATIONS=10000000 MIN_CORE=20 MAX_CORE=30 < tmp3.bed > $@

dist/biostar77828.jar: src/main/java/com/github/lindenb/jvarkit/tools/biostar/Biostar77828.java
     ant biostar77828

tmp3.bed: tmp1.bed tmp2.bed
    /commun/data/packages/BEDTools-Version-2.16.2/bin/subtractBed -a tmp1.bed -b tmp2.bed > $@

tmp1.bed:
     curl  "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz" |\
         gunzip -c |\
        grep -v hap |\
        awk -F '    ' '{printf("%s\t0\t%s\n",$$1,$$2);}' |\
         LC_ALL=C sort -t '    ' -k1,1 -k2,2n -k3,3n > $@
tmp2.bed:
     curl  "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz" |\
         gunzip -c |\
        grep -v hap |\
        cut -d '    ' -f2,3,4 |\
         LC_ALL=C sort -t '    ' -k1,1 -k2,2n -k3,3n > $@

I use a strategy using a random generator of solution, looping for N generations and printing the best one:

        Solution best=null;
        for(long generation=0;generation< this.N_ITERATIONS;++generation)
            {
            Solution sol=createSolution();

            if(best==null || sol.compareTo(best)<0)
                {
                best=sol;
                }
            }

Creating a solution: a random number of 'Core' objects is created and filled with a random segment.

    int n_cores=
            MIN_CORE+(MIN_CORE>=MAX_CORE?0:this.random.nextInt(MAX_CORE-MIN_CORE))
    (...)
    Collections.shuffle(segments, this.random);
    (...)
    for(int i=0;i< n_cores && i< segments.size();++i)
        {
        //get last
        Core core=new Core();
        core.segments.add(segments.get(i));
        sol.cores.add(core);
        }

then each BED segment is added in a Core with the smallest variation of the mean number of bases

            for(Core core:sol.cores)
                {
                if(best==null ||
                    (Math.abs((core.length()+seg.size())-mean) < Math.abs((best.length()+seg.size())-mean)))
                    {
                    best=core;
                    }

                }
            best.segments.add(seg);

output:

https://gist.github.com/lindenb/6130880/#file-biostar77828-bed

Ram · Answer 2 · 2013-07-31

I haven't calculated scores, but I applied a first-fit algorithm on a regions-minus-gaps file generated with BEDOPS, where the regions are either sorted in descending order by size or shuffled randomly, before binning.

Here are the hg19 gaps I used:

Here are the hg19 regions ("extents") I used:

To generate the regions-minus-gaps file, adding size values:

$ bedops --difference hg19.extents.bed hg19.gaps.bed \
    | awk '{ print $0"\t"($3-$2) }' - \
    > hg19.gapped_extents.bed

Here is the regions-minus-gaps file that I get:

Here is the source for my first-fit approach:

And here is a result from one trial run:

On repeated trials, I seem to get more evenly distributed coverage of regions over the nodes when using a randomly shuffled dataset, as compared with bin packing on the descending-sort dataset.

I also get less "waste" per bin with random shuffling, i.e. a smaller standard deviation of total bases across bins.

This might be a jumping point to adding more statistics and doing more trials, to see how random shuffling performs, on average, against sorted data. And there are numerous ways to clean up my Python script and make things more "Pythonic", I'm certain.

Some thoughts: It should be possible to get a smoother distribution of elements per bin/node — and accordingly a smaller standard deviation of bases over bins — by putting a maximum size on a region that is smaller than the largest element. One would then split elements of size larger than this limit. The smaller the limit, the more disjoint elements, but I think it should be easier to pack regions into a bin more evenly.

Interesting problem. I am interested to see how others tackle it.

score 2 · Answer 3 · 2013-07-31

2

Entering edit mode

11.4 years ago

Brad Chapman 9.7k

I know this is overkill for this particular challenge but here is how I do this in practice for variant calling parallelization:

Identify non-callable regions (no coverage or NNN regions) in each sample.
Group regions for all samples in a population called jointly to identify shared non-callable regions.
Pick non-callable regions evenly spaced across the genome.
Split into regions bounded by these non-callable regions.
Parallelize work in those relatively uniform regions.

I make heavy use of pybedtools/bedtools to handle merging all the intervals:

https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/bam/callable.py

ADD COMMENT • link 11.4 years ago by Brad Chapman 9.7k

0

Entering edit mode

i take it the regions are so small and there are so many of them you simply don't need to worry about load-balancing?

ADD REPLY • link 11.4 years ago by Jeremy Leipzig 22k

0

Entering edit mode

I do load balance by picking regions to split at that are evenly spaced. It's definitely more a heuristic approach: divide the genome size by the number of blocks you want, then only include split positions that are far enough from the last split:

https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/bam/callable.py#L167

Since I'm not evenly spllit anyway because of coverage requirement, this works decently to avoid the very large regions (or lots of tiny regions).

ADD REPLY • link 11.4 years ago by Brad Chapman 9.7k

1

Entering edit mode

I think Jeremy is asking if the frequency of assembly gaps is high enough that you can always find one close enough to your desired split point.

ADD REPLY • link 11.4 years ago by Ryan Thompson ★ 3.6k

0

Entering edit mode

Ryan -- makes sense. I'm starting with the additional constraint that split points need to satisfy biological criteria (no coverage) so agree that my approach won't match the optimal splitting allowing breakpoints anywhere. It tries to do the best load-balancing it can under those conditions.

ADD REPLY • link 11.4 years ago by Brad Chapman 9.7k