Programming Challenge: Divide The Human Genome Among X Cores, Taking Into Account Gaps

Develop a tool that divides the hg19 human genome (chromosomes 1-22, X, Y, M) into intervals to be distributed among different cores or nodes. The goals are:

  • divide the genome such that each core has to deal with a roughly equal number of eligible base pairs (see gap note below)
  • keep these intervals non-overlapping
  • keep these intervals as close to large contiguous blocks as possible; for 100 cores you can certainly have 120 intervals (someone has to get MT), but 500 intervals would be too many
  • take into account genomic assembly gaps, which are all runs of N. You can include the gaps in the intervals, but they do not add to the burden and therefore should not be counted in the size calculation. Here are some gaps from the UCSC Table Browser -> Mapping and Sequencing Tracks -> gap:

    bin  chrom  chromStart  chromEnd   ix    n  size      type             bridge
    0    chr1   124535434   142535434  1271  N  18000000  heterochromatin  no
    23   chr1   121535434   124535434  1270  N  3000000   centromere       no
    76   chr1   3845268     3995268    47    N  150000    contig           no
    85   chr1   13219912    13319912   154   N  100000    contig           no
    89   chr1   17125658    17175658   196   N  50000     clone            yes

(Here is a copy of that table for chromosomes 1-22, X, Y.)
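
To make the gap rule concrete, here is a minimal Python sketch (illustration only, not part of the required tool) of how the eligible size of an interval can be computed from the gap table above; coordinates are assumed to be 0-based half-open as in UCSC tables, and gaps are assumed not to overlap one another:

    def eligible_bases(start, end, gaps):
        """Eligible (non-gap) bases of the half-open interval [start, end).
        `gaps` is a list of (gapStart, gapEnd) tuples on the same chromosome."""
        total = end - start
        for gap_start, gap_end in gaps:
            overlap = min(end, gap_end) - max(start, gap_start)
            if overlap > 0:
                total -= overlap
        return total

    # chr1:120,000,000-125,000,000 overlaps the centromere and heterochromatin gaps above
    chr1_gaps = [(124535434, 142535434), (121535434, 124535434)]
    print(eligible_bases(120000000, 125000000, chr1_gaps))  # 1535434 eligible bp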

A carefully considered metric for evaluating the solution is:

Score = (Std dev eligible bp per core) * (Number of intervals) * (Execution time in seconds) * (Lines of Code)

Lowest score wins!
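
For reference, the score could be computed along these lines (a sketch only; the function name and example numbers are made up, and the population standard deviation is assumed):

    import statistics

    def score(eligible_bp_per_core, n_intervals, exec_seconds, lines_of_code):
        """Score = stddev(eligible bp per core) * intervals * seconds * LOC; lower is better."""
        return (statistics.pstdev(eligible_bp_per_core)
                * n_intervals * exec_seconds * lines_of_code)

    # Hypothetical run: 4 cores, 5 intervals, 2.5 s runtime, 80 lines of code
    print(score([7.1e8, 7.0e8, 7.2e8, 6.9e8], 5, 2.5, 80))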

Looking forward to seeing your code!


Do you want the intervals to overlap?


no, updated post


Here is my solution. I put my code on GitHub: https://github.com/lindenb/jvarkit#biostar77828 (source: https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/biostar/Biostar77828.java; requires the Picard library).

The makefile: I use bedtools to subtract the gaps from hg19. I removed the "*hap" chromosomes.

all: tmp.result.txt

tmp.result.txt: dist/biostar77828.jar tmp3.bed
    java -jar dist/biostar77828.jar VERBOSITY=DEBUG N_ITERATIONS=10000000 MIN_CORE=20 MAX_CORE=30 < tmp3.bed > $@

dist/biostar77828.jar: src/main/java/com/github/lindenb/jvarkit/tools/biostar/Biostar77828.java
    ant biostar77828

tmp3.bed: tmp1.bed tmp2.bed
    /commun/data/packages/BEDTools-Version-2.16.2/bin/subtractBed -a tmp1.bed -b tmp2.bed > $@

tmp1.bed:
    curl "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz" |\
        gunzip -c |\
        grep -v hap |\
        awk -F '\t' '{printf("%s\t0\t%s\n",$$1,$$2);}' |\
        LC_ALL=C sort -k1,1 -k2,2n -k3,3n > $@

tmp2.bed:
    curl "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz" |\
        gunzip -c |\
        grep -v hap |\
        cut -f2,3,4 |\
        LC_ALL=C sort -k1,1 -k2,2n -k3,3n > $@

I use a randomized strategy: generate a candidate solution at each iteration, loop for N generations, and print the best one:

        Solution best=null;
        for(long generation=0;generation< this.N_ITERATIONS;++generation)
            {
            Solution sol=createSolution();

            if(best==null || sol.compareTo(best)<0)
                {
                best=sol;
                }
            }

Creating a solution: a random number of 'Core' objects is created, and each one is seeded with a segment taken from the shuffled list of BED segments.

    int n_cores=
            MIN_CORE+(MIN_CORE>=MAX_CORE?0:this.random.nextInt(MAX_CORE-MIN_CORE));
    (...)
    Collections.shuffle(segments, this.random);
    (...)
    for(int i=0;i< n_cores && i< segments.size();++i)
        {
        // seed each core with one segment from the shuffled list
        Core core=new Core();
        core.segments.add(segments.get(i));
        sol.cores.add(core);
        }

then each remaining BED segment is added to the Core whose total size, after adding the segment, stays closest to the mean number of bases per core:

            // choose the core whose total size, after adding this segment,
            // stays closest to the mean number of bases per core
            for(Core core:sol.cores)
                {
                if(best==null ||
                    (Math.abs((core.length()+seg.size())-mean) < Math.abs((best.length()+seg.size())-mean)))
                    {
                    best=core;
                    }
                }
            best.segments.add(seg);

output:

https://gist.github.com/lindenb/6130880/#file-biostar77828-bed


I haven't calculated scores, but I applied a first-fit algorithm to a regions-minus-gaps file generated with BEDOPS, where the regions are either sorted in descending order by size or shuffled randomly before binning.

Here are the hg19 gaps I used:

Here are the hg19 regions ("extents") I used:

To generate the regions-minus-gaps file, adding size values:

$ bedops --difference hg19.extents.bed hg19.gaps.bed \
    | awk '{ print $0"\t"($3-$2) }' - \
    > hg19.gapped_extents.bed

Here is the regions-minus-gaps file that I get:

Here is the source for my first-fit approach:

And here is a result from one trial run:

On repeated trials, I seem to get more evenly distributed coverage of regions over the nodes when using a randomly shuffled dataset, as compared with bin packing on the descending-sort dataset.

I also get less "waste" per bin with random shuffling, i.e. a smaller standard deviation of total bases across bins.

This might be a jumping-off point for adding more statistics and doing more trials, to see how random shuffling performs, on average, against sorted data. And there are numerous ways to clean up my Python script and make things more "Pythonic", I'm certain.

Some thoughts: It should be possible to get a smoother distribution of elements per bin/node — and accordingly a smaller standard deviation of bases over bins — by capping region size at a value smaller than the largest element and splitting any element that exceeds the cap. The smaller the cap, the more disjoint elements you create, but packing the bins evenly becomes easier.
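
For illustration only (this is not the script linked above), a compact sketch of first-fit binning with an optional cap that splits oversized regions first; regions are assumed to be (chrom, start, end) tuples read from the gapped-extents file:

    import random

    def split_oversized(regions, cap):
        """Split any (chrom, start, end) region longer than `cap` into pieces of at most `cap` bp."""
        out = []
        for chrom, start, end in regions:
            while end - start > cap:
                out.append((chrom, start, start + cap))
                start += cap
            out.append((chrom, start, end))
        return out

    def first_fit(regions, n_bins, shuffle=True):
        """Place each region into the first bin whose running total stays under the target size."""
        regions = list(regions)
        if shuffle:
            random.shuffle(regions)
        else:
            # first-fit decreasing: largest regions first
            regions.sort(key=lambda r: r[2] - r[1], reverse=True)
        target = sum(end - start for _, start, end in regions) / n_bins
        bins = [{"regions": [], "total": 0} for _ in range(n_bins)]
        for chrom, start, end in regions:
            size = end - start
            dest = next((b for b in bins if b["total"] + size <= target), None)
            if dest is None:  # nothing fits under the target: fall back to the emptiest bin
                dest = min(bins, key=lambda b: b["total"])
            dest["regions"].append((chrom, start, end))
            dest["total"] += size
        return bins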

Interesting problem. I am interested to see how others tackle it.


+1 for providing bed files.


Thanks for this solution. I was not thinking of using fewer than X cores, but that is easily remedied.


I know this is overkill for this particular challenge but here is how I do this in practice for variant calling parallelization:

  • Identify non-callable regions (no coverage or NNN regions) in each sample.
  • Group regions for all samples in a population called jointly to identify shared non-callable regions.
  • Pick non-callable regions evenly spaced across the genome.
  • Split into regions bounded by these non-callable regions.
  • Parallelize work in those relatively uniform regions.

I make heavy use of pybedtools/bedtools to handle merging all the intervals:

https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/bam/callable.py
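
This isn't the code from callable.py, but a minimal pybedtools sketch of the first two steps, assuming one BED file of non-callable regions per sample (hypothetical filenames) and reading "shared" as non-callable in every sample:

    from functools import reduce
    import pybedtools

    # Hypothetical inputs: one BED of non-callable (no coverage or NNN) regions per sample
    sample_beds = ["sample1.noncallable.bed", "sample2.noncallable.bed", "sample3.noncallable.bed"]
    per_sample = [pybedtools.BedTool(f).sort().merge() for f in sample_beds]

    # Successive intersections keep only the portions shared by all samples,
    # i.e. regions that are non-callable across the whole jointly-called population.
    shared = reduce(lambda a, b: a.intersect(b), per_sample).sort().merge()
    shared.saveas("population.noncallable.bed")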


I take it the regions are so small, and there are so many of them, that you simply don't need to worry about load-balancing?


I do load balance by picking split regions that are evenly spaced. It's definitely more of a heuristic approach: divide the genome size by the number of blocks you want, then only include split positions that are far enough from the last split:

https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/bam/callable.py#L167

Since the splits aren't perfectly even anyway because of the coverage requirement, this works decently to avoid very large regions (or lots of tiny regions).
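
A rough sketch of that heuristic (not the code at the link above; per-chromosome handling and the coverage checks are left out), where a candidate split position is only accepted once it is at least genome_size / n_blocks past the previous split:

    def pick_split_points(candidate_positions, genome_size, n_blocks):
        """Keep only candidate splits that are at least one target block length apart."""
        min_spacing = genome_size / n_blocks
        splits, last = [], 0
        for pos in sorted(candidate_positions):
            if pos - last >= min_spacing:
                splits.append(pos)
                last = pos
        return splits

    # Toy example: aim for ~4 blocks over a 100 Mb genome
    print(pick_split_points([5e6, 20e6, 26e6, 55e6, 80e6, 99e6], 100e6, 4))
    # -> [26000000.0, 55000000.0, 80000000.0]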


I think Jeremy is asking if the frequency of assembly gaps is high enough that you can always find one close enough to your desired split point.


Ryan -- makes sense. I'm starting with the additional constraint that split points need to satisfy biological criteria (no coverage), so I agree that my approach won't match the optimal splitting that allows breakpoints anywhere. It tries to do the best load-balancing it can under those conditions.
