Question

Non-Redundant List Of Grch37 Coordinates Covered By Segmental Duplications

0


13.8 years ago

Dgmacarthur ▴ 310

Hi all,

Hopefully an easy one: I'm looking for a file (e.g. in BED format) containing the coordinates of every base in the human build 37 genome that is covered by a segmental duplication.

I've downloaded the full set of seg dups from http://humanparalogy.gs.washington.edu/build37/build37.htm but these appear to contain a redundant set of all pairwise locations of segmental duplications. I could write some code to merge these, but has anyone already generated a non-redundant file that simply tells me whether a given GRCh37 base is in fact spanned by a seg dup?

human • 4.7k views

updated 13.8 years ago by Pierre Lindenbaum 166k • written 13.8 years ago by Dgmacarthur ▴ 310

1


Please note that different databases may give you vastly different results. The first question to ask is "which is the most accurate" instead of "which is the most convenient". Merging overlapping regions in a BED is extremely easy. You can use bedtools, or just one line of awk.

13.8 years ago by lh3 33k
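For readers who would rather not depend on external tools, the merge mentioned in the comment above can be sketched in a few lines of C++. This is only an illustration, not lh3's awk one-liner or the bedtools implementation: the `Interval` alias and the `mergeIntervals` name are made up for this example, intervals are assumed to be half-open BED-style `[start, end)` pairs on a single chromosome, and directly adjacent intervals are merged as well as overlapping ones.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical alias: half-open interval [start, end), as in BED.
using Interval = std::pair<long, long>;

// Merge overlapping or directly adjacent intervals from ONE chromosome.
// The vector is taken by value so the caller's copy is left untouched.
std::vector<Interval> mergeIntervals(std::vector<Interval> ivs)
    {
    std::sort(ivs.begin(), ivs.end());   // sort by start, then by end
    std::vector<Interval> merged;
    for (const Interval& iv : ivs)
        {
        assert(iv.first <= iv.second);   // start must not exceed end
        if (!merged.empty() && iv.first <= merged.back().second)
            {
            // Overlaps or touches the previous block: extend that block.
            merged.back().second = std::max(merged.back().second, iv.second);
            }
        else
            {
            merged.push_back(iv);        // starts a new non-redundant block
            }
        }
    return merged;
    }
```

In practice one would read the three BED columns, group the intervals per chromosome, and call `mergeIntervals` once per group; the result is the non-redundant set of blocks covered by at least one record.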

Ram · Answer 1 · 2011-10-05

4


13.8 years ago

Casey Bergman 18k

Is this what you are looking for: http://genome.ucsc.edu/cgi-bin/hgTrackUi?g=genomicSuperDups

If you want BED format, use the Table Browser: click this link, select BED from the "output format" dropdown menu, click "get output", and then click "get BED" on the next page.

updated 5.9 years ago by Ram 45k • written 13.8 years ago by Casey Bergman 18k

1


Just a note that if you do this from Galaxy, you can then merge the overlapping BED records and get the unique BED regions covered by at least one segmental duplication.

13.8 years ago by Sean Davis 27k
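Once the records on a chromosome have been reduced to sorted, non-overlapping regions, the original question ("is this base spanned by a seg dup?") becomes a binary search. The sketch below is one possible way to do that lookup, not something taken from Galaxy or UCSC: `Interval` and `isCovered` are hypothetical names, positions are assumed to be 0-based, and intervals are assumed to be half-open `[start, end)` as in BED.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical alias: half-open interval [start, end), as in BED.
using Interval = std::pair<long, long>;

// Return true if the 0-based position 'pos' lies inside one of the intervals.
// Precondition: 'merged' is sorted by start and its intervals do not overlap.
bool isCovered(const std::vector<Interval>& merged, long pos)
    {
    assert(std::is_sorted(merged.begin(), merged.end()));
    // Find the first interval whose start is strictly greater than pos.
    auto it = std::upper_bound(merged.begin(), merged.end(), pos,
        [](long p, const Interval& iv) { return p < iv.first; });
    if (it == merged.begin()) return false;   // every interval starts after pos
    --it;                                     // last interval starting at or before pos
    return pos < it->second;                  // inside only if pos is before its end
    }
```

Each lookup costs O(log n) in the number of merged regions, which is usually cheaper in memory than one flag per base when only a handful of positions need checking.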

score 1 · Answer 2 · 2011-10-05

The following C++ should provide a true/false WIG file:

#include <iostream>
#include <vector>
#include <string>
#include <cstdlib>

using namespace std;

static void wig(string& chrom, vector<bool>& bits)
    {
    size_t i=0;
    if(chrom.empty()) return;
    while(i< bits.size() && bits[i]==false) i++;
    /* WIG fixedStep positions are 1-based while BED chromStart is 0-based;
       "start" below is the 0-based index, so add 1 for a strictly valid WIG */
    cout << "fixedStep chrom="<< chrom <<" start="<<i<<" step=1 span=1" << endl;

    while(i< bits.size())
        {
        cout << (int)bits[i++]<< endl;
        }
    }

int main(int argc,char** argv)
    {
    vector<bool> bits;
    string chrom;
    string line;
    while(getline(cin,line,'\n'))
        {
        if(line.empty() or line[0]=='#') continue;
        string::size_type n1=line.find('\t',0);
        if(n1==0 || n1==string::npos) continue;
        string::size_type n2=line.find('\t',n1+1);
        if(n2==string::npos) continue;
        string s=line.substr(0,n1);
        /* a new chromosome name flushes the previous one: input must be grouped by chrom */
        if(s.compare(chrom)!=0)
            {
            wig(chrom,bits);
            bits.clear();
            chrom=s;
            }
        /* length n2-n1 deliberately keeps the trailing tab so the *p2 check below can see it */
        s=line.substr(n1+1,n2-n1);
        char* p2;
        long chromStart=strtol(s.c_str(),&p2,10);

        if(chromStart< 0 || *p2!='\t')
            {
            cerr << "bad start in " << s << endl;   
            continue;
            }
        s=line.substr(n2+1);

        long chromEnd=strtol(s.c_str(),&p2,10);
        if(chromEnd< chromStart || (*p2!=0 && *p2!='\t'))
            {
            cerr << "bad end in " << s << endl; 
            continue;
            }

        if(bits.size()<=(size_t)chromEnd)
            {
            bits.resize(chromEnd,false);
            }
        while(chromStart<chromEnd)
            {
            bits[(size_t)chromStart]=true;
            ++chromStart;
            }
        }
    wig(chrom,bits);
    return 0;
    }

test:

$ g++ jeter.cpp
$ mysql  --user=genome -N --host=genome-mysql.cse.ucsc.edu -A   -D hg19 -e "select chrom,chromStart,chromEnd from genomicSuperDups where chromEnd <100000 " |\
./a.out  |\
grep chrom -A 2 -B 2

fixedStep chrom=chr11_gl000202_random start=0 step=1 span=1
1
1
--
1
1
fixedStep chrom=chr17_gl000203_random start=8 step=1 span=1
1
1
--
1
1
fixedStep chrom=chr17_gl000205_random start=0 step=1 span=1
1
1
--
1
1
fixedStep chrom=chr1_gl000192_random start=1270 step=1 span=1
1