Crawling Over 1648 Million Line For Phastcons Scores
2
4
Entering edit mode
14.3 years ago
User 0968 ▴ 40

I got a file which contain phastCons for every position in cow. It's more than 1648 million lines. What's the best approach to the phastCons for the positions i want ?

file start with header and each line is like 0.043 which is the phastCons for positon (Start mention in header + steps*lineNumber)

• 2.6k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
5
Entering edit mode
14.3 years ago

If this file is something like http://hgdownload.cse.ucsc.edu/goldenPath/bosTau4/phastCons5way/, then to find the data on chr1 between 13772 and 23872 I would use something like the following awk script:

/^fixedStep/    {
    split($0,a,"[ \t]+");
    split(a[2],s2,"=");
    split(a[3],s3,"=");
    split(a[4],s4,"=");
    chrom=s2[2];
    start=int(s3[2]);
    step=int(s4[2]);
    pos=start-step;

    next;
    }

    {
    pos+=step;
    if(chrom=="chr1" && pos>=13772 && pos<=23872)
        {
        printf("%s %d\t%s\n",chrom,pos,$1);
        }
    }

Command line:

curl -s  http://hgdownload.cse.ucsc.edu/goldenPath/bosTau4/phastCons5way/file_0.data.gz |\
gunzip -c |\
awk -f ~/wig.awk
chr1 13902      0.007
chr1 13903      0.013
chr1 13904      0.017
chr1 13905      0.020
chr1 13906      0.022
chr1 13907      0.022
chr1 13908      0.022
chr1 13909      0.025
chr1 13910      0.027
chr1 13911      0.028
chr1 13912      0.027
chr1 13913      0.025
chr1 13914      0.023
chr1 13915      0.024
(...)

If speed is an issue, you can store this kind of data in a sorted binary file(or a key/value file) and access this data with a (e.g.) C program.

ADD COMMENT
5
Entering edit mode
14.3 years ago

These files look like UCSC fixedstep wiggle format files. "Crawling" suggests to me a stream-based approach, reading one line at a time in a scripting language and processing each data point, before discarding it. This is fine if you want to do this one time only. Your data are large enough that you don't want to make multiple passes through it.

If you need to make multiple passes, or to extract sub-regions quickly or repeatedly, then convert to an indexed binary format. Tools exist already to do this for wiggle files. Convert the file to bigwig format using the wigToBigWig program described in the bigwig paper. You will also need the chromosome and/or contig sizes for your data to do this. Then you can access intervals via the C API described in the paper, or there's a Perl wrapper, if Perl is your thing.

ADD COMMENT

Login before adding your answer.

Traffic: 1852 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6