I have a very basic question about bedtools and bedops. Can I use these tools to filter all the regions where the score is higher (or lower) than a given value?
For example, let's say that I have a BED file like the following:
chr7 127471196 127472363 Pos1 12 + 127471196 127472363 255,0,0
chr7 127472363 127473530 Pos2 200 + 127472363 127473530 255,0,0
chr7 127473530 127474697 Pos3 120 + 127473530 127474697 255,0,0
chr7 127474697 127475864 Pos4 54 + 127474697 127475864 255,0,0
chr7 127475864 127477031 Neg1 2 - 127475864 127477031 0,0,255
chr7 127477031 127478198 Neg2 15 - 127477031 127478198 0,0,255
chr7 127478198 127479365 Neg3 25 - 127478198 127479365 0,0,255
chr7 127479365 127480532 Pos5 2 + 127479365 127480532 255,0,0
chr7 127480532 127481699 Neg4 9 - 127480532 127481699 0,0,255
According to the BED format's specs, the fifth column contains a score, between 0 and 1000 (alternatively, in the bedGraph format the score is on the 4th position). If I want to get all the regions that have a score higher than 20, for example, I can do an awk search:
$: awk '$5 > 20 {print}' mybedfile.bed
However, in order to use awk, I have to keep the BED file in a uncompressed format. It would be much better if I could use the .starch format in Bedops, or if I could combine any Bedops/Bedtools operation with the score search (e.g. get all scores that overlap a region and are higher than a value).
I don't get it, why can't you just do: zcat file.bed.gz | awk '$5 > 20 {print}' |bgzip -c > out.gz
mainly two reasons: 1) I want to try the .starch format in BEDOPS, without having to decompress every time; 2) I wonder why there are options to calculate the mean, max scores for a region without having to decompress the file, and not for getting values higher than a threshold.
It seems that the program merely calls burrows-wheeler zip or normal gzip. So internally it will decompress. For 2) I would use tabix indexed files and iterate over regions. I am not familiar enough with bedops to know whether it can do that.
We actually do quite a bit better than other algorithms by first pre-processing the BED data before compression:
We also offer per-chromosome indexing with the
--chrom
operator. I am working on modifying our Starch format that would extend our chromosome-indexing approach to support querying regions, but it remains to be seen whether this can be done without sacrificing too much of the compression efficiency improvements in the Starch approach.Part of the cost of pre-processing means having to re-process on extraction. Querying regions may require some overhead to store "signposts" that allow reconstruction of the original data without starting from the front end of each per-chromosome data stream. (Or we skip pre-processing and just naively compress sorted BED data, but that's not as interesting a problem.)