The BEDOPS suite includes bedmap
, which performs statistical calculations on BED elements. For your question, the --bases
operator returns the number of bases of overlap between the reference element and mapping elements. If we use the input BED file for both reference and mapping, then we simply get back the number of bases that make up each element.
For example, let's say we have the following BED file:
chr2 160 210 id-2 4
chr2 220 490 id-3 10
We can run bedmap
on this file to get the number of bases in each element:
$ bedmap --echo --bases --delim '\t' test.bed
chr2 160 210 id-2 4 50
chr2 220 390 id-3 10 170
We can pipe to awk
or any other downstream tool for filtering:
$ bedmap --echo --bases --delim '\t' test.bed | awk '{if ($6 < 200) print}' -
chr2 160 210 id-2 4 50
This result can be piped to wc -l
to get the number of filtered elements:
$ bedmap --echo --bases --delim '\t' test.bed | awk '{if ($6 < 200) print $0}' - | wc -l
1
Or we can pipe to cut
to get back filtered elements:
$ bedmap --echo --bases --delim '\t' test.bed | awk '{if ($6 < 200) print}' - | cut -f1-5
chr2 160 210 id-2 4
In all, this seems a more complicated approach than what you use for your example case. However, BEDOPS bedmap
can perform many statistical operations on BED elements, and so you might find it useful for other applications.
BEDOPS also has a very low memory footprint and scales efficiently to very large genomic datasets, whereas it can be easier to run into memory, performance and scaling issues with using R with large datasets.
+1 for a very nice R tutorial link
thx! very nice R tutorial!