As the title says. There are no overlapping regions in my BED file. Thanks.
You can do it with the following command line:
cat file.bed | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
Actually, BED files use 0-based start positions, and the end position is not included (see the UCSC format description: http://www.genome.ucsc.edu/FAQ/FAQformat.html#format1). Hence a BED interval 5-10 corresponds to positions 6-10 on the genome, which is 10 - 5 = 5 bases, so there is no need to add 1.
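To make the coordinate convention concrete, here is a small Python sketch. The function names are only illustrative (they do not come from any library); it just shows why end minus start already gives the right count:

```python
def bed_length(start, end):
    """Number of bases covered by a BED interval (0-based start, end excluded)."""
    return end - start


def bed_to_one_based(start, end):
    """Convert a BED interval to 1-based, inclusive coordinates."""
    return start + 1, end


# BED interval 5-10 covers genome positions 6..10, i.e. 5 bases.
print(bed_length(5, 10))        # 5
print(bed_to_one_based(5, 10))  # (6, 10)
```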
Given sorted input BED files A, B, C, etc., an input BED file that defines the bounds of chromosomes for the organism (e.g. hg19.extents.bed), and BEDOPS 2.4.1 (or greater), you could do something like the following:
bedops --merge A B C ... | bedmap --echo --echo-map-size hg19.extents.bed - > answer.bed
If you just have one BED file, then the following will merge overlapping regions in one file, so as to calculate unique base length:
bedops --merge A | bedmap --echo --echo-map-size hg19.extents.bed - > answer.bed
See the BEDOPS docs on merging (bedops --merge) and mapping (bedmap) for more detail.
Or you can pipe merged data into the aforementioned awk statement:
bedops --merge A | awk ...
But the bedops | bedmap pipeline preserves chromosome names and extent data.
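If you want to see what "merge, then sum" amounts to without BEDOPS, here is a rough Python sketch. It is not how bedops is implemented; the function name and the tuple layout (chrom, start, end) are assumptions made for illustration:

```python
from collections import defaultdict


def merged_length(intervals):
    """Sum the unique bases covered by (chrom, start, end) BED intervals."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    total = 0
    for spans in by_chrom.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlapping or touching: extend the current merged region.
                cur_end = max(cur_end, end)
            else:
                # Gap found: close the current region and start a new one.
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


regions = [("chr1", 0, 10), ("chr1", 5, 20), ("chr2", 100, 150)]
print(merged_length(regions))  # 70
```

Here chr1 collapses to a single region 0-20 (20 bases) and chr2 contributes 50, whereas a plain sum of end minus start would give 75.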
awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}' file.bed
It works correctly.
In my opinion, some BED files contain regions on both the forward and reverse strands.
In addition, some BED files contain regions that overlap each other for a particular purpose.
Regions that are covered more than once should be combined before calculating the total number of bases.
library(GenomicRanges)
library(dplyr)
library(data.table)
bed <- "/path/to/your/bed_file.bed"
bed.df <- fread(bed) %>% as.data.frame()
bed.gr <- makeGRangesFromDataFrame(
df = bed.df, keep.extra.columns = TRUE,
ignore.strand = FALSE,
starts.in.df.are.0based = TRUE
)
# Total bases before combining:
width(bed.gr) %>% sum()
# Combine overlapping regions (reduce() is strand-aware by default;
# pass ignore.strand = TRUE to merge regions across strands):
bed.gr.reduced <- reduce(bed.gr)
width(bed.gr.reduced) %>% sum()
import os

# FS and get_arguments are the poster's own helpers (not shown here).
def main():
    fs = FS()
    args = get_arguments()
    simple_length = 0
    positions = set()
    for line in fs.read2list(args.bed_file):
        fields = line.split("\t")
        chromosome = fields[0]
        # Shift to 1-based positions; the range below covers end - start bases.
        start_pos = int(fields[1]) + 1
        end_pos = int(fields[2]) + 1
        simple_length += end_pos - start_pos
        # Record every covered base so that overlapping bases are counted once.
        for i in range(start_pos, end_pos):
            positions.add(chromosome + ":" + str(i))
    if args.full_output:
        print("Size of " + os.path.basename(args.bed_file) + ":")
        print('{:,}'.format(len(positions)))
        if len(positions) == simple_length:
            print("There are NO overlap regions in the bed file!")
        else:
            print("There are OVERLAP regions in the bed file!")
    else:
        print(len(positions))
    fs.close()
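Storing every base in a set can use a lot of memory on large BED files. As a possible alternative for the overlap check only, here is a sketch that sorts the intervals and compares neighbours; the function name and the (chrom, start, end) tuple input are assumptions for illustration, not part of the script above:

```python
def has_overlap(intervals):
    """Return True if any two (chrom, start, end) BED intervals share a base."""
    prev_chrom, prev_end = None, 0
    for chrom, start, end in sorted(intervals):
        # Ends are excluded in BED, so touching intervals do not overlap.
        if chrom == prev_chrom and start < prev_end:
            return True
        if chrom != prev_chrom:
            prev_chrom, prev_end = chrom, end
        else:
            prev_end = max(prev_end, end)
    return False


print(has_overlap([("chr1", 0, 10), ("chr1", 10, 20)]))  # False
print(has_overlap([("chr1", 0, 10), ("chr1", 9, 20)]))   # True
```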
Can you post the head of your BED file? Most of them have a start and an end coordinate, so you could just subtract start from end and sum everything!
Do your BED data contain overlapping regions, or are your regions disjoint? If the latter, or if you don't care if regions overlap, then a basic awk statement as shown in one answer will suffice. Otherwise, let us know and I'll suggest another method that accounts for both cases.