I was wondering if it was possible to convert/collapse bedgraph format files to BED format.
I wish to create venn diagrams for chip-seq data.
For my 1st dataset after running MACS, I get both BED and Bedgraph format data (i changed option to get bedgraph instead of wig). However, for my 2nd dataset I only have the data in BED format.
The bedgraph files look nice for visualization on ucsc genome browser because it shows peaks but the R package ChIPpeakAnno requires BED files.
BedGraph has 4 fields (chr start stop score), and the simplest BED file has 3 fields (chr start stop), so one easy way to convert from BED to bedGraph would be to use awk:
Or you can use tail to return everything but the first line (if your track line is the first line). If you want to add more fields to the bed file, you can insert them in the print statement (I don't remember the minimum requirement for ChIPpeakAnno).
It's not clear that a simple format conversion is what you want . . .
In your question, you ask about "collapsing" a bedGraph file. This isn't trivial, and this problem is exactly what peak-callers like MACS solve.
The bedGraph format is typically used to visualize the genome-wide "signal". If it's output from a MACS run, the bedGraph will not be just the peaks, rather, it will show the coverage of all your reads shifted by d/2. It will have many, many more lines than a BED-format "peaks file" (try running wc -l on your files to confirm).
If your goal is to get a Venn diagram to compare peaks of two treatments, then comparing a BED file with a converted bedGraph file will not result in a meaningful diagram. Instead, you'll need to re-run a peak caller to get a BED file of peaks for each treatment.
what do you mean shifted by d/2?
I know it will have many more lines than the corresponding 'peaks' file but i thought it was odd that MACS would generate a bedgraph file when I could have made a bedgraph file from the bam file that I inputed into MACS
Check out Fig 1 in the MACS paper [1] and Fig 1 in the spp paper [2] -- for ChIP-seq reads, you end up with a pileup of reads on either side of a putative binding site. If you were to generate a bedGraph out of the BAM, you'd have these double peaks everywhere. MACS computes the average distance between peaks and shifts them toward each other so that the final bedGraph has peaks right where the binding sites should be.
I have encountered a strange problem now. I am using bedgraph files to look at things on UCSC and when I go from sorted_bam -> bedgraph (900k lines) but when i go sorted_bam -> MACS -> bedgraph (1000k lines) the file that MACS gives me is bigger and also outside of the coordinates (UCSC gives out of bounds errors) even though if I look at beginning of the SAM file the chr_size is correct (@SQ lines), maybe macs mailing list is a better place for this question...
MACS also extends by expected fragment size; not sure if it pays attention to BAM @SQ lengths when doing so. UCSC will generally tell you what line the problem is on -- so you can open the file and delete from that line to the end of the chrom and try again. MACS mailing list is pretty responsive for these sorts of questions.
cut -f 1-3
is your friend too.