Hello everyone,
I feel as though I'm having a conceptual issue when it comes to processing ChIP-seq data. I have a list of BED files that I want to convert into bedGraph and then into bigWig for visualization (as recommended in this post: converting Bed to Wig).
I have downloaded the BED files from this GEO dataset: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE117864
These peaks were called using HOMER's findPeaks, and the data had to be wrangled into the proper format:
chr1 237654 237866 23.5 +
chr1 714102 714314 51 +
chr1 793377 793589 90.5 +
chr1 805182 805394 211.5 +
The columns are chrom, chromStart, chromEnd, score, and strand, respectively. The score is the HOMER findPeaks score normalized to input.
When I go to convert these BED files into bedGraph using genomecov with the following commands:
fetchChromSizes hg19 > mychrom.sizes
bedtools genomecov -i input.bed -g mychrom.sizes -bg > output.bedGraph
I get the following coverage output:
chr1 237654 237866 1
chr1 714102 714314 1
chr1 793377 793589 1
chr1 805182 805394 1
What exactly happened here? I thought that genomecov was supposed to summarize feature coverage, which in this case would be the findPeaks score from HOMER (https://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html); however, the output is seemingly all 1s.
Did I do something incorrect? Why did I get this output from genomecov? All I want to do is convert BED into bedGraph and then into bigWig.
Thanks for your time.
Why do you feel you need to turn the peaks into a wig file? To answer your direct question first: genomecov counts how many intervals in the input overlap each position of the genome, and it ignores the score column entirely. Since your peaks do not overlap each other, every covered base has a depth of exactly 1, which is the output you are seeing.

(Big)Wigs are usually generated to have a more efficient way of representing genome-wide coverage based on millions of reads. Those short reads are usually stored in BAM files (millions and millions of rows, one per read, i.e. the size of the BAM file will depend on the number of reads in your experiment), which can become unwieldy. To have a more efficient representation, we can chunk up the genome into equal-size bins and just calculate how many reads per bin we have (this is essentially what the bedtools command above would do). This will, in most cases, generate a smaller file, although the number of rows in the final file now depends on the bin size, i.e. if you have 1 bp bins, the number of rows will correspond to the number of bp in the genome; if you have 100 bp bins, it will be [genome size / 100], and so on.
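If you do have the aligned reads, the typical route is BAM → bedGraph → bigWig. A minimal sketch, assuming a coordinate-sorted sample.bam (the file names here are placeholders) and the UCSC bedGraphToBigWig utility on your PATH:

# per-base read coverage, merged into runs of equal depth
bedtools genomecov -ibam sample.bam -bg > coverage.bedGraph
# output from a sorted BAM is already position-sorted, so convert directly
bedGraphToBigWig coverage.bedGraph mychrom.sizes coverage.bw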
Have a look at the schema here that illustrates the relationships between the different file types.
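That said, if you really just want your existing peak scores in bigWig form, your file is already almost a bedGraph: the first four columns (chrom, start, end, score) are exactly the bedGraph format, so there is no need to recompute anything with genomecov. A minimal sketch, assuming the peaks do not overlap (bedGraphToBigWig rejects overlapping intervals) and using the input.bed and mychrom.sizes files from your post:

# keep chrom/start/end/score, sort by position, then convert
cut -f1-4 input.bed | sort -k1,1 -k2,2n > peaks.bedGraph
bedGraphToBigWig peaks.bedGraph mychrom.sizes peaks.bw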