wig2bed and counting the reads in the peaks
Bogdan ★ 1.4k · 2.1 years ago

Dear all,

I would appreciate your suggestions on an analysis that I am performing.

I have several wig files (A, B, C, D, ...) and a list of peaks (P).

The goal is to count the reads in the peaks listed in P in such a way that I can use the results for a differential binding analysis (with DiffBind, edgeR, DESeq2, etc.).

I have used the following two commands:

<1> wig2bed (bedops package) to convert the wig files to BED:

wig2bed -x < "${FILE%.bigwig}.wig"  > "${FILE%.bigwig}.bed" 

<2> coverageBed -counts (bedtools package) to count overlaps with the peaks:

coverageBed -counts -a PEAKS -b "${FILE%.bigwig}.bed"

The question is: what is the proper way to compute the counts in the peaks (P), given that wig2bed produces a BED file with 5 fields (chr, start, end, ID, score), as shown below? More specifically, how should I convert the scores (column 5) into peak intensities (counts)?

1       794250  794350  id-301  1.533330
1       794350  794400  id-302  3.066660
1       794400  794450  id-303  1.533330
1       794450  795800  id-304  0.000000
1       795800  795900  id-305  1.533330
1       795900  795950  id-306  0.000000
1       795950  796000  id-307  3.066660
1       796000  796350  id-308  1.533330
1       796350  796500  id-309  0.000000
1       796500  796550  id-310  1.533330
1       796550  796650  id-311  0.000000
1       796650  796700  id-312  1.533330
1       796700  796750  id-313  4.599990
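
For completeness, here is a minimal sketch of how the two steps above could be chained over several coverage files. The file names (A.wig, B.wig, ...) and the PEAKS path are placeholders, not the actual data.

for WIG in A.wig B.wig C.wig D.wig; do
    # step <1>: convert the wig track to BED (chr, start, end, ID, score)
    wig2bed -x < "$WIG" > "${WIG%.wig}.bed"
    # step <2>: for each peak in PEAKS, report the number of overlapping BED intervals
    coverageBed -counts -a PEAKS -b "${WIG%.wig}.bed" > "${WIG%.wig}.peak_counts.txt"
done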

Thank you.

Bogdan

bedtools ChIP-seq deeptools bedops
ATpoint 85k · 2.1 years ago

That information is lost. Look at this example:

|               ---- |
|  -------------     |
| -------------      |
|   -------------    |
|      ------------  |
|01233344444444332210|

The dashed horizontal lines are reads, and the numbers indicate the coverage within that peak, so basically a bedGraph/wig-like format. The correct count for that peak would be 5 ("correct" in the sense that tools like featureCounts would return it), since the peak intersects 5 reads, but there is no way to know that it is 5 just by looking at the coverage values. The maximum of the coverage would be 4, not 5; the average would not be 5 either, nor would the median. Depending on read length, read distribution within the peak, and soft-clipping of reads by the aligner, any estimate based on coverage could be wildly off. I personally would not start from track files for a differential analysis; there is far too much uncertainty. In the end you need the raw data anyway, for publication and to ensure reproducibility of the entire analysis from fastq to figures.
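
For comparison, if the raw alignments (BAM files) were available, a count-based route could look like the sketch below. This is only an illustration: PEAKS.bed is assumed to be a plain 3-column BED file, and sampleA.bam / sampleB.bam are placeholder names.

# build an SAF annotation from the peaks (SAF is 1-based, BED starts are 0-based)
awk 'BEGIN{OFS="\t"; print "GeneID","Chr","Start","End","Strand"}
     {print "peak"NR, $1, $2+1, $3, "."}' PEAKS.bed > peaks.saf

# count reads per peak directly from the alignments
featureCounts -F SAF -a peaks.saf -o peak_counts.txt sampleA.bam sampleB.bam

The resulting count matrix can then go straight into edgeR or DESeq2.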

If you google, you may find this protocol (the bigwig-based approach referenced in the reply below), but it is still only an estimate: the read length is assumed to be known and constant, and depending on the trimming of the data and the soft-clipping by the aligner, I would expect the results to be approximate at best. I have not read the linked approach in full detail, nor tested it, so take it with a grain of salt.
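
As a rough illustration of that kind of coverage-based estimate (not the linked protocol itself), one could approximate the count per peak as the area under the coverage divided by an assumed, constant read length. PEAKS.bed (3-column peaks), coverage.bed (the 5-column wig2bed output), and READLEN are placeholders/assumptions:

READLEN=50   # assumed constant read length; a key source of error in such estimates

# area under the coverage within each peak = sum(overlap_bp * coverage_score)
bedtools intersect -a PEAKS.bed -b coverage.bed -wo \
  | awk -v rl="$READLEN" 'BEGIN{OFS="\t"}
      { key = $1"\t"$2"\t"$3; area[key] += $NF * $8 }   # overlap bp * score (assumes 3-column peaks)
      END { for (k in area) print k, area[k] / rl }' > approx_peak_counts.txt
# note: the output order is arbitrary; sort it afterwards if needed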


Thank you for the quick reply.

For this particular situation, I only have the coverage files available for the differential analysis. Thank you for pointing me to the resource http://lcolladotor.github.io/protocols/bigwig_DEanalysis/

As a side note, I was wondering if you folks have tested MACS bdgdiff or bedmap (bedops):

<1> does the MACS2 bdgdiff function give reliable results? (An example invocation is sketched after this list.)

https://pypi.org/project/MACS2/

<2> has anyone tested bedmap (bedops) with the following option (see the sketch after this list)?

--count : The number of overlapping elements in <map-file>

https://bedops.readthedocs.io/en/latest/content/reference/statistics/bedmap.html
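
Regarding <1>, a possible macs2 bdgdiff invocation is sketched below, purely as an illustration. It assumes per-condition treatment and control pileup bedGraphs (as produced by macs2 callpeak -B) are available; the file names and the --d1/--d2 depth values are placeholders.

# compare condition 1 vs condition 2 from their pileup tracks;
# bdgdiff writes three BED files: cond1-enriched, cond2-enriched, and common regions
macs2 bdgdiff \
    --t1 cond1_treat_pileup.bdg --c1 cond1_control_lambda.bdg \
    --t2 cond2_treat_pileup.bdg --c2 cond2_control_lambda.bdg \
    --d1 20000000 --d2 18000000 \
    -g 60 -l 120 --o-prefix cond1_vs_cond2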
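
Regarding <2>, a minimal bedmap --count sketch (bedmap expects sort-bed sorted inputs; PEAKS.bed and coverage.bed are placeholders). Keep in mind that with the wig2bed output as the map file, the overlapping elements being counted are coverage bins, not reads, so it runs into the same issue discussed above.

# bedops tools require sorted BED input
sort-bed PEAKS.bed > PEAKS.sorted.bed
sort-bed coverage.bed > coverage.sorted.bed

# for each peak, print the peak plus the number of overlapping map elements
bedmap --echo --count --delim '\t' PEAKS.sorted.bed coverage.sorted.bed > peak_element_counts.bed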

Thanks again!

Bogdan

