Count matrix from bigwig files - Application
1
0
Entering edit mode
5.2 years ago
pyKey ▴ 70

Hi All,

Does anyone know a software that can calculate gene count matrix from a bigwig file? Preferred Python.

Thanks,

bigwig gene expression seq python • 7.0k views
ADD COMMENT
2
Entering edit mode

ATpoint raised important issues here and I tend to agree with him. That being said, deeptools multiBigwigSummary used with the options --outRawCounts and --BED (with a GTF file) can summarize read counts on exons from bigwig files.

ADD REPLY
1
Entering edit mode

By best knowledge you cannot reliable do this. The bigwig format is a per-base or per-window format that summarizes coverage over these bases or windows but it is not suited to give exact counts over e.g. exons etc. That information you seek is lost.

ADD REPLY
0
Entering edit mode

Not sure if it is a good idea either, nevertheless, some people seem to do that Here . Just that, there is no code published.

ADD REPLY
0
Entering edit mode

I would never do this. Maybe if it was my own data, I was 100% sure how the bigwig was created and I had absolutely no other choice because the raw data were gone and I urgently needed that information, but else, no never. If you are going to do it anyway, treat whatever results you get with care and try to validate them with independent approaches.

ADD REPLY
0
Entering edit mode

Thank you a lot for this hint. I would do as you suggested.

ADD REPLY
1
Entering edit mode

Hi All,

So I ran multiBigwigSummary on the bigwig file. As follows:

  1. Prepared a BED file which contains just "gene" entries from the original gff file.
  2. Ran the above given this BED file.
  3. Got an outout BED with "gene" coordinates and a count column.
  4. Mapped the coordinates of each row to a single gene IDs from the gff file.

Then I checked the correlation between these counts and original expression data and it was around 0.7.

As a sanity check I also ran the same pipeline on a bigwig for a known file with already calculated gene expression levels. The correlation is around the same!

I wonder if there is a more reliable way to get gene expression from a BED file. Or is this information loss inherent?

Thank you,

ADD REPLY
0
Entering edit mode

Hi, Here are some thoughts about your problem:

With only the gene lines of a gtf, you'll also get also the introns. These are usually ignored while computing gene expression.

I don't know how deeptools summary works, but most gene counting tools rely only on uniquely mapping reads and discard ambiguous assignments. I guess it's hard to assess the ambiguity based on coverage rather than assigning a read by the portion of overlap to a certain feature.

BED files are generally not complex enough to model the multi-exon multi-transcript genes. You can convert a GTF/GFF file into BED12 format but loose the transcript-gene relationship.

ADD REPLY
2
Entering edit mode
5.2 years ago
pyKey ▴ 70

Update. Hope this helps. After looking more closely into multiBigwigSummary: read counts in each region are averaged over the number of bins. Meaning that the end score is dependent on the gene length. After normalizing gene counts from the original bam by length, now the correlation score is 0.95.

I agree that in most cases focusing on gene boundaries for getting BED scores is not the best approach. In Spombe yeast though, many genes are signle-exon which makes an acceptable "approximation" of expression. Of course, finding the exact expression values needs some workaround.

Best,

ADD COMMENT

Login before adding your answer.

Traffic: 1825 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6