Hi All,
Does anyone know a software that can calculate gene count matrix from a bigwig file? Preferred Python.
Thanks,
Hi All,
Does anyone know a software that can calculate gene count matrix from a bigwig file? Preferred Python.
Thanks,
Update. Hope this helps. After looking more closely into multiBigwigSummary: read counts in each region are averaged over the number of bins. Meaning that the end score is dependent on the gene length. After normalizing gene counts from the original bam by length, now the correlation score is 0.95.
I agree that in most cases focusing on gene boundaries for getting BED scores is not the best approach. In Spombe yeast though, many genes are signle-exon which makes an acceptable "approximation" of expression. Of course, finding the exact expression values needs some workaround.
Best,
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
ATpoint raised important issues here and I tend to agree with him. That being said, deeptools multiBigwigSummary used with the options
--outRawCounts
and--BED
(with a GTF file) can summarize read counts on exons from bigwig files.By best knowledge you cannot reliable do this. The bigwig format is a per-base or per-window format that summarizes coverage over these bases or windows but it is not suited to give exact counts over e.g. exons etc. That information you seek is lost.
Not sure if it is a good idea either, nevertheless, some people seem to do that Here . Just that, there is no code published.
I would never do this. Maybe if it was my own data, I was 100% sure how the bigwig was created and I had absolutely no other choice because the raw data were gone and I urgently needed that information, but else, no never. If you are going to do it anyway, treat whatever results you get with care and try to validate them with independent approaches.
Thank you a lot for this hint. I would do as you suggested.
Hi All,
So I ran multiBigwigSummary on the bigwig file. As follows:
Then I checked the correlation between these counts and original expression data and it was around 0.7.
As a sanity check I also ran the same pipeline on a bigwig for a known file with already calculated gene expression levels. The correlation is around the same!
I wonder if there is a more reliable way to get gene expression from a BED file. Or is this information loss inherent?
Thank you,
Hi, Here are some thoughts about your problem:
With only the gene lines of a gtf, you'll also get also the introns. These are usually ignored while computing gene expression.
I don't know how deeptools summary works, but most gene counting tools rely only on uniquely mapping reads and discard ambiguous assignments. I guess it's hard to assess the ambiguity based on coverage rather than assigning a read by the portion of overlap to a certain feature.
BED files are generally not complex enough to model the multi-exon multi-transcript genes. You can convert a GTF/GFF file into BED12 format but loose the transcript-gene relationship.