Removing outliers from bigwig files
2
0
Entering edit mode
4.8 years ago
srhic ▴ 60

Hello,

I have some atac-seq and chip-seq data which I have processed using deeptools to generate normalized bigwig files. I am now using these bigwig file to plot profiles over regions of interest but am having problems with outliers distorting the plots in certain regions. I was wondering if there is any tool that can be used to remove outlier regions from bigwig files?

Thanks

ChIP-Seq deeptools next-gen • 1.9k views
ADD COMMENT
0
Entering edit mode

Thanks for the replies. I have already removed pcr duplicates from the bam files but I still get outliers in some areas in both control and treatment samples. These spikes are up to 20 or 30 times higher than the average coverage in most areas.

I think I will try removing the top 1 percentile in R. Any suggestions for an R package the allows importing bigwig files as a matrix?

Thanks

ADD REPLY
0
Entering edit mode

Yes, rtracklayer, I updated my answer.

ADD REPLY
0
Entering edit mode
4.8 years ago

How do you define outliers ? regions with high coverage ? If you were to filter out those regions, then you might remove bound regions (if we talk about ChIP-seq) or other regions of interest. So I do not think this is a good strategy.

Alternatively, you might:

  1. [quickfix] Use a fixed y scale for visualization (instead of going on auto-min-max). So even if the coverage reaches extremely high level, it will be capped at your chosen maximum.

  2. [slower but probably better] Go back to the .bam alignment files, and mark and filter out duplicate reads. PCR or optical duplicates can cause reads pile-up artefacts. Once done, re-create the bigwig files.

ADD COMMENT
0
Entering edit mode
4.8 years ago
ATpoint 86k

You could winsorize your data based on quantiles. This means you first identify values that are e.g. greater than 99% of the data and then replace the values of these "larger data" by the value of this 99th percentile. Example code for some dummy numeric data below. This assumes that you first imported your data into R and now have it as either a matrix, data frame or similar container with numeric values:

## example numeric data
numeric.data <- rnorm(1000, 100,25)

## plot raw data
par(mfrow=c(1,2), bty="L")
boxplot(numeric.data, ylim=c(0, 200))

## identify the 99th percentile
quant99perc <- quantile(numeric.data, .99)

## replace values larger than that with the value of the 99th percentile
numeric.data[which(numeric.data > quant99perc)] <- as.numeric(quant99perc)

## plot winsorized data
boxplot(numeric.data, ylim=c(0, 200))

As you will see in the plot the outliers have been removed by scaling them down without changing all the other values.

enter image description here

Here a suggestion on how to import bigwig data into R:

ADD COMMENT
0
Entering edit mode

Thanks! I will try this

ADD REPLY
0
Entering edit mode

I tried this and it worked perfectly.

However, one confusion I have is if I should normalize the bigwig files for each condition separately. For example, if I have one control and multiple treatment conditions, should I use the 99 percentile of each treatment to scale down its outliers or scale down all treatment conditions to the 99 percentile of my control sample. I am scaling everything separately for now but was wondering which is the better way.

ADD REPLY

Login before adding your answer.

Traffic: 2009 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6