I have some atac-seq and chip-seq data which I have processed using deeptools to generate normalized bigwig files. I am now using these bigwig file to plot profiles over regions of interest but am having problems with outliers distorting the plots in certain regions. I was wondering if there is any tool that can be used to remove outlier regions from bigwig files?
Thanks for the replies. I have already removed pcr duplicates from the bam files but I still get outliers in some areas in both control and treatment samples. These spikes are up to 20 or 30 times higher than the average coverage in most areas.
I think I will try removing the top 1 percentile in R. Any suggestions for an R package the allows importing bigwig files as a matrix?
How do you define outliers ? regions with high coverage ? If you were to filter out those regions, then you might remove bound regions (if we talk about ChIP-seq) or other regions of interest. So I do not think this is a good strategy.
Alternatively, you might:
[quickfix] Use a fixed y scale for visualization (instead of going on auto-min-max). So even if the coverage reaches extremely high level, it will be capped at your chosen maximum.
[slower but probably better] Go back to the .bam alignment files, and mark and filter out duplicate reads. PCR or optical duplicates can cause reads pile-up artefacts. Once done, re-create the bigwig files.
You could winsorize your data based on quantiles. This means you first identify values that are e.g. greater than 99% of the data and then replace the values of these "larger data" by the value of this 99th percentile. Example code for some dummy numeric data below. This assumes that you first imported your data into R and now have it as either a matrix, data frame or similar container with numeric values:
## example numeric data
numeric.data <- rnorm(1000, 100,25)
## plot raw data
par(mfrow=c(1,2), bty="L")
boxplot(numeric.data, ylim=c(0, 200))
## identify the 99th percentile
quant99perc <- quantile(numeric.data, .99)
## replace values larger than that with the value of the 99th percentile
numeric.data[which(numeric.data > quant99perc)] <- as.numeric(quant99perc)
## plot winsorized data
boxplot(numeric.data, ylim=c(0, 200))
As you will see in the plot the outliers have been removed by scaling them down without changing all the other values.
Here a suggestion on how to import bigwig data into R:
However, one confusion I have is if I should normalize the bigwig files for each condition separately. For example, if I have one control and multiple treatment conditions, should I use the 99 percentile of each treatment to scale down its outliers or scale down all treatment conditions to the 99 percentile of my control sample. I am scaling everything separately for now but was wondering which is the better way.
Thanks for the replies. I have already removed pcr duplicates from the bam files but I still get outliers in some areas in both control and treatment samples. These spikes are up to 20 or 30 times higher than the average coverage in most areas.
I think I will try removing the top 1 percentile in R. Any suggestions for an R package the allows importing bigwig files as a matrix?
Thanks
Yes,
rtracklayer
, I updated my answer.