How do you remove coverage outliers in sequencing count data?
2
0
Entering edit mode
8.6 years ago
biohack92 ▴ 170

I've recently looked at methylation coverage (bisulfite seq data) in IGV, and there are obvious coverage outliers which I'm interpreting as mismappings of repetitive sequences. How do you identify these outliers from methylation calls/count data (I don't have BAM/SAM files) and remove them?

bisulfite sequencing qc R quality control • 3.0k views
ADD COMMENT
3
Entering edit mode
8.6 years ago

The quantile function in R is handy in these cases. To discard datapoints in the top 1% (and keep the bottom 99%) do:

xv<- rnbinom(10000, 1, 0.1)     ## Test data
qqcut<- quantile(xv, p= 0.99)   ## Cut off to use in R or awk
xvcut<- xv[xv < qqcut]          ## Keep only points in lower 99%
ADD COMMENT
1
Entering edit mode
8.6 years ago

Load the coverage into R, plot the distribution, and then use awk to filter the files given some reasonable threshold given the coverage distribution.

ADD COMMENT
0
Entering edit mode

Thanks @Devon Ryan. I followed your advice and this is the plot I created. X-axis shows the # of reads/counts and Y-axis is the frequency of each count. Is there a way to determine which threshold is 'reasonable'?

ADD REPLY
3
Entering edit mode

You could make a qq-plot, but it's generally OK to just eye-ball things. From the distribution, it looks like a threshold around 200 would be reasonable.

ADD REPLY

Login before adding your answer.

Traffic: 922 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6