Sampling protein coding genes of the same length distribution as another set of elements using R (GRanges)
9.0 years ago

Hello,

I have a set of elements with the following distribution of lengths:

summary(width(positivelincrnas))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    470    4164    9872   18940   20790  152600

and another dataset with the following distribution:

summary(width(positivegeneshg19))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     20    5558   20460   59880   58360 4829000

I would like to sample elements from the second dataset (genes) so that their length distribution matches that of the first set (lincRNAs). Both are GRanges objects.

Any suggestions?

Thanks a lot,
Dimitris

R GRanges • 2.0k views
9.0 years ago

To match the length distributions, you can estimate the length density of the first data set and then sample from the second data set with probabilities derived from that density. Assume we have two GRanges objects: gr1 (positivelincrnas) and gr2 (positivegeneshg19). The trick is a weighted sampling scheme: each element of gr2 is drawn with a probability taken from the gr1 histogram bin that its width falls into.

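bins = seq(0, max(width(gr1)) + 1000, by = 1000) ## breaks must span every width in gr1; choose the bin size according to your dataset
h = hist(width(gr1), bins, plot = FALSE)
idx = cut(width(gr2), bins, labels = FALSE) ## NA for elements of gr2 longer than the last break
w = h$density[idx]; w[is.na(w)] = 0 ## elements outside the bins get weight 0 and are never drawn
final_size = 1000 ## number of elements to draw (example value); adjust it and the 'replace' argument
gr2matched = gr2[sample(length(gr2), final_size, prob = w)]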
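
One point worth checking with this kind of weighting: if a length bin holds many elements of the second set, those elements collectively attract more draws than the first set's density alone would suggest. Below is a minimal sketch of one possible variation, not the method above, that divides each weight by the number of second-set elements in the same bin. It uses simulated widths and base R only, so it runs without GenomicRanges. The names `w1`, `w2`, `wt` and all numeric values are made up for illustration; with real data one would presumably use `width(gr1)` and `width(gr2)` and subset with `gr2[picked]`.

```r
set.seed(1)

# Simulated stand-ins for width(gr1) and width(gr2); the values are invented.
w1 <- round(rlnorm(2000,  meanlog = 9.2, sdlog = 1.0)) + 200   # target set
w2 <- round(rlnorm(20000, meanlog = 9.9, sdlog = 1.5)) + 20    # pool to sample from

# Log-spaced breaks that span every width in both sets.
n_breaks <- 31
bins <- 10^seq(log10(min(w1, w2)), log10(max(w1, w2)), length.out = n_breaks)
bins[1] <- bins[1] - 1                 # guard against rounding at the edges
bins[n_breaks] <- bins[n_breaks] + 1

target <- hist(w1, breaks = bins, plot = FALSE)$counts   # target elements per bin
pool   <- hist(w2, breaks = bins, plot = FALSE)$counts   # pool elements per bin
idx    <- cut(w2, bins, labels = FALSE)                  # bin index of each pool element

# Weight = fraction of the target in the bin, shared among the pool elements in that bin.
wt <- (target / sum(target))[idx] / pool[idx]

picked  <- sample(length(w2), size = 1000, replace = FALSE, prob = wt)
matched <- w2[picked]

print(summary(w1))
print(summary(matched))
```

Comparing the two summaries (or a quantile-quantile plot) gives a quick check of how close the match is. Sampling without replacement can only approximate the target when some bins contain fewer pool elements than the target requires, so the bin width and sample size may need tuning.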