I am looking at a particular genomic feature in two sets of genes: set A is a positive control set, where I know this mark is overall enriched in the genomic DNA of these genes, and set B is a (much larger) negative control set, where it occurs at a lower, background frequency. I wanted to see if the distribution of this feature within each genomic region differs between set A and set B, and below I have plotted the density of this feature along the length of all genes in each set, from 0-100% of the transcribed gene length, along with 4 kb upstream and downstream (gray shading). Set A is red, set B is blue.
The distributions are different (and highly significantly so, by Kolmogorov-Smirnov test): the feature is uniformly distributed in set B, but shows a 3' (rightward) bias in set A, with almost no marks upstream of the transcribed region and an increased density of marks at and beyond the end of the transcribed region.
There are many genes in "set C," not plotted, for which we aren't able to assign membership in set A or B based solely on the number of marks per gene -- I would like to be able to use the information shown here as part of a classifier approach, weighting the value of marks in a new gene based on where they fall in its length. Clearly, a mark upstream of the TSS should be discounted for membership in set A, while a mark near the 3' end should be given greater weight. What I'd like advice on is the best way to extract quantitative information from the density comparison I have performed. Would it make sense to calculate the relative density at a given position here, and then apply that directly as a weight to marks found in other genes? Below is a plot of the relative density, calculated (in R) as `density(setA$pos)$y / density(setB$pos)$y`, after calling `density` with identical parameters for the two sets.
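To make the relative-density calculation concrete, here is a minimal sketch of how it might be set up so that the two curves are comparable point by point. It assumes `pos` is expressed as a percentage of gene length, with the flanking regions mapped below 0 and above 100; the simulated data, the range `rng`, the floor `eps`, and the helper `mark_weight` are all hypothetical and not part of my actual analysis. Note that "identical parameters" needs to include `from`, `to`, `n` and `bw`, because by default `density` derives both the grid and the bandwidth from each data set separately.

```r
# Simulated stand-ins for the real data (hypothetical).
set.seed(1)
setA <- data.frame(pos = 100 * rbeta(500, 3, 1.5))   # 3'-biased marks
setB <- data.frame(pos = runif(5000, -20, 120))      # roughly uniform marks

# Shared bandwidth and shared evaluation grid for both sets, so that
# dA$y[i] and dB$y[i] refer to the same position dA$x[i].
rng <- c(-20, 120)
bw  <- bw.nrd0(c(setA$pos, setB$pos))
dA  <- density(setA$pos, bw = bw, from = rng[1], to = rng[2], n = 512)
dB  <- density(setB$pos, bw = bw, from = rng[1], to = rng[2], n = 512)

# Floor both densities so the ratio stays finite where either is near zero,
# and work on the log scale so enrichment and depletion are symmetric.
eps       <- 1e-4
log_ratio <- log((dA$y + eps) / (dB$y + eps))

# Lookup function: position -> log density ratio (a candidate per-mark weight).
# rule = 2 holds the end values constant outside the grid.
mark_weight <- approxfun(dA$x, log_ratio, rule = 2)

mark_weight(c(-10, 50, 95))
```

The log ratio is used here rather than the raw ratio only because it treats depletion and enrichment symmetrically and sums naturally across marks; the choice of `eps` is arbitrary and would affect the weights in sparse regions.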
Another approach I've considered is to break the distribution into bins, like the histogram above, identify individual bins where the relative frequency significantly differs between set A and B, and for those bins specifically use the ratio as a weighting factor. In any event, this would be one of several factors I would use for weighting -- essentially, I'm looking for features beyond the total number of marks that make it possible to distinguish additional members of set A. I should note that I have held out a large number of independent samples, to validate whatever weighting procedure I come up with.
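The binned alternative could be sketched roughly as follows. This is only an illustration under assumptions: the data are simulated, the 10%-wide bins, the per-bin Fisher's exact test, the Benjamini-Hochberg correction, the 0.05 cutoff and the 0.5 pseudocount are all choices I made up for the example rather than anything I have settled on.

```r
# Simulated stand-ins for the real mark positions (hypothetical).
set.seed(2)
posA <- 100 * rbeta(500, 3, 1.5)
posB <- runif(5000, -20, 120)

# Count marks per bin in each set, using the same breaks for both.
breaks <- seq(-20, 120, by = 10)
cntA <- as.vector(table(cut(posA, breaks, include.lowest = TRUE)))
cntB <- as.vector(table(cut(posB, breaks, include.lowest = TRUE)))

# Per-bin 2x2 test: marks inside vs. outside this bin, set A vs. set B.
pvals <- sapply(seq_along(cntA), function(i) {
  tab <- matrix(c(cntA[i], sum(cntA) - cntA[i],
                  cntB[i], sum(cntB) - cntB[i]), nrow = 2)
  fisher.test(tab)$p.value
})
padj <- p.adjust(pvals, method = "BH")   # correct for testing many bins

# Log ratio of relative frequencies, with a pseudocount so empty bins stay finite.
k  <- length(cntA)
lr <- log(((cntA + 0.5) / (sum(cntA) + 0.5 * k)) /
          ((cntB + 0.5) / (sum(cntB) + 0.5 * k)))

# Bins that do not differ significantly get a neutral weight of 0.
bin_weight <- ifelse(padj < 0.05, lr, 0)
data.frame(lower = head(breaks, -1), upper = breaks[-1], cntA, cntB, bin_weight)
```

One drawback visible in the sketch is that the weights jump at bin edges and depend on where the breaks are placed, which the smoothed density ratio avoids.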
I would appreciate any suggestions for statistically valid methods to leverage the different distributions that I've identified. I guess this is a two-part question, the first part of which is probably easier to answer: (1) what is the best way to get useful weights from this comparison of feature density, and (2) what is the best way to use such weights? The simplest, and certainly dumbest, thing to do would be to scale my raw counts by a weighting factor, add them up for each gene and round to the nearest integer, and then perform a chi-squared or similar analysis (which is what I originally did, with raw counts, to identify the set A genes). I assume any statistician would consider this a basis for justifiable homicide, though.
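For comparison with the round-and-chi-squared idea, one alternative I can imagine is to keep the summed weights as a continuous per-gene score and feed it, alongside the raw count, into a logistic regression. The sketch below is only a guess at what that might look like: the data are simulated, `mark_weight` is a made-up linear stand-in for a real position weight, and the column names `gene`, `n`, `score` and `is_A` are hypothetical.

```r
set.seed(3)

# Hypothetical per-position weight; a stand-in for a log density ratio.
mark_weight <- function(pos) (pos - 50) / 50

# Simulated marks: one row per mark, with a gene id and a position (%).
n_genes <- 200
is_A    <- rep(c(1, 0), each = n_genes / 2)
marks <- do.call(rbind, lapply(seq_len(n_genes), function(g) {
  k   <- 1 + rpois(1, if (is_A[g] == 1) 6 else 3)
  pos <- if (is_A[g] == 1) 100 * rbeta(k, 3, 1.5) else runif(k, -20, 120)
  data.frame(gene = g, pos = pos)
}))

# Per-gene features: raw mark count and summed position weight (no rounding).
genes <- data.frame(
  gene  = seq_len(n_genes),
  is_A  = is_A,
  n     = as.vector(tapply(marks$pos, marks$gene, length)),
  score = as.vector(tapply(mark_weight(marks$pos), marks$gene, sum))
)

# Logistic regression on the labelled genes; the count and the positional
# score enter as separate covariates.
fit <- glm(is_A ~ n + score, family = binomial, data = genes)
coef(fit)

# Predicted probability of set A membership for each gene.
genes$p_A <- predict(fit, type = "response")
```

In practice the model would be fit on the labelled set A and set B genes and then evaluated on the held-out samples before being applied to set C; whether the positional score adds anything beyond the count is exactly what that validation would show.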