I have a set of biological count data in an R data frame with 200,000 entries. I want to write a function that identifies the peaks in the count data. By peaks, I mean the 50 highest counts. I expect there to be multiple peaks in this dataset, as the median value is 0. When I run:
> summary(df$V3)
My output looks like this:
   Min. 1st Qu.  Median    Mean 3rd Qu.     Max.
   0.00    0.00    0.00    1.82    1.00 94746.00
I want to write a function that will list the peaks and then look at the numbers on either side of the peaks (+1 and -1) to produce a ratio. Can anyone help with this?
My dataframe looks like this and is labelled df:
V1 V2 V3
gene 1 6
gene 2 0
gene 3 0
gene 4 10
..
My expected output would be a data frame identifying the peaks and their positions (V2) in the dataset, so that I can examine the numbers on either side of each peak and produce a ratio for analysis.
How do you define a peak? It is also not clear to me what output you expect for the ratios (a ratio of which numbers?).
I will take the ratio of the numbers on either side of the peak. So if there was a peak at position 4, I would take the counts at positions 3 and 5 and compute their ratio.
Why did you delete the post?
Can you give more detail? An example, maybe?
Peaks are usually thought of as a maximum value within a stream of ordered data. While there are many methods for finding peaks, it is hard to recommend one without knowing more about what you expect a peak to look like. For instance, what resolution is required to define a peak? If you were to scan down the column of your data frame looking for maximum values, what window size would be appropriate? Will your peaks appear in windows of 3 data points? 30 or 300 data points? If your windows were 100 data points in size, you could have as few as 2,000 windows to evaluate (non-overlapping), or as many as 200,000 - 100 + 1 (sliding one data point at a time).
Evaluating a window for peakiness will differ depending on whether you expect all windows to contain peaks or only a few. The method you use to decide whether a window of a given size contains more or less signal than expected by chance depends on the nature of your data. Things could be simple or complex, but without more detail it's hard to choose between a quick hack and something more reasoned.
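To make the window arithmetic concrete, here is a small sketch in R. The window size of 100 is an arbitrary choice for illustration, and the counts are simulated (made up), not the poster's data; only the total of 200,000 rows comes from the question.

```r
n <- 200000  # number of rows, as stated in the question
w <- 100     # window size: an arbitrary choice for illustration

n_tiled   <- n %/% w    # non-overlapping windows
n_sliding <- n - w + 1  # overlapping windows, shifted by one row each

# Simulated counts standing in for df$V3 (made-up data).
set.seed(1)
counts <- rpois(n, lambda = 0.5)

# Assign each row to a non-overlapping window and take the per-window maximum.
window_id  <- (seq_len(n) - 1) %/% w + 1
window_max <- tapply(counts, window_id, max)

print(c(tiled = n_tiled, sliding = n_sliding))
print(head(window_max))
```

Whether tiled or sliding windows make more sense depends on how wide you expect a peak to be, which is exactly the detail missing from the question.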
Taking what you've said literally, you could find the 51st-highest value in your dataset and use it as a threshold, step through the data in overlapping windows of 3 data points, and keep those windows where the middle data point is greater than the threshold, along with the ratio of the first value to the last. Just do that for window start positions 1:(200,000 - 2) and see what comes back.
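A minimal, vectorised sketch of that literal approach might look like the following. It assumes the column layout from the question (V2 = position, V3 = count) and that rows are already ordered by position; `find_peaks` is a hypothetical helper name, and the toy values are made up so the result can be checked by eye.

```r
# Toy data in the same shape as the question's df (values are made up).
df <- data.frame(
  V1 = "gene",
  V2 = 1:10,
  V3 = c(6, 0, 3, 10, 2, 0, 4, 50, 8, 1)
)

# find_peaks() is a hypothetical helper, not a function from any package.
find_peaks <- function(df, n_top = 50) {
  counts <- df$V3
  n <- length(counts)
  stopifnot(n > n_top, n >= 3)

  # Threshold = the (n_top + 1)-th highest count, e.g. the 51st for n_top = 50.
  threshold <- sort(counts, decreasing = TRUE)[n_top + 1]

  # Only interior rows have a neighbour on both sides.
  mid  <- 2:(n - 1)
  keep <- mid[counts[mid] > threshold]

  data.frame(
    position = df$V2[keep],
    count    = counts[keep],
    left     = counts[keep - 1],
    right    = counts[keep + 1],
    ratio    = counts[keep - 1] / counts[keep + 1]  # Inf or NaN if right is 0
  )
}

peaks <- find_peaks(df, n_top = 2)
print(peaks)
```

On the real data you would call `find_peaks(df)` with the default of 50. Two caveats: if several rows tie with the threshold value you will get fewer than 50 rows back, and with a median of 0 many neighbours will be 0, so expect `Inf` or `NaN` ratios unless you add a pseudocount.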