my.matrix <- as.matrix( ExomeCount.dafr[, my.choice$reference.choice, drop = FALSE])
my.reference.selected <- apply(X = my.matrix, MAR = 1, FUN = sum)
Question #1:
In this test case there are 3 samples selected as the appropriate reference set. Why did they use FUN = sum instead of FUN = mean? Wouldn't that make the reference set counts a lot higher than the test set counts?
Question #2:
The other question is about reference set selection. From the manual, it looks like all the samples other than the one being tested are passed to the function select.reference.set. However, if I have 14 cancer samples and 2 normal samples, should I pass only the 2 normal samples to select.reference.set? Or should I use all 15 samples other than the one currently being tested?
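Question #1: Fair observation, but I guess that they normalize by the total number of reads, so summing or averaging makes no real difference: the sum is just the mean multiplied by the number of reference samples (3 here), and that constant factor cancels once the counts are scaled by total reads. You ALWAYS need to normalize for the different total throughput of different experiments.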
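To make the scaling argument concrete, here is a minimal sketch in base R. The count matrix is made up for illustration (4 exons, 3 reference samples) and has nothing to do with ExomeCount.dafr. It only shows that the summed and averaged references give identical per-exon fractions once each is divided by its own total. It does not reproduce what ExomeDepth does internally.

```r
# Toy counts: 4 exons x 3 reference samples (made-up numbers, not real data)
ref.counts <- matrix(c(100, 120, 110,
                       200, 210, 190,
                        50,  55,  45,
                       300, 330, 270),
                     nrow = 4, byrow = TRUE)

ref.sum  <- rowSums(ref.counts)   # equivalent to apply(ref.counts, 1, sum)
ref.mean <- rowMeans(ref.counts)  # equivalent to apply(ref.counts, 1, mean)

# The two references differ only by a constant factor: the number of samples
stopifnot(isTRUE(all.equal(ref.sum, ref.mean * ncol(ref.counts))))

# Scaling each reference by its own total removes that factor
frac.sum  <- ref.sum  / sum(ref.sum)
frac.mean <- ref.mean / sum(ref.mean)
stopifnot(isTRUE(all.equal(frac.sum, frac.mean)))
```

Summing does have one practical advantage over averaging: the result stays an integer read count.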
Question #2: It depends. On page 5 of the manual they say: "A key idea behind ExomeDepth is that each exome should not be compared to all other exomes but rather to an optimized set of exomes that are well correlated with that exome". My first choice would be to use healthy tissue as the reference. The best option is always to compare a patient's cancer tissue with their own blood (or at least with healthy tissue from the same patient). The second choice is to compare cancer samples with healthy tissue from different patients. I wouldn't use another cancer sample as a reference, even one from a different kind of cancer, for two reasons: it might carry the same CNVs as your sample, so you would miss those calls, and cancerous tissue often has a very messy CNV landscape, which would bias the results.
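As a rough sketch of the "normals only" option, the candidate list could be built like this. The sample names and tissue labels are hypothetical, chosen to match the 14 cancer / 2 normal layout in the question. The ExomeDepth call is left as a comment, following the argument names used in the vignette, because it needs real count data to run.

```r
# Hypothetical sample sheet: names and tissue labels are made up for illustration
samples <- data.frame(
  name   = c(paste0("tumour", 1:14), paste0("normal", 1:2)),
  tissue = c(rep("cancer", 14), rep("normal", 2)),
  stringsAsFactors = FALSE
)

test.sample <- "tumour1"

# Candidate references: normal samples only, never the test sample itself
candidates <- setdiff(samples$name[samples$tissue == "normal"], test.sample)

# Assuming ExomeCount.dafr has one count column per sample name above, the
# call from the vignette would then be restricted to those columns:
#   my.reference.set <- as.matrix(ExomeCount.dafr[, candidates])
#   my.choice <- select.reference.set(
#     test.counts      = ExomeCount.dafr[[test.sample]],
#     reference.counts = my.reference.set,
#     bin.length       = (ExomeCount.dafr$end - ExomeCount.dafr$start) / 1000,
#     n.bins.reduced   = 10000)
```

Note that with only two candidates, the optimized reference set can contain at most two samples.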