I have a set of kmer counts coming from 2 groups, each with 25 RNA-seq samples. I'm interested in identifying kmers whose counts differ between the 2 groups. For example, I have the counts of the 3mer AAT for each sample in both groups, and I want to test whether the number of occurrences of this 3mer is significantly different between the 2 groups. Note that I normalize my data to account for the different library sizes of the samples. Would it be correct to treat this as testing whether two distributions are significantly different (e.g., whether the distribution of AAT counts in the first group differs significantly from the distribution of AAT counts in the second group)? In that case I could use a statistical test such as the Kolmogorov–Smirnov test, or is there a better approach to tackle this problem?
thanks
Are you expecting a different answer than when you posed a similar question (k-mer analysis in RNA-seq) yesterday?
Yes, because I don't think we can use DESeq for this problem, given that we are not trying to detect differentially expressed genes here...
In essence it is the same, though. It doesn't matter what the names are (gene names or k-mer names). You should go with one of the prominent tools, since you will most likely get a distribution that can be modelled by a negative binomial (NB), so DESeq2, edgeR, etc. are the best choice...
The question boils down to asking whether counts, that are likely well described by a negative binomial distribution, are changed by a treatment. DESeq2/edgeR/etc. are just implementations of such a GLM-based testing procedure, so they can still be used.
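To illustrate the point that kmer names can simply take the place of gene names, here is a minimal Python sketch that builds a kmers-by-samples matrix of raw counts, which is the shape of input that count-based tools such as DESeq2 or edgeR work on. The sample names, the toy reads and the helper functions `count_kmers` and `build_count_matrix` are made up for illustration and are not part of any tool. Note that DESeq2 expects raw, unnormalized counts and handles library size itself, so the matrix here is deliberately left unnormalized.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every overlapping k-mer across a list of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_count_matrix(samples, k):
    """Return (kmers, sample_names, rows): one row of raw counts per k-mer."""
    per_sample = {name: count_kmers(reads, k) for name, reads in samples.items()}
    sample_names = sorted(per_sample)
    # Union of all k-mers seen in any sample; a k-mer missing from a sample counts as 0.
    kmers = sorted(set().union(*per_sample.values()))
    rows = [[per_sample[s][kmer] for s in sample_names] for kmer in kmers]
    return kmers, sample_names, rows


# Hypothetical toy data: one sample per group, two short reads each.
samples = {
    "groupA_1": ["AATGAAT", "CAATG"],
    "groupB_1": ["GGCAAT", "TTTAAT"],
}

kmers, sample_names, rows = build_count_matrix(samples, k=3)

# Print as a tab-separated table (k-mers as rows, samples as columns).
print("kmer\t" + "\t".join(sample_names))
for kmer, row in zip(kmers, rows):
    print(kmer + "\t" + "\t".join(str(c) for c in row))
```

A table like this, together with a sample sheet assigning each column to its group, plays the same role as a gene count matrix in a standard differential expression analysis.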
:) almost simultaneously
I guess the internet latency to Bonn is a bit longer than to Stuttgart :P
It depends on the test, but it may be a good idea to get rid of infrequent kmers: kmers with frequency 1 may account for a large portion of your kmer set and are a product of sequencing errors (as opposed to true biological signal).
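A filter along those lines could look like the sketch below. The thresholds `min_total` and `min_samples` and the function name `filter_rare_kmers` are arbitrary assumptions chosen for illustration; sensible cutoffs depend on sequencing depth and on the test used afterwards.

```python
def filter_rare_kmers(counts, min_total=10, min_samples=2):
    """Keep k-mers with enough total counts, seen in enough samples.

    counts: dict mapping k-mer -> list of raw per-sample counts.
    Both thresholds are illustrative assumptions, not recommended values.
    """
    kept = {}
    for kmer, row in counts.items():
        total = sum(row)
        n_present = sum(1 for c in row if c > 0)
        if total >= min_total and n_present >= min_samples:
            kept[kmer] = row
    return kept


# Hypothetical counts for three k-mers across four samples.
counts = {
    "AAT": [12, 9, 15, 11],  # well supported in every sample
    "CGT": [1, 0, 0, 0],     # seen once overall, plausibly a sequencing error
    "GGA": [0, 0, 14, 0],    # abundant, but only in a single sample
}

filtered = filter_rare_kmers(counts, min_total=10, min_samples=2)
print(sorted(filtered))
```

Filtering before testing also reduces the number of hypotheses, which eases the multiple-testing burden when many kmers are tested.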