9.1 years ago
jolespin
So I'm trying to look at k-mer frequencies for a bunch of genes that have different lengths. If they were all the same length, raw counts would be a good measure. I'm planning to normalize them by dividing each count by the length of the sequence. Is that the right way to do it? Is there another normalization method that is typically used for this type of analysis?
Well, to obtain frequencies, you should divide by the total number of counts. For a particular gene, this will usually be L-k+1 (L = length of the gene). However, IMHO it is safer to sum up the counts and then divide by that sum. For example, if you do some odd kind of counting, like counting only k-mers at even positions (for whatever reason!), and divide by L-k+1, your normalized count vector would not sum to 1.
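Here is a minimal Python sketch of the "sum the counts, then divide" idea. The function name `kmer_frequencies` and the `step` parameter are made up for illustration (`step=2` mimics counting only at even positions); this is just one way to do it, not a reference implementation.

```python
from collections import Counter


def kmer_frequencies(seq, k, step=1):
    """Return k-mer frequencies normalized by the total number of counted k-mers.

    `step` is a hypothetical knob: step=1 counts every overlapping k-mer,
    step=2 counts only k-mers starting at even (0-based) positions.
    """
    # Collect k-mers at the chosen start positions.
    counts = Counter(seq[i:i + k] for i in range(0, len(seq) - k + 1, step))
    # Divide by the observed total rather than assuming it equals L-k+1.
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


all_positions = kmer_frequencies("ACGTACGT", k=2)           # 7 k-mers counted
even_positions = kmer_frequencies("ACGTACGT", k=2, step=2)  # 4 k-mers counted
```

In both calls the frequencies sum to 1, because the denominator is whatever was actually counted. Dividing the `step=2` counts by L-k+1 = 7 instead would give a vector that sums to 4/7.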
[EDIT:] Depending on your downstream analysis, you can also normalize the count vector to a length of 1 (Euclidean norm).
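A rough sketch of the Euclidean variant in Python, assuming the counts are held in a plain dict keyed by k-mer. The helper name `l2_normalize` is hypothetical.

```python
import math


def l2_normalize(counts):
    """Scale a k-mer count dict so that the vector has Euclidean length 1."""
    # Euclidean (L2) norm: square root of the sum of squared counts.
    norm = math.sqrt(sum(v * v for v in counts.values()))
    if norm == 0:
        # All-zero (or empty) vector: nothing to scale.
        return dict(counts)
    return {kmer: v / norm for kmer, v in counts.items()}


unit = l2_normalize({"AC": 3, "GT": 4})  # norm is 5
```

Note that after this scaling the squared entries sum to 1, not the entries themselves, so the result is no longer a frequency vector. It tends to suit analyses based on dot products or cosine similarity.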