How to normalize K-mer counts for genes of different length
9.1 years ago
jolespin ▴ 150

So I'm trying to look at k-mer frequencies for a bunch of genes and they are different lengths. If they were all the same length then counts would be a good measure. I'm going to normalize them by dividing each count by the length of the sequence. Is that the right way to do it? Is there another normalization method that is typically used for this type of analysis?

gene RNA-Seq kmer genome sequence • 3.1k views

Well, to obtain frequencies, you should divide by the total number of counts. For a single gene, this will usually be L-k+1 (L = length of the gene). However, it is safer to sum up the counts and then divide by that sum, IMHO. For example, if you do some creepy kind of counting, like counting only k-mers at even positions (for whatever reason!), and divide by L-k+1, your normalized count vector would not sum to 1.
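
A minimal Python sketch of the point above, assuming plain-string sequences and overlapping k-mers. The function names (`kmer_counts`, `kmer_frequencies`), the `step` parameter, and the toy sequence are made up for illustration and do not come from any particular library:

```python
from collections import Counter

def kmer_counts(seq, k, step=1):
    """Count k-mers in seq, starting at every `step`-th position."""
    return Counter(seq[i:i + k] for i in range(0, len(seq) - k + 1, step))

def kmer_frequencies(counts):
    """Divide each count by the sum of all counts, so the values sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}

seq = "ACGTACGTAC"  # toy sequence, L = 10
k = 3

# Ordinary counting: the total number of k-mers equals L-k+1.
counts = kmer_counts(seq, k)
freqs = kmer_frequencies(counts)

# Counting only at even positions: the total is no longer L-k+1,
# so dividing by L-k+1 gives a vector that does not sum to 1.
even_counts = kmer_counts(seq, k, step=2)
by_length = {kmer: n / (len(seq) - k + 1) for kmer, n in even_counts.items()}
by_total = kmer_frequencies(even_counts)

print(sum(freqs.values()), sum(by_length.values()), sum(by_total.values()))
```

Dividing by the observed total keeps the frequencies summing to 1 no matter how the k-mers were counted, which is why it is the more robust choice.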

[EDIT:] Depending on your downstream analysis, you can also normalize to a vector length of 1 (Euclidean norm).
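
For the Euclidean variant, a rough sketch could look like this. The function name `l2_normalize` and the counts are invented for the example; whether this scaling suits your data depends on the downstream analysis:

```python
import math

def l2_normalize(counts):
    """Scale a k-mer count vector so its Euclidean (L2) norm is 1."""
    norm = math.sqrt(sum(n * n for n in counts.values()))
    if norm == 0:
        return {}
    return {kmer: n / norm for kmer, n in counts.items()}

counts_example = {"ACG": 3, "CGT": 4}  # made-up counts, norm = 5
unit = l2_normalize(counts_example)
squared_sum = sum(v * v for v in unit.values())
print(unit, squared_sum)
```

Note that after this scaling the squared values sum to 1, not the values themselves, so the result is not a frequency vector.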
