If I have a set of nucleotide sequences of varying lengths and I want to find the number of times a certain subsequence occurs in each sequence of this set, what is the best way to normalize so I can compare the number of hits each sequence has? I suspect that longer sequences are more likely to contain more hits, so I need to take gene length into account. My first instinct is to simply divide by gene length (or by kilobases) to get a per-kilobase frequency, but I don't know whether this is meaningful. How is this kind of comparison normally done?
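For what it's worth, hits per kilobase is the usual first pass. A minimal sketch in base R of what I mean, assuming the sequences live in a named character vector `seqs` and the motif is a fixed string (both names and the example data here are hypothetical):

```r
# Count occurrences of a fixed motif in each sequence, then normalize per kilobase.
seqs <- c(geneA = "ATGCCAGATTTCCAGACCAGA",
          geneB = "ATGCCAGACCAGAAAACCAGACCAGATTTTTTCCAGA")
motif <- "CCAGA"

count_hits <- function(seq, motif) {
  m <- gregexpr(motif, seq, fixed = TRUE)[[1]]   # non-overlapping matches
  if (m[1] == -1) 0L else length(m)
}

hits   <- vapply(seqs, count_hits, integer(1), motif = motif)
per_kb <- hits / (nchar(seqs) / 1000)            # hits per kilobase

data.frame(length = nchar(seqs), hits = hits, hits_per_kb = per_kb)
```

Whether per-kb is meaningful still depends on whether the sequences have comparable base composition; that is where statistics like rho or a z-score (mentioned in the answer below) come in.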
A related question I have is how to normalize by position. Say I want to see whether hits generally tend to occur toward the beginning of the coding sequence or the end of the 5'UTR. How can I rescale each gene to a common length, say positions 1-100 (including UTRs), so that I can plot a distribution of subsequence occurrence by (scaled) position? I know how this could be done, but I want to make sure I do it in a way that is generally considered acceptable.
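The common approach for the positional question is to convert each hit's start coordinate to a relative position (start / gene length), which maps every gene onto the same 1-100 scale regardless of length, and then pool the values. A sketch continuing from the hypothetical `seqs` and `motif` above:

```r
# Map each motif hit onto a common 1-100 scale so genes of different
# lengths can be pooled into one positional distribution.
hit_positions <- function(seq, motif) {
  m <- gregexpr(motif, seq, fixed = TRUE)[[1]]
  if (m[1] == -1) integer(0) else as.integer(m)
}

scaled <- unlist(lapply(seqs, function(s) {
  pos <- hit_positions(s, motif)
  ceiling(100 * pos / nchar(s))          # start position scaled to 1-100
}))

hist(scaled, breaks = seq(0, 100, by = 5),
     xlab = "Scaled position (1-100)",
     main = "Motif occurrences by relative position")
```

If you want to mark where the 5'UTR ends and the CDS begins on that scale, you would also need the per-gene annotation, but the scaling itself is the same.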
There are many types of normalization you can do. See the seqinr package in R: it provides tools such as the z-score and rho statistics. For the analysis, I think you can use a chi-squared (χ²) test of independence, or simply a Euclidean or correlation distance.
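In case it helps, this is roughly how seqinr's word counting and rho are used (a sketch; check `?count` and `?rho` for the exact arguments, and note that rho is computed for all words of a given size rather than for a single motif; the sequence here is made up):

```r
# Sketch with seqinr: raw word counts and rho (observed/expected ratio).
library(seqinr)

myseq <- s2c("atgccagatttccagaccagaaaaccagattt")   # sequence as a character vector

count(myseq, wordsize = 5)   # raw counts of all 5-mers
rho(myseq, wordsize = 2)     # over-/under-representation of dinucleotides
```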
If I were to use rho, how might I approach cases where there are multiple options? The rho() function will give me over- and under-representation for combinations of a specified size. But what if I want to know the rho for two different sequences, or for ones containing Ns? For example, my sequence might be "CCWGA", which is really either "CCAGA" or "CCTGA". For another case, what about "RACNNNGC"? How can I combine rho values for multiple cases?

Rho is an absolute measure of independence between a word and the sequence over a given alphabet. As far as I know, you can edit the alphabet used for the sequence. A missing value (an N) simply takes p = 1. For the other IUPAC codes, use the sum of the frequencies of the letters they stand for. Try modifying the rho function to add these options.
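To make that suggestion concrete, here is a hand-rolled sketch (my own interpretation, not a seqinr function) of a rho for a word containing IUPAC ambiguity codes: the observed frequency sums over all windows matching the pattern, and each ambiguous position contributes the sum of its letters' frequencies to the expectation, so an N contributes 1.

```r
# rho for an ambiguous (IUPAC) word:
#   rho = f(windows matching the pattern) / prod(sum of base frequencies per position)
iupac <- list(a = "a", c = "c", g = "g", t = "t",
              r = c("a", "g"), y = c("c", "t"), w = c("a", "t"),
              s = c("c", "g"), k = c("g", "t"), m = c("a", "c"),
              n = c("a", "c", "g", "t"))

rho_ambiguous <- function(seq, word) {
  word   <- tolower(unlist(strsplit(word, "")))
  k      <- length(word)
  basefr <- table(seq) / length(seq)                     # single-base frequencies

  # observed frequency: fraction of k-mers matching the ambiguous pattern
  kmers   <- sapply(seq_len(length(seq) - k + 1),
                    function(i) seq[i:(i + k - 1)])
  matches <- apply(kmers, 2, function(km)
                     all(mapply(function(b, code) b %in% iupac[[code]], km, word)))
  f_obs   <- mean(matches)

  # expected frequency: product over positions of the summed base frequencies
  f_exp <- prod(sapply(word, function(code) sum(basefr[iupac[[code]]], na.rm = TRUE)))

  f_obs / f_exp
}

myseq <- unlist(strsplit("atgccagatttccagaccagaaaaccagattt", ""))
rho_ambiguous(myseq, "ccwga")
```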