I have found a sequence logo in a paper that shows the consensus sequence obtained from CLIP tags for an RNA-binding protein (RBP). The logo is 45 nucleotides long and shows the number of bits for each nucleotide at each position. I would like to check whether this consensus sequence occurs in an RNA of interest, in order to predict whether the RBP is likely to bind that RNA.
The issue I have run into is that the logo implies many possible consensus sequences, for two reasons: 1) length: the consensus sequence need not be the full 45 nucleotides and could be something shorter; and 2) nucleotide choice: a different consensus sequence results depending on which nucleotide is chosen at each position of the logo. Together these produce a very large list of potential consensus sequences.
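For a sense of scale, here is a rough count. It is only a sketch under my own assumptions: shorter candidates are contiguous windows of the logo, and any of the four nucleotides may be chosen at each position. The function name `count_candidates` is just something I made up for illustration, and it is written in Python rather than R purely for brevity.

```python
def count_candidates(logo_len=45, alphabet_size=4, min_len=1):
    """Upper bound on candidates: (window, sequence) pairs over all window lengths.

    Assumes contiguous windows only; the same string arising from two
    different windows is counted twice, so this is an over-count.
    """
    total = 0
    for k in range(min_len, logo_len + 1):
        windows = logo_len - k + 1          # number of length-k windows in the logo
        total += windows * alphabet_size ** k
    return total

full_length_only = 4 ** 45                  # choices for the full 45-nt logo alone
all_lengths = count_candidates()

print(f"full length only: {full_length_only:.2e}")
print(f"all window lengths: {all_lengths:.2e}")
```

Even the full-length case alone is 4^45 = 2^90, on the order of 10^27 sequences, so in practice the list would have to be pruned somehow, for example by only considering the nucleotides with the most bits at each position.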
I have devised the following approach using R:
1) Enumerate all candidate consensus sequences, varying both the length of the sequence and the nucleotide chosen at each position.
2) Rank each candidate by its total number of bits, i.e. the sum of the bits of its chosen nucleotide at each position. This will allow me to determine which sequences are more likely to be representative of the actual consensus sequence.
3) Take the full list of candidates and see which ones align to my RNA of interest. If the sequences that align are of high rank, I will have a greater degree of confidence that the RBP actually binds this RNA in vivo.
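To make the three steps concrete, here is a toy sketch of what I have in mind. Everything in it is hypothetical: the 4-position logo, its bit values, the RNA string, and the function names are invented for illustration and are not taken from the paper. "Align" is simplified to an exact substring match, and the code is in Python only to keep the example short.

```python
from itertools import product

# Hypothetical 4-position logo: bits per nucleotide at each position.
# These numbers are invented for illustration, not read from the paper.
toy_logo = [
    {"A": 0.1, "C": 0.1, "G": 0.2, "U": 1.4},
    {"A": 0.2, "C": 0.0, "G": 1.5, "U": 0.1},
    {"A": 0.9, "C": 0.1, "G": 0.1, "U": 0.7},
    {"A": 0.0, "C": 0.1, "G": 0.1, "U": 1.6},
]

def enumerate_candidates(logo, min_len=3):
    """Step 1: every contiguous sub-window of the logo, every nucleotide choice."""
    candidates = {}
    for start in range(len(logo)):
        for end in range(start + min_len, len(logo) + 1):
            window = logo[start:end]
            for choice in product("ACGU", repeat=len(window)):
                seq = "".join(choice)
                bits = sum(pos[nt] for pos, nt in zip(window, choice))
                # The same string can arise from different windows; keep its best score.
                candidates[seq] = max(bits, candidates.get(seq, 0.0))
    return candidates

def rank_candidates(candidates):
    """Step 2: sort candidates by total bits, highest first."""
    return sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)

def find_hits(ranked, rna):
    """Step 3: exact-match search; returns (rank, sequence, bits, position) tuples."""
    hits = []
    for rank, (seq, bits) in enumerate(ranked, start=1):
        pos = rna.find(seq)  # first occurrence only; -1 if absent
        if pos != -1:
            hits.append((rank, seq, bits, pos))
    return hits

rna_of_interest = "GGCAUGAUCCA"  # made-up RNA sequence
ranked = rank_candidates(enumerate_candidates(toy_logo))
hits = find_hits(ranked, rna_of_interest)

for rank, seq, bits, pos in hits[:5]:
    print(rank, seq, round(bits, 2), pos)
```

With a 4-position logo this is instant, but the candidate dictionary grows exponentially with logo length, which is exactly the part I am unsure about for the real 45-nt logo.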
My question: is this approach valid, or is there a more standard way of approaching this problem in the field?
If there are R packages out there for this I would appreciate recommendations.