So I have this matrix for some TF
A C G T
0.182685 0.047046 0.764736 0.005533
0.621105 0.000331 0.378534 0.000030
0.000282 0.000176 0.999153 0.000388
0.000150 0.000389 0.115783 0.883678
0.008410 0.014332 0.028947 0.948312
0.000897 0.979369 0.001293 0.018441
0.899443 0.003277 0.091986 0.005294
0.108200 0.260626 0.048650 0.582524
0.171072 0.360522 0.074511 0.393895
0.187325 0.069370 0.728580 0.014725
0.621060 0.000420 0.378429 0.000090
0.000211 0.000070 0.999542 0.000176
0.000063 0.000221 0.079208 0.920508
0.008522 0.012670 0.040214 0.938594
0.000147 0.963162 0.001536 0.035155
0.897135 0.003258 0.096065 0.003543
For what I want to do, I would like to extract all the possible sequences from this matrix given a frequency threshold.
For example, if I have:
0.897135 0.003258 0.096065 0.003543
I want to have A
but if I have
0.621105 0.000331 0.378534 0.000030
I want to have A or G.
and if no frequency match the threshold for a position, I would like to have N.
The purpose is to store all this possible sequences in file, and then to grep a set of sequences for that file.
I don't know how to do this. Maybe it would be easier to just take a consensus with extended base symbol like this: RRGKTCRNNRRGTTCR
and translate this to have all the possible sequences.
As it require more than four lines of code, this is beyond my actual capacities.
Do you have a solution? Thank you all!
Please specify your thresholds precisely. It's basically like this (0.3 treshold):
Not that elegant and will include empty space at the end of line (removable by editing the last printf).
Threshold would be 0.1.
This is a first step, however, I still need to format the output to linear sequences.
it looks like a PWM. what is your ultimate goal ?
Yes, it's a PWN (from a transfac file)
My goal is not to search my set of sequence for any sequences that match the consensus, but to do the reverse, to search for occurrences of "canonical" sequences derived from the consensus.
The goal is to see if specific sequences are found under the peaks that activate a subset of genes, and if theses groups of genes have a similar function.
For example, is GAGTTCA found in up regulated genes that are responsible for calcium transport whereas GGGTTCA is found in up-regulated genes that are involved in glucose uptake ?