Question

Extracting sequences from frequency matrix

0

Entering edit mode

9.0 years ago

giroudpaul ▴ 70

So I have this matrix for some TF

A           C           G           T
0.182685    0.047046    0.764736    0.005533
0.621105    0.000331    0.378534    0.000030
0.000282    0.000176    0.999153    0.000388
0.000150    0.000389    0.115783    0.883678
0.008410    0.014332    0.028947    0.948312
0.000897    0.979369    0.001293    0.018441
0.899443    0.003277    0.091986    0.005294
0.108200    0.260626    0.048650    0.582524
0.171072    0.360522    0.074511    0.393895
0.187325    0.069370    0.728580    0.014725
0.621060    0.000420    0.378429    0.000090
0.000211    0.000070    0.999542    0.000176
0.000063    0.000221    0.079208    0.920508
0.008522    0.012670    0.040214    0.938594
0.000147    0.963162    0.001536    0.035155
0.897135    0.003258    0.096065    0.003543

For what I want to do, I would like to extract all the possible sequences from this matrix given a frequency threshold.

For example, if I have:

0.897135    0.003258    0.096065    0.003543

I want to have A

but if I have

0.621105    0.000331    0.378534    0.000030

I want to have A or G.

and if no frequency match the threshold for a position, I would like to have N.

The purpose is to store all this possible sequences in file, and then to grep a set of sequences for that file.

I don't know how to do this. Maybe it would be easier to just take a consensus with extended base symbol like this: RRGKTCRNNRRGTTCR

and translate this to have all the possible sequences.

As it require more than four lines of code, this is beyond my actual capacities.

Do you have a solution? Thank you all!

motif matrix sequence • 1.5k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 9.0 years ago by giroudpaul ▴ 70

0

Entering edit mode

Please specify your thresholds precisely. It's basically like this (0.3 treshold):

awk -F '\t' '{if($1>=0.3)printf "%s ","A"} {if($2>=0.3)printf "%s ","C"} {if($3>=0.3)printf "%s ","G"} {if($4>=0.3)printf "%s ","T"} {printf "%s\n",""}' inpuFile

G
A G
G
T
T
C
A
T
C T
G
A G
G
T
T
C
A

Not that elegant and will include empty space at the end of line (removable by editing the last printf).

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.0 years ago by 5heikki 11k

0

Entering edit mode

Threshold would be 0.1.

This is a first step, however, I still need to format the output to linear sequences.

ADD REPLY • link 9.0 years ago by giroudpaul ▴ 70

0

Entering edit mode

it looks like a PWM. what is your ultimate goal ?

ADD REPLY • link 9.0 years ago by GouthamAtla 12k

0

Entering edit mode

Yes, it's a PWN (from a transfac file)

My goal is not to search my set of sequence for any sequences that match the consensus, but to do the reverse, to search for occurrences of "canonical" sequences derived from the consensus.

The goal is to see if specific sequences are found under the peaks that activate a subset of genes, and if theses groups of genes have a similar function.

For example, is GAGTTCA found in up regulated genes that are responsible for calcium transport whereas GGGTTCA is found in up-regulated genes that are involved in glucose uptake ?

ADD REPLY • link 9.0 years ago by giroudpaul ▴ 70

Ram · Accepted Answer · 2015-11-26

You can use also R and change the threshold as you need:

df <- structure(list(A = c(0.182685, 0.621105, 0.000282, 0.00015, 0.00841
), C = c(0.047046, 0.000331, 0.000176, 0.000389, 0.014332), G = c(0.764736, 
0.378534, 0.999153, 0.115783, 0.028947), T = c(0.005533, 3e-05, 
0.000388, 0.002, 0.948312)), .Names = c("A", "C", "G", "T"), row.names = c(NA, 
-5L), class = "data.frame")
# data
df
# function
hit <- function(x,ts){
  if(!any(x >= ts)){ return("N")}else{
    c("A","C","G","T")[which(x >= ts)] 
  }}
# get the sequence
paste(unlist(apply(df, 1, hit, 0.3)), collapse = "")
[1] "GAGGNT"