Extracting sequences from frequency matrix
1
0
Entering edit mode
9.0 years ago
giroudpaul ▴ 70

So I have this matrix for some TF

A           C           G           T
0.182685    0.047046    0.764736    0.005533
0.621105    0.000331    0.378534    0.000030
0.000282    0.000176    0.999153    0.000388
0.000150    0.000389    0.115783    0.883678
0.008410    0.014332    0.028947    0.948312
0.000897    0.979369    0.001293    0.018441
0.899443    0.003277    0.091986    0.005294
0.108200    0.260626    0.048650    0.582524
0.171072    0.360522    0.074511    0.393895
0.187325    0.069370    0.728580    0.014725
0.621060    0.000420    0.378429    0.000090
0.000211    0.000070    0.999542    0.000176
0.000063    0.000221    0.079208    0.920508
0.008522    0.012670    0.040214    0.938594
0.000147    0.963162    0.001536    0.035155
0.897135    0.003258    0.096065    0.003543

For what I want to do, I would like to extract all the possible sequences from this matrix given a frequency threshold.

For example, if I have:

0.897135    0.003258    0.096065    0.003543

I want to have A

but if I have

0.621105    0.000331    0.378534    0.000030

I want to have A or G.

and if no frequency match the threshold for a position, I would like to have N.

The purpose is to store all this possible sequences in file, and then to grep a set of sequences for that file.

I don't know how to do this. Maybe it would be easier to just take a consensus with extended base symbol like this: RRGKTCRNNRRGTTCR

and translate this to have all the possible sequences.

As it require more than four lines of code, this is beyond my actual capacities.

Do you have a solution? Thank you all!

motif matrix sequence • 1.5k views
ADD COMMENT
0
Entering edit mode

Please specify your thresholds precisely. It's basically like this (0.3 treshold):

awk -F '\t' '{if($1>=0.3)printf "%s ","A"} {if($2>=0.3)printf "%s ","C"} {if($3>=0.3)printf "%s ","G"} {if($4>=0.3)printf "%s ","T"} {printf "%s\n",""}' inpuFile

G
A G
G
T
T
C
A
T
C T
G
A G
G
T
T
C
A

Not that elegant and will include empty space at the end of line (removable by editing the last printf).

ADD REPLY
0
Entering edit mode

Threshold would be 0.1.

This is a first step, however, I still need to format the output to linear sequences.

ADD REPLY
0
Entering edit mode

it looks like a PWM. what is your ultimate goal ?

ADD REPLY
0
Entering edit mode

Yes, it's a PWN (from a transfac file)

My goal is not to search my set of sequence for any sequences that match the consensus, but to do the reverse, to search for occurrences of "canonical" sequences derived from the consensus.

The goal is to see if specific sequences are found under the peaks that activate a subset of genes, and if theses groups of genes have a similar function.

For example, is GAGTTCA found in up regulated genes that are responsible for calcium transport whereas GGGTTCA is found in up-regulated genes that are involved in glucose uptake ?

ADD REPLY
1
Entering edit mode
9.0 years ago
Jimbou ▴ 960

You can use also R and change the threshold as you need:

df <- structure(list(A = c(0.182685, 0.621105, 0.000282, 0.00015, 0.00841
), C = c(0.047046, 0.000331, 0.000176, 0.000389, 0.014332), G = c(0.764736, 
0.378534, 0.999153, 0.115783, 0.028947), T = c(0.005533, 3e-05, 
0.000388, 0.002, 0.948312)), .Names = c("A", "C", "G", "T"), row.names = c(NA, 
-5L), class = "data.frame")
# data
df
# function
hit <- function(x,ts){
  if(!any(x >= ts)){ return("N")}else{
    c("A","C","G","T")[which(x >= ts)] 
  }}
# get the sequence
paste(unlist(apply(df, 1, hit, 0.3)), collapse = "")
[1] "GAGGNT"
ADD COMMENT

Login before adding your answer.

Traffic: 2313 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6