Question

transcription site search tools in bacterial gene upstreams

6.4 years ago

natasha.sernova ★ 4.0k

Dear all,

I have a few hundred bacterial genomes.

I would like to find some transcription factor binding sites.

I use an old approach for this: I build a PWM (position weight matrix) and use a home-made script to search for suitable sites in the upstream regions of bacterial genes.

My current output for each predicted site has the following fields:

genome_id# gene_id# trans_site_position# trans_site_weight# trans_site_nucl_sequence# trans_site_name#
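
For readers unfamiliar with this kind of scan, here is a minimal Python sketch of it. It is not the home-made script from the question: the pseudocount, the uniform background, the log2-odds scoring, the example sites, and all function names are assumptions made purely for illustration.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=0.5, background=0.25):
    """Build a log2-odds PWM from equal-length aligned sites (assumed scheme)."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for column in zip(*sites):
        pwm.append({
            b: math.log2((column.count(b) + pseudocount) / total / background)
            for b in BASES
        })
    return pwm

def scan(seq, pwm, threshold):
    """Yield (position, weight, site) for every window scoring >= threshold."""
    width = len(pwm)
    for pos in range(len(seq) - width + 1):
        window = seq[pos:pos + width]
        if any(b not in BASES for b in window):
            continue  # skip windows containing N or other ambiguity codes
        weight = sum(col[b] for col, b in zip(pwm, window))
        if weight >= threshold:
            yield pos, weight, window

def format_hit(genome_id, gene_id, pos, weight, site, name):
    """Format one hit in the '#'-separated layout shown above."""
    return f"{genome_id}# {gene_id}# {pos}# {weight:.2f}# {site}# {name}#"

# Hypothetical toy data: four aligned sites and one short upstream region.
sites = ["TTGACA", "TTGACA", "TTGATA", "TTTACA"]
pwm = build_pwm(sites)
hits = list(scan("CCCTTGACAGGG", pwm, threshold=5.0))
for pos, weight, site in hits:
    print(format_hit("genome_1", "gene_A", pos, weight, site, "demo_site"))
```

Positions here are 0-based offsets within the upstream sequence passed in. Converting them to a distance from the gene start depends on how the upstream region was cut out of the *.gb file.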

The post below contains some relatively recent references on PWMs:

PWM matching algorithm

This is an old approach.

My question is: what other tools exist now, and how can I run them on my *.gb files?

Thank you very much for your help!

Natasha

gene upstream bacterial genome TranscFactor • 1.4k views

ADD COMMENT • link updated 6.4 years ago by Asaf 10k • written 6.4 years ago by natasha.sernova ★ 4.0k

score 2 · Accepted Answer · 2019-04-03

6.4 years ago

Asaf 10k

There are several approaches to motif finding, ranging from regular expressions to neural networks, with PWMs and HMMs in the middle (and perhaps SVM classifiers too). There is a tradeoff between the model's ability to represent a complex motif and the amount of data needed to build it. To define a regular expression you will probably need only a handful of sequences; you will need a few more to define a useful PWM, more still to train a profile HMM, and a lot to train an SVM or an NN.

With TFBSs it comes down to the amount of data you have and the complexity of the binding site. Usually a PWM will work. An HMM can model dependencies between adjacent positions, which might be useful; whether it is worth it depends on training data availability and biological reasoning.
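
To make the low-data end of that tradeoff concrete, here is a minimal Python sketch that turns a handful of aligned sites into a regular expression. The sites, the test sequence, and the function name are hypothetical, and the one-character-class-per-column construction is just one simple way to do it.

```python
import re

def consensus_regex(sites):
    """Turn equal-length aligned sites into a regex, one character class per column."""
    parts = []
    for column in zip(*sites):
        letters = sorted(set(column))
        if len(letters) == 1:
            parts.append(letters[0])
        else:
            parts.append("[" + "".join(letters) + "]")
    return "".join(parts)

# Hypothetical toy data: three aligned sites are enough to define the pattern.
sites = ["TTGACA", "TTGATA", "TTTACA"]
pattern = consensus_regex(sites)  # "TT[GT]A[CT]A"
matches = [(m.start(), m.group())
           for m in re.finditer(pattern, "CCTTGACAGGTTTATACC")]
print(pattern, matches)
```

Note that the second match, TTTATA, is not one of the three input sites. The regex accepts any combination of the letters seen in each column and gives every match the same rank, which is the limitation a PWM addresses by weighting each base at each position.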

ADD COMMENT • link 6.4 years ago by Asaf 10k

Dear Asaf, Many thanks for your comment!

I have 200 bacteria. Roughly every 3-5 of them are close relatives, and a single PWM works perfectly for such a group: small site distance (<100 nucleotides) and high binding site weight (>5.0).

But the next group needs a different PWM, since its output from the first matrix shows a larger site location distance (>100) and smaller site weights (about 4.5).

A third group may fit the first PWM again, but it is impossible to predict such behavior beforehand. The worst result is a distance > 200 and a binding site weight < 4.0. That is the signal that I have to change my PWM.
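
The manual check described above could be automated along these lines. This is only a sketch: the two cutoffs come from the comment, but the function names, the input layout, and the "review" label for everything in between are assumptions.

```python
def assess_pwm_fit(distance, weight):
    """Classify a genome's best hit using the cutoffs quoted in the comment above."""
    if distance < 100 and weight > 5.0:
        return "good"        # the current PWM fits this genome
    if distance > 200 and weight < 4.0:
        return "change_pwm"  # the worst case named in the comment
    return "review"          # in between: assumed to need a closer look

def flag_genomes(best_hits):
    """Map genome_id -> verdict; best_hits maps genome_id -> (distance, weight)."""
    return {genome: assess_pwm_fit(distance, weight)
            for genome, (distance, weight) in best_hits.items()}

# Hypothetical best hits for three genomes.
verdicts = flag_genomes({
    "genome_1": (40, 6.2),
    "genome_2": (150, 4.5),
    "genome_3": (250, 3.1),
})
print(verdicts)
```

Running this over the existing output files would at least list which genomes need a new matrix, without inspecting each one by hand.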

I only have my home-made tool for that, and it is definitely not enough.

I would rather not do this check manually anymore; it usually takes too much time and effort.

Could you PLEASE recommend some articles and software that deal with this problem? It smells like a job for an NN to me, but I may be wrong.

Many-many THANKS!!

Natasha

ADD REPLY • link 6.4 years ago by natasha.sernova ★ 4.0k

How related are those bacteria? You are dealing with a lot of uncertainty here: you're not even sure there is a TFBS where you're looking. Maybe, if you have a list of genes in each bacterium, you can run MEME on each bacterium to get a PWM and then compare the binding sites or the PWM weights.
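
One simple way to compare two PWMs of the same width is sketched below in Python. It assumes the matrices are already aligned column to column and are given as per-position base frequencies. The averaged Pearson correlation is just one possible similarity measure, and all names and numbers are hypothetical.

```python
import math

BASES = "ACGT"

def column_correlation(col_a, col_b):
    """Pearson correlation between two PWM columns over A, C, G, T."""
    xs = [col_a[b] for b in BASES]
    ys = [col_b[b] for b in BASES]
    mean_x, mean_y = sum(xs) / 4, sum(ys) / 4
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat column has no base preference; treat as uncorrelated
    return cov / math.sqrt(var_x * var_y)

def pwm_similarity(pwm_a, pwm_b):
    """Average column correlation of two equal-width, pre-aligned PWMs."""
    if len(pwm_a) != len(pwm_b):
        raise ValueError("PWMs must have the same width")
    return sum(column_correlation(a, b)
               for a, b in zip(pwm_a, pwm_b)) / len(pwm_a)

# Hypothetical two-column frequency matrices.
pwm_1 = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
         {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}]
pwm_2 = list(reversed(pwm_1))  # same columns in the opposite order

same = pwm_similarity(pwm_1, pwm_1)
different = pwm_similarity(pwm_1, pwm_2)
print(same, different)
```

Values near 1 suggest two genomes can share a matrix, while low or negative values suggest they need separate ones. Motifs of different widths or with shifted positions would have to be aligned before this comparison makes sense.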

ADD REPLY • link 6.4 years ago by Asaf 10k

Thank you, I will try MEME.

Do you know of other tools that produce a PWM?

Addition:

I should have found this post earlier, sorry!

How can I create a more accurate PWM?

MEME is really useful, and so is the whole right-hand panel of the post above.

Converting motif databases from meme suite to other formats

And here is another post, where I wrote an answer myself:

Is there any paper about motif finding based on PWM on genome sequence?