Question

How to find sequence patterns in a genome?

0

9.5 years ago

Parham ★ 1.6k

Hi,

I want to find a sequence pattern in a genome, for example (G4N(1-10))5: four guanines followed by 1 to 10 bases of any kind (A, T, G or C), with that whole unit repeated five times.

I have a FASTA file for the organism I work with and basic knowledge of Python and regex. Is there a package or library that does this, or should I write the whole code myself? Initially I only want to know how many times the pattern occurs in the reference sequence, but later it would be useful to know the start and stop positions as well.

Thanks for help in advance!

pattern genome • 5.6k views

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 9.5 years ago by Parham ★ 1.6k

Ram · Answer 1 · 2016-01-22

2

9.5 years ago

GouthamAtla 12k

Here is a simple template script that prints the coordinates of the matching pattern in BED format. It finds the pattern only once, so you could extend it further.

https://gist.github.com/gouthamatla/066f3607b5f96012b4dc
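
For readers who cannot open the gist, the same idea can be sketched with the Python standard library alone. This is not the gist's code, just a minimal illustration under a few assumptions: the FASTA file is plain text, only the forward strand is scanned, and the names `G4_PATTERN`, `read_fasta`, `find_matches` and `scan_fasta` are made up for this example.

```python
import re
from typing import Iterator, List, Tuple

# (G4N(1-10))5 from the question: four Gs plus 1-10 bases of any kind,
# with that unit repeated five times. IGNORECASE also covers soft-masked
# (lowercase) sequence.
G4_PATTERN = re.compile(r"(?:G{4}[ACGT]{1,10}){5}", re.IGNORECASE)


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (record_name, sequence) pairs from a plain-text FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                # Keep only the first word of the header as the record name.
                parts = line[1:].split()
                name, chunks = (parts[0] if parts else ""), []
            elif line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_matches(seq: str, pattern=G4_PATTERN) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) spans of non-overlapping matches.

    The spans are 0-based and half-open, which is also the BED convention.
    """
    for match in pattern.finditer(seq):
        yield match.start(), match.end()


def scan_fasta(path: str, pattern=G4_PATTERN) -> List[Tuple[str, int, int]]:
    """Return BED-style (chrom, start, end) tuples for every match in the file."""
    return [
        (name, start, end)
        for name, seq in read_fasta(path)
        for start, end in find_matches(seq, pattern)
    ]
```

The length of the list returned by `scan_fasta` gives the number of hits, and printing each tuple tab-separated gives BED lines. Note that `finditer` reports non-overlapping matches only, so overlapping occurrences are counted once.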

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by GouthamAtla 12k

0

Thanks for sharing it.

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by Parham ★ 1.6k

Ram · Answer 2 · 2016-01-23

1

9.5 years ago

Michael 56k

FIMO seems to fit; it can take regexes and weight matrices (WM) as input.

ADD COMMENT • link 9.5 years ago by Michael 56k

0

I cannot figure out how it uses regex. Every regex construct I try turns the motif red, which means it's not accepted. I am trying to make a motif of three Gs, then up to 20 nucleotides of any kind (ACGT), and then three Gs again.

But it seems I cannot write something like G{3}N{1,20}G{3}. Do you know what I am missing?

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by Parham ★ 1.6k

1

Maybe nothing. I think FIMO doesn't support extended POSIX expressions.

Dreg from EMBOSS is a command-line program that supports PCRE expressions; otherwise Perl, Python and PHP all provide extended regular expressions. Here is a web server.

Note that dreg and standard PCRE do not know about ambiguity codes, so you have to write [ACGT] if you want to match all nucleotides, or [ACGTNYRW ...] or simply . if your sequence itself contains ambiguity codes.
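
If you prefer to keep writing motifs with ambiguity codes, one option is to expand them into character classes before handing the pattern to a regex engine. The following is a rough sketch in Python, not part of dreg or any other tool mentioned here; the helper name `iupac_to_regex` is made up, and it assumes the input contains only uppercase IUPAC letters plus quantifiers and parentheses (no existing [...] classes or backslash escapes, which it would mangle).

```python
import re

# Standard IUPAC nucleotide ambiguity codes and the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(pattern: str) -> str:
    """Replace each uppercase IUPAC letter with an explicit character class.

    Every other character (digits, braces, commas, parentheses) is copied
    through unchanged, so quantifiers such as {1,20} keep working.
    """
    out = []
    for ch in pattern:
        bases = IUPAC.get(ch)
        if bases is None:
            out.append(ch)                  # not an IUPAC letter: keep as-is
        elif len(bases) == 1:
            out.append(bases)               # unambiguous base
        else:
            out.append("[" + bases + "]")   # ambiguity code -> character class
    return "".join(out)


# The motif from the reply above, written with N for "any base".
expanded = iupac_to_regex("G{3}N{1,20}G{3}")
hit = re.search(expanded, "TTGGGACGTGGGTT")
```

Here `expanded` is `G{3}[ACGT]{1,20}G{3}`, which matches the test sequence, whereas the unexpanded pattern would only match a literal `N` character in the sequence.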

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by Michael 56k

0

Dreg is great. I used the command line version and it does what I need. Thanks for recommending it.

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by Parham ★ 1.6k

0

Regexes are explicit: 'N' doesn't mean any nucleotide, it means the character N. Try '[ACGT]' instead.

ADD REPLY • link 9.5 years ago by harold.smith.tarheel ★ 5.0k

0

Indeed, FIMO can interpret IUPAC ambiguity codes correctly, while PCRE-based programs cannot. FIMO doesn't support the {1,20} occurrence-range quantifiers of PCRE, though.

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by Michael 56k

0

No, I didn't use N as a regex. According to the FIMO manual, N stands for any base. Cheers!

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by Parham ★ 1.6k

score 0 · Answer 3 · 2021-12-02

The pattern-matching tool offered by the Saccharomyces Genome Database (SGD) and other genome sites is based on PatMatch.

SGD has a nice, concise guide to the syntax for PatMatch patterns. PatMatch patterns allow N, X or . to stand for any residue or base, and are thus more familiar to biologists than regular expressions. PatMatch also allows IUPAC ambiguity codes.

You can run the PatMatch software yourself. I have a GitHub repository from which you can easily launch environments, served via the MyBinder.org service, with PatMatch already installed. The launched sessions include several notebooks demonstrating how to use it with any genome sequence you can provide, as well as how to combine PatMatch results with Python for downstream analysis. Go to my patmatch-binder repo, click on the launch binder badge, and work through the Jupyter notebooks once the session launches.