Question

Motif search on a DNA sequence in a gff file (ipdSummary output)

0

Entering edit mode

5.1 years ago

gabri ▴ 60

Hi All,

I used ipdSummary to identify the methylated adenines in my genome assembly. I got a huge gff file. In the seventh column there is a string called "context=nucleotide sequence". In each of these nucleotide sequences, the twenty-first base is always the methylated adenine.

Example:

s1  Cl  64  64  -   .   cv=80;context=ACGTTCTATGAATCCAATTG**A**CCGGTAATGCTTTCGAAGTC;ip=2
s1  Cl  68  68  -   .   cv=87;context=ACGTTCTATGAATCCACCGG**A**TTGATAATGCTTTCGAAGTC;ip=1
s1  Cl  71  71  +   .   cv=91;context=ACGTTCTATGAATCCACCTC**A**AGTTTAATGCTTTCGAAGTC;ip=3
s1  Cl  71  71  -   .   cv=76;context=ACGTTCTATGAATCCACCCA**A**GGTTTAATGCTTTCGAAGTC;ip=2
s1  Cl  74  74  +   .   cv=96;context=AACTATTGGACCAACGATGG**A**GGCCGTAGGTCTTAGTGTGT;ip=2
s1  Cl  74  74  -   .   cv=83;context=AACTATTGGACCAACGATGG**A**CCTCGTAGGTCTTAGTGTGT;ip=2
s1  Cl  76  76  -   .   cv=89;context=AACTATTGGACCAACGAAAA**A**AAAAGTAGGTCTTAGTGTGT;ip=2

I need to search and extract specific motifs in which the methylated adenine is included, possibly adding the motif in a new column to the same gff.

Motif list:

Can be the first or the fifth A, not the middle ones
ATTGA
ATCGA
ATAGA
AATGA
AACGA
AAAGA
AGTGA
AGCGA
AGAGA
ACTGA
ACCGA
ACAGA
1---5

Can be the second or the third A
CAAG
-23-

Can be only the second A
GAGG
-2--

Can be only the first A
ACCT
1---

No motif found
NoMotif

Output:

s1  Cl  64  64  -   .   cv=80;context=ACGTTCTATGAATCCA**ATTGA**CCGGTAATGCTTTCGAAGTC;ip=2    ATTGA
s1  Cl  68  68  -   .   cv=87;context=ACGTTCTATGAATCCACCGG**ATTGA**TAATGCTTTCGAAGTC;ip=1    ATTGA
s1  Cl  71  71  +   .   cv=91;context=ACGTTCTATGAATCCACCT**CAAG**TTTAATGCTTTCGAAGTC;ip=3    CAAG
s1  Cl  71  71  -   .   cv=76;context=ACGTTCTATGAATCCACC**CAAG**GTTTAATGCTTTCGAAGTC;ip=2    CAAG
s1  Cl  74  74  +   .   cv=96;context=AACTATTGGACCAACGATG**GAGG**CCGTAGGTCTTAGTGTGT;ip=2    GAGG
s1  Cl  74  74  -   .   cv=83;context=AACTATTGGACCAACGATGT**ACCT**CGTAGGTCTTAGTGTGT;ip=2    ACCT
s1  Cl  76  76  -   .   cv=89;context=AACTATTGGACCAACG**AAAAAAAAA**GTAGGTCTTAGTGTGT;ip=2    NoMotif

Thank you!

gff motif ipdSummary bash python • 1.5k views

ADD COMMENT • link updated 5.1 years ago by Joe 22k • written 5.1 years ago by gabri ▴ 60

score 3 · Accepted Answer · 2020-07-03

This was a pretty fun one, if a little messy.

Try this:

#!/usr/bin/env python

import sys
import re
from itertools import chain

# This is not strictly necessary, but helps break the regex up over multiple lines
# so it's a bit easier to read:
a___a = "((\*\*)?A(\*\*)?[ATGC][TAC]G(\*\*)?A(\*\*)?)"
_aa_ = "(C(\*\*)?A(\*\*)?(\*\*)?A(\*\*)?G)"
_a__ = "(G\*\*A\*\*GG)"
a___ = "((\*\*)A(\*\*)CCT)"
regex = re.compile("|".join([a___a, _aa_, _a__, a___]))

with open(sys.argv[1], 'r') as fh:
    for line in fh:
        match = [x for x in list(chain(*regex.findall(line))) if "**A**" in x]
        print(line.rstrip("\n") + "\t" + (str(match[0]).replace('*','') if match else "NoMotif"))

There's an annoying amount of filtering to do because findall returns as list of tuples with a load of nonsense in them, but it works. Due to making all the **A**s optional, the regex does also match a string in a few examples with no ** - hence the need for additional filtering, but I think this is probably worthwhile to keep the regexes a bit more simple. There's probably a more elegant solution within the regex, but this is already at my limit pretty much!

$ python methylmotifs.py input.txt

s1  Cl  64  64  -   .   cv=80;context=ACGTTCTATGAATCCAATTG**A**CCGGTAATGCTTTCGAAGTC;ip=2    ATTGA
s1  Cl  68  68  -   .   cv=87;context=ACGTTCTATGAATCCACCGG**A**TTGATAATGCTTTCGAAGTC;ip=1    ATTGA
s1  Cl  71  71  +   .   cv=91;context=ACGTTCTATGAATCCACCTC**A**AGTTTAATGCTTTCGAAGTC;ip=3    CAAG
s1  Cl  71  71  -   .   cv=76;context=ACGTTCTATGAATCCACCCA**A**GGTTTAATGCTTTCGAAGTC;ip=2    CAAG
s1  Cl  74  74  +   .   cv=96;context=AACTATTGGACCAACGATGG**A**GGCCGTAGGTCTTAGTGTGT;ip=2    GAGG
s1  Cl  74  74  -   .   cv=83;context=AACTATTGGACCAACGATGG**A**CCTCGTAGGTCTTAGTGTGT;ip=2    ACCT
s1  Cl  76  76  -   .   cv=89;context=AACTATTGGACCAACGAAAA**A**AAAAGTAGGTCTTAGTGTGT;ip=2    NoMotif

I haven't gone as far as substituting in the **STRING** back in to the sequence field, but that might be a fun little task for you to try too.

At the moment this code will break/ignore cases where there are more than one real matches in any given sequence though. I assume based of your input data that isn't likely to ever be the case, but be aware. Wouldnt be difficult to modify the print statement though.