Hi All,
I used ipdSummary to identify the methylated adenines in my genome assembly and got a huge GFF file. The seventh column contains a field of the form "context=nucleotide sequence". In each of these nucleotide sequences, the twenty-first base is always the methylated adenine.
Example:
s1 Cl 64 64 - . cv=80;context=ACGTTCTATGAATCCAATTG**A**CCGGTAATGCTTTCGAAGTC;ip=2
s1 Cl 68 68 - . cv=87;context=ACGTTCTATGAATCCACCGG**A**TTGATAATGCTTTCGAAGTC;ip=1
s1 Cl 71 71 + . cv=91;context=ACGTTCTATGAATCCACCTC**A**AGTTTAATGCTTTCGAAGTC;ip=3
s1 Cl 71 71 - . cv=76;context=ACGTTCTATGAATCCACCCA**A**GGTTTAATGCTTTCGAAGTC;ip=2
s1 Cl 74 74 + . cv=96;context=AACTATTGGACCAACGATGG**A**GGCCGTAGGTCTTAGTGTGT;ip=2
s1 Cl 74 74 - . cv=83;context=AACTATTGGACCAACGATGG**A**CCTCGTAGGTCTTAGTGTGT;ip=2
s1 Cl 76 76 - . cv=89;context=AACTATTGGACCAACGAAAA**A**AAAAGTAGGTCTTAGTGTGT;ip=2
I need to search for and extract specific motifs that include the methylated adenine, ideally adding the matched motif as a new column in the same GFF.
Motif list:
The methylated A can be the first or the fifth A, not the middle ones:
ATTGA
ATCGA
ATAGA
AATGA
AACGA
AAAGA
AGTGA
AGCGA
AGAGA
ACTGA
ACCGA
ACAGA
1---5
The methylated A can be the second or the third A:
CAAG
-23-
The methylated A can only be the second A:
GAGG
-2--
The methylated A can only be the first A:
ACCT
1---
If no motif is found:
NoMotif
Desired output:
s1 Cl 64 64 - . cv=80;context=ACGTTCTATGAATCCA**ATTGA**CCGGTAATGCTTTCGAAGTC;ip=2 ATTGA
s1 Cl 68 68 - . cv=87;context=ACGTTCTATGAATCCACCGG**ATTGA**TAATGCTTTCGAAGTC;ip=1 ATTGA
s1 Cl 71 71 + . cv=91;context=ACGTTCTATGAATCCACCT**CAAG**TTTAATGCTTTCGAAGTC;ip=3 CAAG
s1 Cl 71 71 - . cv=76;context=ACGTTCTATGAATCCACC**CAAG**GTTTAATGCTTTCGAAGTC;ip=2 CAAG
s1 Cl 74 74 + . cv=96;context=AACTATTGGACCAACGATG**GAGG**CCGTAGGTCTTAGTGTGT;ip=2 GAGG
s1 Cl 74 74 - . cv=83;context=AACTATTGGACCAACGATGG**ACCT**CGTAGGTCTTAGTGTGT;ip=2 ACCT
s1 Cl 76 76 - . cv=89;context=AACTATTGGACCAACG**AAAAAAAAA**GTAGGTCTTAGTGTGT;ip=2 NoMotif
Thank you!
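The rules above can be sketched in Python roughly as follows. This is only one possible approach under a few assumptions: the context is 41 bases long with the methylated A at index 20 (0-based), the `**` markers are display formatting and are stripped before matching, the twelve five-base motifs are summarised as `A[ACGT][ACT]GA`, and if several motifs fit the same site they are all reported, comma-separated. The names `find_motifs`, `annotate_line` and `annotate_file` are made up for this example.

```python
import re

CENTER = 20  # 0-based index of the methylated A (assumed 41-base context)

# Each rule: (motif pattern, allowed 0-based positions of the methylated A
# inside the motif). A[ACGT][ACT]GA is shorthand for the twelve 5-mers listed.
RULES = [
    (re.compile(r"A[ACGT][ACT]GA"), (0, 4)),  # first or fifth A
    (re.compile(r"CAAG"), (1, 2)),            # second or third base
    (re.compile(r"GAGG"), (1,)),              # second base only
    (re.compile(r"ACCT"), (0,)),              # first base only
]

CONTEXT_RE = re.compile(r"context=([^;\s]+)")


def find_motifs(context):
    """Return the motif(s) covering the central A, or 'NoMotif'."""
    seq = context.replace("*", "").upper()  # drop ** markers if present
    if len(seq) <= CENTER:
        return "NoMotif"
    hits = []
    for pattern, offsets in RULES:
        for offset in offsets:
            # Pattern.match(string, pos) only matches starting exactly at pos,
            # so the methylated A lands at the required motif position.
            m = pattern.match(seq, CENTER - offset)
            if m and m.group(0) not in hits:
                hits.append(m.group(0))
    return ",".join(hits) if hits else "NoMotif"


def annotate_line(line):
    """Append the motif label as an extra tab-separated column."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return line  # leave blank lines and GFF header lines untouched
    m = CONTEXT_RE.search(line)
    label = find_motifs(m.group(1)) if m else "NoMotif"
    return f"{line}\t{label}"


def annotate_file(in_path, out_path):
    """Stream a GFF file line by line so large files stay out of memory."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for raw in src:
            dst.write(annotate_line(raw) + "\n")
```

Calling something like `annotate_file("modifications.gff", "modifications.motifs.gff")` (file names are placeholders) would write a copy of the file with the extra column. Anchoring each match at a fixed distance from index 20 is what keeps the "middle" As of a motif from counting.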
Wow thank you so much for the help! It works! You really made my day. I will test it on much larger datasets. Thank you again!
Please do. Just be aware that designing robust regexes is a bit of a dark art, so scrutinise your output data very carefully; I know that regex is not as optimal as it perhaps could be.
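One way to scrutinise the output is an independent sanity check that does not reuse the matching regex. The sketch below assumes a 41-base context with the methylated A at index 20 (0-based) and a label column holding a single motif or `NoMotif`. The names `label_is_plausible` and `summarise` are hypothetical. It only checks that the reported motif really occurs in the context and overlaps the central A, so it is a weaker test than the full rule set.

```python
from collections import Counter

CENTER = 20  # 0-based index of the methylated A (assumed 41-base context)


def label_is_plausible(context, label):
    """True if the label is NoMotif or occurs in the context over the centre."""
    seq = context.replace("*", "").upper()  # drop ** markers if present
    if len(seq) != 41 or seq[CENTER] != "A":
        return False  # unexpected context; worth a manual look
    if label == "NoMotif":
        return True
    # Every start position at which the label would still cover index 20.
    first_start = max(0, CENTER - len(label) + 1)
    return any(seq.startswith(label, start)
               for start in range(first_start, CENTER + 1))


def summarise(rows):
    """rows: iterable of (context, label). Returns (label counts, suspects)."""
    counts = Counter()
    suspects = []
    for context, label in rows:
        counts[label] += 1
        if not label_is_plausible(context, label):
            suspects.append((context, label))
    return counts, suspects
```

The label counts give a quick feel for whether the motif frequencies look sensible, and anything in the suspects list deserves a look by eye.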
If the answer solves your question, don't forget to toggle it as Accepted, by the way.