Hello everyone! I have text output from NsiteM containing promoter analysis results. The output follows a regular pattern:
1. QUERY: Gene1
Length of Query Sequence: 500 bp | Nucleotide Frequencies: A - 0.37 G - 0.15 T - 0.27 C - 0.21
TFBS AC: RSP00204//OS: arabidopsis (Arabidopsis thaliana) /GENE: AtEm6/TFBS: ABRE/6.2 /BF: ABI5
Found in 5 (100.00 %) SEQs (out of 5)
Motifs on "+" Strand: Mean Exp. Number 0.00232 Found 1
393 GACACGTGtC 402 (Mism.= 1)
TFBS AC: RSP00219//OS: arabidopsis (Arabidopsis thaliana) /GENE: RBCS-1A/TFBS: G box-1 /BF: HY5; Arabidopsis bZIP protein - transcriptionl factor
Found in 5 (100.00 %) SEQs (out of 5)
Motifs on "+" Strand: Mean Exp. Number 0.00799 Found 1
392 gGACACGTGtCA 403 (Mism.= 2)
Totally 2 motifs of 2 different TFBSs have been found
------------------------------------------------------------
2. QUERY: Gene2
Length of Query Sequence: 500 bp | Nucleotide Frequencies: A - 0.37 G - 0.17 T - 0.26 C - 0.19
TFBS AC: RSP00073//OS: tobacco (Nicotiana tabacum) /GENE: synthetic oligonucleotides/TFBS: PA /BF: TAF-1
Found in 3 ( 60.00 %) SEQs (out of 5)
Motifs on "+" Strand: Mean Exp. Number 0.00239 Found 1
299 GCCACGTGGC 308 (Mism.= 0)
Motifs on "-" Strand: Mean Exp. Number 0.00239 Found 1
308 GCCACGTGGC 299 (Mism.= 0)
TFBS AC: RSP00153//OS: Parsley, Petroselinum crispum /GENE: CHS/TFBS: Box II /BF: CPRF-1; CPRF-2; CPRF-3;
Found in 3 ( 60.00 %) SEQs (out of 5)
Motifs on "+" Strand: Mean Exp. Number 0.00258 Found 1
300 CCACGTGGCa 309 (Mism.= 1)
Motifs on "-" Strand: Mean Exp. Number 0.00221 Found 1
307 CCACGTGGCa 298 (Mism.= 1)
Totally 4 motifs of 2 different TFBSs have been found
------------------------------------------------------------
etc.
My question is how to extract the data on cis-regulatory elements from the text above into a table like this:
QUERY | TFBS AC | Strand | Mean Exp. Number |
--------- | ---------- | -------- | ------------------ |
Gene1 | RSP00204 | + | 0.00232 |
Gene1 | RSP00219 | + | 0.00799 |
Gene2 | RSP00073 | + | 0.00239 |
Gene2 | RSP00073 | - | 0.00239 |
Gene2 | RSP00153 | + | 0.00258 |
Gene2 | RSP00153 | - | 0.00221 |
Is it better to learn Python scripting to solve this kind of problem, or can it be done with sed, for example? Or are there existing tools that already do this?
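To make the goal concrete, here is a minimal Python sketch of one possible approach. It walks through the output line by line, remembers the most recent QUERY and TFBS AC, and emits one row for each "Motifs on ... Strand" line. The function name `parse_nsitem` and the three regular expressions are my own assumptions derived from the sample above, not from NsiteM documentation, so they may need adjusting for other outputs.

```python
import re

# Patterns inferred from the sample output only (assumptions, not an
# official description of the NsiteM format).
QUERY_RE = re.compile(r'^\s*\d+\.\s*QUERY:\s*(\S+)')
TFBS_RE = re.compile(r'^\s*TFBS AC:\s*(\w+)//')
MOTIF_RE = re.compile(
    r'Motifs on "([+-])" Strand:\s*Mean Exp\. Number\s+([\d.]+)'
)


def parse_nsitem(lines):
    """Return (query, tfbs_ac, strand, mean_exp_number) tuples.

    `lines` is any iterable of text lines, e.g. an open file.
    Values are kept as strings so the original formatting is preserved.
    """
    rows = []
    query = None
    tfbs = None
    for line in lines:
        m = QUERY_RE.match(line)
        if m:
            # A new query block starts; forget the previous TFBS.
            query, tfbs = m.group(1), None
            continue
        m = TFBS_RE.match(line)
        if m:
            tfbs = m.group(1)
            continue
        m = MOTIF_RE.search(line)
        if m and query is not None and tfbs is not None:
            rows.append((query, tfbs, m.group(1), m.group(2)))
    return rows
```

Each returned row can then be printed with something like `print("\t".join(row))` to get a tab-separated table, or passed to the `csv` module.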
Thank you in advance!