4.1 years ago
venura
Hi,
I have a document (about 100 pages long; only two records are shown below) containing the following output from NSite (Softberry):
QUERY: STBZIP38
Length of Query Sequence: 2000 bp | Nucleotide Frequencies: A - 0.34 G - 0.16 T - 0.35 C - 0.15
TFBS AC: RSP00073//OS: tobacco (Nicotiana tabacum) /GENE: synthetic oligonucleotides/TFBS: PA /BF: TAF-1
Motifs on "+" Strand: Mean Exp. Number 0.00391 Up.Conf.Int. 1 Found 1
421 tCCACGTGGC 430 (Mism.= 1)
Motifs on "-" Strand: Mean Exp. Number 0.00391 Up.Conf.Int. 1 Found 1
430 GCCACGTGGa 421 (Mism.= 1)
TFBS AC: RSP00153//OS: Parsley, Petroselinum crispum /GENE: CHS/TFBS: Box II /BF: CPRF-1; CPRF-2; CPRF-3;
Motifs on "+" Strand: Mean Exp. Number 0.00358 Up.Conf.Int. 1 Found 1
422 CCACGTGGCa 431 (Mism.= 1)
TFBS AC: RSP00154//OS: parsley (Petroselinum crispum) /GENE: CHS/TFBS: ACE (CHS) /BF: bZIP factors CPRF1, CPRF4
Motifs on "+" Strand: Mean Exp. Number 0.00358 Up.Conf.Int. 1 Found 1
422 CCACGTGGCa 431 (Mism.= 1)
Totally 50 motifs of 43 different TFBSs have been found
____________________________________________________________
QUERY: STBZIP17
Length of Query Sequence: 2000 bp | Nucleotide Frequencies: A - 0.37 G - 0.13 T - 0.39 C - 0.11
TFBS AC: RSP00577//OS: tomato (Lycopersicon esculentum), Lycopersicon esculentum /GENE: rbcS3A/TFBS: AT-rich FF2 /BF: unknown nuclear factor
Motifs on "-" Strand: Mean Exp. Number 0.00187 Up.Conf.Int. 1 Found 1
206 AATAATTAaAcATTAATTAA 187 (Mism.= 2)
TFBS AC: RSP00797//OS: potato (Solanum tuberosum) /GENE: patatin 21/TFBS: SURE-1 /BF: SURF
Motifs on "-" Strand: Mean Exp. Number 0.00440 Up.Conf.Int. 1 Found 1
1027 TAAAGAATAaAAAAAaaAA 1009 (Mism.= 3)
TFBS AC: RSP00864//OS: arabidopsis (Arabidopsis thaliana) /GENE: STK/TFBS: GA-5 /BF: BPC1
Motifs on "-" Strand: Mean Exp. Number 0.00260 Up.Conf.Int. 1 Found 1
1966 AGAGAGAGA 1958 (Mism.= 0)
The output I want is as follows:
STBZIP38 RSP00073//OS
STBZIP38 RSP00153//OS
STBZIP38 RSP00154//OS
STBZIP17 RSP00577//OS
STBZIP17 RSP00797//OS
STBZIP17 RSP00864//OS
First I tried regular expressions in Python, with help from folks at StackOverflow. The attempt below pairs each query with its TFBS accessions by hard-coded index, so covering the whole file would need very long code, especially since the number of TFBS entries differs from one query to another.
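import re
# s is assumed to hold the whole NSite output as a single string
pat1 = r'STB\w+'
pat2 = r'RSP\d+//OS'
m = re.findall(pat1, s)
n = re.findall(pat2, s)
#print(m, n)
# pairing by hard-coded index only works for this two-query excerpt
print(m[0], n[0])
print(m[0], n[1])
print(m[0], n[2])
print(m[1], n[3])
print(m[1], n[4])
print(m[1], n[5])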
I would really appreciate it if someone could help me come up with either a Python or a Bash script. Thanks in advance.
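One possible way to avoid the hard-coded indices is to read the output line by line and remember the most recent QUERY name, so each TFBS accession is paired with whichever query it appears under. The following is only a minimal sketch: the function name `pair_queries_with_tfbs` and the shortened sample text are made up for illustration, and it assumes every record starts with a `QUERY:` line and every accession line starts with `TFBS AC:` as in the excerpt above.

```python
import re

# Patterns assume the line layout shown in the excerpt above.
QUERY_RE = re.compile(r"^QUERY:\s*(\S+)")
TFBS_RE = re.compile(r"^TFBS AC:\s*(RSP\d+//OS)")


def pair_queries_with_tfbs(lines):
    """Return (query, accession) tuples in the order they appear.

    Hypothetical helper: it tracks the most recent QUERY line and
    attaches every following TFBS AC line to it.
    """
    pairs = []
    current_query = None
    for line in lines:
        line = line.strip()
        query_match = QUERY_RE.match(line)
        if query_match:
            current_query = query_match.group(1)
            continue
        tfbs_match = TFBS_RE.match(line)
        # Skip accessions that appear before any QUERY line.
        if tfbs_match and current_query is not None:
            pairs.append((current_query, tfbs_match.group(1)))
    return pairs


# Shortened sample built from the excerpt; real input would be the full file.
sample = """QUERY: STBZIP38
TFBS AC: RSP00073//OS: tobacco (Nicotiana tabacum) /GENE: synthetic oligonucleotides/TFBS: PA /BF: TAF-1
421 tCCACGTGGC 430 (Mism.= 1)
TFBS AC: RSP00153//OS: Parsley, Petroselinum crispum /GENE: CHS/TFBS: Box II /BF: CPRF-1; CPRF-2; CPRF-3;
____________________________________________________________
QUERY: STBZIP17
TFBS AC: RSP00577//OS: tomato (Lycopersicon esculentum), Lycopersicon esculentum /GENE: rbcS3A/TFBS: AT-rich FF2 /BF: unknown nuclear factor
"""

for query, accession in pair_queries_with_tfbs(sample.splitlines()):
    print(query, accession)
```

For the full 100-page document, the same function could be fed an open file object instead of `sample.splitlines()`, since it only iterates over lines. Because the pairing depends on line order rather than on counting matches, it should not matter how many TFBS entries each query has.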