Bulk extraction of cis-elements (TFBS) from Softberry output
4.1 years ago
venura ▴ 70

Hi,

I have a document (about 100 pages long; only two query blocks are shown below) containing the following output from NSite (Softberry):

    QUERY: STBZIP38
     Length of Query Sequence:       2000 bp     | Nucleotide Frequencies:  A -  0.34   G -  0.16   T -  0.35   C -  0.15


     TFBS AC: RSP00073//OS: tobacco (Nicotiana tabacum) /GENE: synthetic oligonucleotides/TFBS: PA /BF: TAF-1
     Motifs on "+" Strand: Mean Exp. Number   0.00391     Up.Conf.Int.  1     Found   1
         421  tCCACGTGGC      430 (Mism.= 1)

     Motifs on "-" Strand: Mean Exp. Number   0.00391     Up.Conf.Int.  1     Found   1
         430  GCCACGTGGa      421 (Mism.= 1)

     TFBS AC: RSP00153//OS: Parsley, Petroselinum crispum /GENE: CHS/TFBS: Box II /BF: CPRF-1; CPRF-2; CPRF-3;
     Motifs on "+" Strand: Mean Exp. Number   0.00358     Up.Conf.Int.  1     Found   1
         422  CCACGTGGCa      431 (Mism.= 1)

     TFBS AC: RSP00154//OS: parsley (Petroselinum crispum) /GENE: CHS/TFBS: ACE (CHS) /BF: bZIP factors CPRF1, CPRF4
     Motifs on "+" Strand: Mean Exp. Number   0.00358     Up.Conf.Int.  1     Found   1
         422  CCACGTGGCa      431 (Mism.= 1)
Totally      50 motifs of    43 different TFBSs have been found
____________________________________________________________

 QUERY: STBZIP17
 Length of Query Sequence:       2000 bp     | Nucleotide Frequencies:  A -  0.37   G -  0.13   T -  0.39   C -  0.11


 TFBS AC: RSP00577//OS: tomato (Lycopersicon esculentum), Lycopersicon esculentum /GENE: rbcS3A/TFBS: AT-rich FF2 /BF: unknown nuclear factor
 Motifs on "-" Strand: Mean Exp. Number   0.00187     Up.Conf.Int.  1     Found   1
     206  AATAATTAaAcATTAATTAA      187 (Mism.= 2)

 TFBS AC: RSP00797//OS: potato (Solanum tuberosum) /GENE: patatin 21/TFBS: SURE-1 /BF: SURF
 Motifs on "-" Strand: Mean Exp. Number   0.00440     Up.Conf.Int.  1     Found   1
    1027  TAAAGAATAaAAAAAaaAA     1009 (Mism.= 3)

 TFBS AC: RSP00864//OS: arabidopsis (Arabidopsis thaliana) /GENE: STK/TFBS: GA-5 /BF: BPC1
 Motifs on "-" Strand: Mean Exp. Number   0.00260     Up.Conf.Int.  1     Found   1
    1966  AGAGAGAGA     1958 (Mism.= 0)

The output I want is as follows:

STBZIP38    RSP00073//OS
STBZIP38    RSP00153//OS
STBZIP38    RSP00154//OS
STBZIP17    RSP00577//OS
STBZIP17    RSP00797//OS
STBZIP17    RSP00864//OS

First I tried regular expressions in Python, with help from folks at StackOverflow. Done this way, though, the overall task would need very long code, especially since the number of TFBS entries differs from one query to another.

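import re

# s holds the whole NSite output as one string
pat1 = r'STB.*\d*'
pat2 = r'RSP.*OS'

m = re.findall(pat1, s)
n = re.findall(pat2, s)

#print(m, n)

# every query/TFBS pair has to be indexed by hand
print(m[0], n[0])
print(m[0], n[1])
print(m[0], n[2])
print(m[1], n[3])
print(m[1], n[4])
print(m[1], n[5])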

I would really appreciate it if someone could help me come up with either a Python or a bash script. Thanks in advance.

python bash • 765 views
4.1 years ago
venura ▴ 70

Figured it out. I am adding the code below in case someone wants to do the same.

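with open('Softberry.txt') as f:
    query = None
    for line in f:
        # strip() tolerates the varying indentation in the NSite output
        line = line.strip()
        if line.startswith('QUERY:'):
            query = line.split(':', 1)[1].strip()
        elif line.startswith('TFBS AC:'):
            # keep the text between 'AC:' and the next colon, e.g. RSP00073//OS
            ac = line.split('AC:', 1)[1].split(':')[0].strip()
            print(query, ac)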
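
If the pairs are needed for something other than printing, the same line-by-line idea can be wrapped in a generator. This is only a sketch: the name `iter_query_ac` and the two regular expressions are my own, and they assume every block looks like the sample in the question (a `QUERY:` line followed by `TFBS AC:` lines, with any amount of leading whitespace).

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumed line shapes, taken from the sample output in the question:
#   "QUERY: STBZIP38"
#   "TFBS AC: RSP00073//OS: tobacco (Nicotiana tabacum) /GENE: ..."
QUERY_RE = re.compile(r"^\s*QUERY:\s*(\S+)")
AC_RE = re.compile(r"^\s*TFBS AC:\s*([^:]+):")


def iter_query_ac(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (query, accession) pairs in the order they appear."""
    query = None
    for line in lines:
        m = QUERY_RE.match(line)
        if m:
            # remember the current query until the next QUERY: line
            query = m.group(1)
            continue
        m = AC_RE.match(line)
        if m and query is not None:
            yield query, m.group(1).strip()


# Small demonstration on lines copied from the question
sample = [
    "    QUERY: STBZIP38",
    "     TFBS AC: RSP00073//OS: tobacco (Nicotiana tabacum) /GENE: synthetic oligonucleotides/TFBS: PA /BF: TAF-1",
    " QUERY: STBZIP17",
    " TFBS AC: RSP00864//OS: arabidopsis (Arabidopsis thaliana) /GENE: STK/TFBS: GA-5 /BF: BPC1",
]
for query, ac in iter_query_ac(sample):
    print(query, ac, sep="\t")
```

An open file handle can be passed in place of `sample`, since the function only needs an iterable of lines.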