Question

Enrichment Analysis Of Transcription Factors

2

Entering edit mode

13.0 years ago

Assa Yeroslaviz ★ 1.9k

Hi everybody,

I have a huge text file in this form:

Inspecting sequence ID   chrI:846767-847266

V$ELK1_01             |      254 (+) |  1.000 |  0.885 | aaaacaGGAAGcagga
I$UBX_01              |      142 (+) |  1.000 |  0.992 | ggggcgtTAATGggttact
I$KR_01               |      150 (+) |  1.000 |  0.975 | aatGGGTTac
I$HB_01               |      478 (+) |  1.000 |  0.992 | gcaaAAAAAa
V$E2F_01              |      378 (-) |  1.000 |  0.846 | aatTTTTCgcccaaa
V$E2F_01              |      468 (-) |  1.000 |  0.835 | tttTTTTCgggcaaa
I$HSF_01              |      200 (+) |  1.000 |  1.000 | AGAAA
I$HSF_01              |      210 (+) |  1.000 |  1.000 | AGAAA
I$HSF_01              |      306 (-) |  1.000 |  1.000 | TTTCT
I$HSF_01              |      490 (-) |  1.000 |  1.000 | TTTCT
F$HSF_01              |       30 (+) |  1.000 |  1.000 | AGAAC

Inspecting sequence ID   chrI:1344697-1345196

V$ATF_01               |      157 (+) |  1.000 |  0.976 | ttgTGACGtcagca
V$ATF_01               |      157 (-) |  1.000 |  0.963 | ttgtgaCGTCAgca
V$ATF_01               |      326 (+) |  1.000 |  0.967 | tgcTGACGtcacat
V$ATF_01               |      326 (-) |  1.000 |  0.977 | tgctgaCGTCAcat
I$HSF_01               |      150 (+) |  1.000 |  1.000 | AGAAA
I$HSF_01               |      213 (-) |  1.000 |  1.000 | TTTCT
I$HSF_01               |      343 (-) |  1.000 |  1.000 | TTTCT
F$HSF_01               |      174 (-) |  1.000 |  1.000 | GTTCT
F$HSF_01               |      274 (-) |  1.000 |  1.000 | GTTCT
V$CREBP1CJUN_01        |      160 (+) |  1.000 |  1.000 | tGACGTca
V$CREBP1CJUN_01        |      160 (-) |  1.000 |  1.000 | tgACGTCa

Inspecting sequence ID   chrI:2689476-2689975

I$HB_01                |      368 (+) |  1.000 |  0.984 | ccaaAAAAAa
V$RSRFC4_01            |      254 (+) |  1.000 |  0.906 | agtCTATTtttaattt
I$HSF_01               |        3 (-) |  1.000 |  1.000 | TTTCT
I$HSF_01               |       34 (+) |  1.000 |  1.000 | AGAAA
I$HSF_01               |       96 (+) |  1.000 |  1.000 | AGAAA
V$COMP1_01             |       77 (+) |  1.000 |  0.866 | ccacttGATTGacggaaggagaaa
V$PAX6_01              |      153 (-) |  1.000 |  0.760 | ttcccagcattCGTGAacgtt
V$GFI1_01              |      270 (+) |  1.000 |  0.994 | ttttttcaAATCAcagcaactgag

...

For each of the chromosomal positions I would like to count the occurrences for each of the motifs on the left side.

The results should be something like:

chrI:846767-847266:
V$ELK1_01 -  1
I$UBX_01 -  1
V$E2F_01 - 2
I$HSF_01 - 4 
...

I would like to count the occurrences for each of the motifs in each of the positions and calculate these enrichment of these occurrences against the complete list of motifs.

I wrote a perl script to split the big file into separate single files for each position and a second script to count the occurrences in each of the files ( this one needs to be ran separately for each of the single position files (which is unfortunately very ineffective).

Is there a way to parse this huge txt file in R to calculate the numbers for motifs per position?

I would appreciate any help you can give me.

thanks

Assa

parsing • 3.3k views

ADD COMMENT • link updated 13.0 years ago by Woa ★ 2.9k • written 13.0 years ago by Assa Yeroslaviz ★ 1.9k

0

Entering edit mode

Am I right to assume this is an output you created from Transfac Match? How did you get it in tab-delimited text like the above format?

ADD REPLY • link 13.0 years ago by UnivStudent ▴ 440

0

Entering edit mode

Yes, this is the command line output of match. A tab-delimited format was easy enough with a text editor.

ADD REPLY • link 13.0 years ago by Assa Yeroslaviz ★ 1.9k

0

Entering edit mode

Thanks! I'll try it on the command line, was using the online client and it gives a poorly formatted output.

ADD REPLY • link 13.0 years ago by UnivStudent ▴ 440

score 3 · Answer 1 · 2012-07-24

in perl:

 #!/usr/bin/perl
 # countMotifs.pl

 use strict;
 use warnings;
 my %counts = ();

 while (<>) {
       chomp;
       if (m/^Inspecting sequence ID\s+(.+)/) {
             my $id = $1;
             while (my ($motif, $total) = each %counts) {
                  print "$motif - $total\n";
             }
             print $id, ":\n";
             %counts = (); # restart counts
       }
       elsif (m/(.+?)\s/) {
            $counts{$1}++;
       }
       else {
             # empty lines
        }
  }

Then you can parse line per line your file, the time depends how fast do you read the file (a few seconds in many cases).

perl countMotifs.pl < raw.data > counts.txt

and can work with compressed files:

gunzip -c raw.data.gz | perl countMotifs.pl > counts.txt

score 2 · Answer 2 · 2012-07-24

for kicks, in awk :)

$ awk '{
            if ($1 == "Inspecting") {
                if (header) {
                    print header; 
                    for (i in counts) 
                        print i" - "counts[i] 
                    delete counts; 
                }
                header=$4;
            } else if ($1 != "") {
                counts[$1]++
            }
          }; 
          END {
                  print header; 
                  for (i in counts) 
                      print i" - "counts[i]
         }' in_file.txt > out_file.txt

or python:

import sys

def gen_block(stream):
    block = []
    header = stream.readline().strip().split()[-1]                                                                 
    for line in stream:                                                                                            
        line = line.strip()                                                                                        
        if line.startswith('Inspecting'):                                                                          
            yield header, block                                                                                    
            header = line.split()[-1]                                                                              
            block = []                                                                                             
        elif len(line):                                                                                            
            block.append(line.split()[0].strip())                                                                  
    yield header, block

def count_by_block(filename=None):                                                                                 
    o_file = sys.stdin if not filename else open(filename)                                                         
    for header, block in gen_block(o_file):                                                                        
        print header                                                                                               
        for c in set(block): 
            print c, '-', block.count(c)                                                                           

if __name__ == '__main__':
    count_by_block((sys.argv[1] if len(sys.argv) == 2 else None))

in either, you may use zcat or gzip -cd to stream in a compressed file.

score 1 · Answer 3 · 2012-07-24

Here is a pretty dirty python script you can use on your huge text file (not on the single files):

import sys

inFile = open(sys.argv[1],'r')

block = []
header = inFile.next().strip().split()[-1]
for line in inFile:
    if line.strip() != '':
        data = line.strip().split()
        if data[0] == "Inspecting":
            print header
            count = {}
            for d in block:
                if not count.has_key(d[0]):
                    count[d[0]] = 1
                else:
                    count[d[0]] += 1

            for motif, num in count.items():
                print motif + "\t" + str(num)
            block = []
            header = data[-1]
        else:
            block.append(data)

print header
count = {}
for d in block:
    if not count.has_key(d[0]):
        count[d[0]] = 1
    else:
        count[d[0]] += 1

for motif, num in count.items():
    print motif + "\t" + str(num)

save as yourName.py and use it by: python yourName.py yourHugeTextFile.txt

score 1 · Answer 4 · 2012-07-25

Another Perl hack, but as this 'slurps' the input file for a big file it may crash or become too slow

use strict;
use warnings;
my $pattern="Inspecting sequence ID   ";
my $filename="mydata.txt";#Big data file
my @chunks=();
open FH,$filename||die($!);
{
    local $/ = undef;
    @chunks = split(/$pattern/m, <FH>);
}
close FH||die($!);
shift @chunks;
foreach my $i(@chunks){
    my @lines=grep{defined $_}split(/\n+/,$i);
    my $chr=shift(@lines);
    my %motif_count=();
    map{my ($motif)=split(/\s+/,$_);$motif_count{$motif}++;}@lines;
    print $chr,"\n";
    map{print $_," - ",$motif_count{$_},"\n";}keys %motif_count;
    print "\n";
}

Output:

chrI:846767-847266
I$HB_01 - 1
I$UBX_01 - 1
I$KR_01 - 1
V$ELK1_01 - 1
V$E2F_01 - 2
F$HSF_01 - 1
I$HSF_01 - 4

chrI:1344697-1345196
V$ATF_01 - 4
V$CREBP1CJUN_01 - 2
F$HSF_01 - 2
I$HSF_01 - 3

chrI:2689476-2689975
I$HB_01 - 1
V$GFI1_01 - 1
V$COMP1_01 - 1
V$RSRFC4_01 - 1
V$PAX6_01 - 1
I$HSF_01 - 3