Hi everybody,
I have a huge text file in this form:
Inspecting sequence ID chrI:846767-847266
V$ELK1_01 | 254 (+) | 1.000 | 0.885 | aaaacaGGAAGcagga
I$UBX_01 | 142 (+) | 1.000 | 0.992 | ggggcgtTAATGggttact
I$KR_01 | 150 (+) | 1.000 | 0.975 | aatGGGTTac
I$HB_01 | 478 (+) | 1.000 | 0.992 | gcaaAAAAAa
V$E2F_01 | 378 (-) | 1.000 | 0.846 | aatTTTTCgcccaaa
V$E2F_01 | 468 (-) | 1.000 | 0.835 | tttTTTTCgggcaaa
I$HSF_01 | 200 (+) | 1.000 | 1.000 | AGAAA
I$HSF_01 | 210 (+) | 1.000 | 1.000 | AGAAA
I$HSF_01 | 306 (-) | 1.000 | 1.000 | TTTCT
I$HSF_01 | 490 (-) | 1.000 | 1.000 | TTTCT
F$HSF_01 | 30 (+) | 1.000 | 1.000 | AGAAC
Inspecting sequence ID chrI:1344697-1345196
V$ATF_01 | 157 (+) | 1.000 | 0.976 | ttgTGACGtcagca
V$ATF_01 | 157 (-) | 1.000 | 0.963 | ttgtgaCGTCAgca
V$ATF_01 | 326 (+) | 1.000 | 0.967 | tgcTGACGtcacat
V$ATF_01 | 326 (-) | 1.000 | 0.977 | tgctgaCGTCAcat
I$HSF_01 | 150 (+) | 1.000 | 1.000 | AGAAA
I$HSF_01 | 213 (-) | 1.000 | 1.000 | TTTCT
I$HSF_01 | 343 (-) | 1.000 | 1.000 | TTTCT
F$HSF_01 | 174 (-) | 1.000 | 1.000 | GTTCT
F$HSF_01 | 274 (-) | 1.000 | 1.000 | GTTCT
V$CREBP1CJUN_01 | 160 (+) | 1.000 | 1.000 | tGACGTca
V$CREBP1CJUN_01 | 160 (-) | 1.000 | 1.000 | tgACGTCa
Inspecting sequence ID chrI:2689476-2689975
I$HB_01 | 368 (+) | 1.000 | 0.984 | ccaaAAAAAa
V$RSRFC4_01 | 254 (+) | 1.000 | 0.906 | agtCTATTtttaattt
I$HSF_01 | 3 (-) | 1.000 | 1.000 | TTTCT
I$HSF_01 | 34 (+) | 1.000 | 1.000 | AGAAA
I$HSF_01 | 96 (+) | 1.000 | 1.000 | AGAAA
V$COMP1_01 | 77 (+) | 1.000 | 0.866 | ccacttGATTGacggaaggagaaa
V$PAX6_01 | 153 (-) | 1.000 | 0.760 | ttcccagcattCGTGAacgtt
V$GFI1_01 | 270 (+) | 1.000 | 0.994 | ttttttcaAATCAcagcaactgag
...
For each of the chromosomal positions I would like to count the occurrences for each of the motifs on the left side.
The results should be something like:
chrI:846767-847266:
V$ELK1_01 - 1
I$UBX_01 - 1
V$E2F_01 - 2
I$HSF_01 - 4
...
I would like to count the occurrences of each motif at each position and then calculate the enrichment of these occurrences against the complete list of motifs.
I wrote a Perl script to split the big file into a separate file for each position, and a second script to count the occurrences in each of those files. The second script has to be run separately for every single-position file, which is unfortunately very inefficient.
Is there a way to parse this huge txt file in R to calculate the numbers for motifs per position?
I would appreciate any help you can give me.
thanks
Assa
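For what it's worth, the splitting step does not seem to be necessary: the counts can be collected in a single pass over the file, because every "Inspecting sequence ID" line simply starts a new position. Below is a minimal sketch of that idea in Python rather than R (the same logic should carry over to R). It assumes the layout shown above, i.e. the header prefix "Inspecting sequence ID" and " | " as the field separator. The function names count_motifs and total_counts are made up for illustration, and the sample is just an excerpt of the lines posted above.

```python
from collections import Counter

# Assumption: every region starts with a line carrying this prefix,
# exactly as in the excerpt posted in the question.
HEADER = "Inspecting sequence ID"


def count_motifs(lines):
    """Return {region: Counter({motif: hits})} from Match-style lines."""
    counts = {}
    current = None  # region the following hit lines belong to
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(HEADER):
            # New region: everything after the prefix is the region ID.
            current = line[len(HEADER):].strip()
            counts.setdefault(current, Counter())
        elif current is not None and "|" in line:
            # Hit line: the motif name is the first "|"-separated field.
            motif = line.split("|", 1)[0].strip()
            counts[current][motif] += 1
    return counts


def total_counts(counts):
    """Sum the per-region counters into one background counter."""
    total = Counter()
    for region_counts in counts.values():
        total.update(region_counts)
    return total


# Excerpt of the data from the question, used only as a demonstration.
SAMPLE = """\
Inspecting sequence ID chrI:846767-847266
V$ELK1_01 | 254 (+) | 1.000 | 0.885 | aaaacaGGAAGcagga
I$UBX_01 | 142 (+) | 1.000 | 0.992 | ggggcgtTAATGggttact
I$KR_01 | 150 (+) | 1.000 | 0.975 | aatGGGTTac
I$HB_01 | 478 (+) | 1.000 | 0.992 | gcaaAAAAAa
V$E2F_01 | 378 (-) | 1.000 | 0.846 | aatTTTTCgcccaaa
V$E2F_01 | 468 (-) | 1.000 | 0.835 | tttTTTTCgggcaaa
I$HSF_01 | 200 (+) | 1.000 | 1.000 | AGAAA
I$HSF_01 | 210 (+) | 1.000 | 1.000 | AGAAA
I$HSF_01 | 306 (-) | 1.000 | 1.000 | TTTCT
I$HSF_01 | 490 (-) | 1.000 | 1.000 | TTTCT
F$HSF_01 | 30 (+) | 1.000 | 1.000 | AGAAC
Inspecting sequence ID chrI:1344697-1345196
V$ATF_01 | 157 (+) | 1.000 | 0.976 | ttgTGACGtcagca
V$ATF_01 | 157 (-) | 1.000 | 0.963 | ttgtgaCGTCAgca
V$ATF_01 | 326 (+) | 1.000 | 0.967 | tgcTGACGtcacat
V$ATF_01 | 326 (-) | 1.000 | 0.977 | tgctgaCGTCAcat
I$HSF_01 | 150 (+) | 1.000 | 1.000 | AGAAA
"""

counts = count_motifs(SAMPLE.splitlines())
totals = total_counts(counts)

# Print the counts in the layout asked for in the question.
for region, motif_counts in counts.items():
    print(region + ":")
    for motif, n in motif_counts.items():
        print(motif, "-", n)
```

For the real file, passing the open file handle instead of SAMPLE.splitlines() reads it line by line, so the whole file never has to sit in memory. The summed totals are only a starting point for the enrichment step; how enrichment should be defined against the complete motif list is left open here.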
Am I right to assume this is an output you created from Transfac Match? How did you get it in tab-delimited text like the above format?
Yes, this is the command line output of match. A tab-delimited format was easy enough with a text editor.
Thanks! I'll try it on the command line, was using the online client and it gives a poorly formatted output.