How To Find Polya Signals Infromation For Chimp Genome?
2
1
Entering edit mode
14.4 years ago
Jaydon ▴ 40

How can I construct a polyA (e.g AATAAA)signal coordinate file and view them in genome browser?

Thanks

• 3.1k views
ADD COMMENT
0
Entering edit mode

Do you already have the coordinates for the polyAs? or are you asking for a way to calculate them?

ADD REPLY
0
Entering edit mode

No, I just want to create a PolyA track using those polyA signals in quick and dirty way.

ADD REPLY
3
Entering edit mode
14.4 years ago

OK, this is not really a answer to your question as I will use a perfect match algorithm (the signal will be either full of 'A' OR 'T'), but I found it fun to solve this problem using GNU Flex, the "Fast Lexical Analyzer". This following simple lex file just define a regex of for a polyA/T signal of at least 20 bases.

%{

#include <ctype.h>
#include <stdio.h>
#include <string.h>
int offset=0;
char name[BUFSIZ];
%}
%option noyywrap

POLYA ([aA][aA\n ]{20,}|[tT][tT\n]{20,})

%%
>.*\n   {
    offset=0;
    strncpy(name,&yytext[1],yyleng-2);
    }

{POLYA} {
    int i;
    int j=0;
    int len=0;
    for(i=0;i< yyleng;++i)
        {
        if(!isspace(yytext[i])) ++len;
        }
    printf("%s\t%d\t%d\t%d\n",name,offset,offset+len,len);
    offset+=len;
    }

.   {
    offset++;
    }
\n

%%

int main(int argc,char** argv)
    {
    name[0]=0;
    yylex();
    return 0;
    }

Compilation:

flex polya.l
gcc -O3 -Wall lex.yy.c

Example:

curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr1.fa.gz" |\
gunzip -c | ./a.out  | head -n 20
chr1    67037   67058   21
chr1    71996   72017   21
chr1    84687   84710   23
chr1    88955   88979   24
chr1    123928  123957  29
chr1    154536  154558  22
chr1    155150  155173  23
chr1    164765  164796  31
chr1    244723  244745  22
chr1    319122  319147  25
chr1    412423  412451  28
chr1    650294  650323  29
chr1    701083  701103  20
chr1    785421  785454  33
chr1    824267  824288  21
chr1    832781  832805  24
chr1    868769  868791  22
chr1    932989  933023  34
chr1    974010  974032  22
chr1    1027314 1027339 25
ADD COMMENT
0
Entering edit mode

This is not an easy problem to solve due to having to match the pattern in a huge target string. This solution is very neat.

ADD REPLY
0
Entering edit mode

Thanks for your input Pierre. It sounds good!

ADD REPLY
1
Entering edit mode
14.4 years ago

Here is another possible solution in python with the pyfasta module. It will find starting locations that initiate a sequence of As or Ts, but unlike Pierre's solution it will not print the maximal such length. A little more fiddling in the while loop would do that; for now let's keep it simple:

import re
from pyfasta import Fasta

fast = Fasta('chr1.fa')
chr1 = fast['chr1']
patt = re.compile('A{20,}' + '|' + 'T{20,}')
size = 20

stream = enumerate(chr1)

for index, base in stream:
    sub = chr1[index:index+size]
    if patt.match(sub.upper()):
        print 'chr1\t%s\t%s' % (index, index+size)
        while base.upper() in 'AT':
              index, base = stream.next()

PS: you can also modify the regular expression to match more cases

ialbert@borg ~/temp
$ python polya.py
chr1    67037   67057
chr1    71996   72016
chr1    84687   84707
chr1    88955   88975
chr1    123928  123948
chr1    154536  154556
chr1    155150  155170
chr1    164765  164785
chr1    244723  244743
ADD COMMENT
0
Entering edit mode

Thanks Istvan! I will try using your approach too.

ADD REPLY

Login before adding your answer.

Traffic: 2649 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6