Enriched Regions Search Program
4
6
Entering edit mode
14.7 years ago
Noyk ▴ 70

Hi all, I need to find, in a set of fasta sequences, sequence that has a region of 30 or more Gln and Asn residues per 80-residue window (Gln/Asn-Rich Tracts). Any programs I can use out there? There is a mention about patmatch at this forum, can it be adapted to my purpose?

enrichment sequence programming • 3.6k views
ADD COMMENT
3
Entering edit mode

you are not looking for a motif, you are looking for enrichment in regions of a sequence.

ADD REPLY
0
Entering edit mode

@noyk Care to change the title according to @giovanni's suggestion? Great question :)

ADD REPLY
0
Entering edit mode

thanks.. will try out your suggestions

ADD REPLY
0
Entering edit mode

remember to select the answer that you want to accept, to thank its author.

ADD REPLY
8
Entering edit mode
14.7 years ago
Neilfws 49k

Just for completeness, a BioRuby solution to count Q + N in 80-residue moving window.

require 'rubygems'
require 'bio'

Bio::FlatFile.foreach("myseqs.fa") do |sequence|
  sequence.seq.split("").each_cons(80) do |window|
    puts window.count("Q") + window.count("N")
  end
end

Here, each sequence is split into an array and the each_cons() array method provides the moving window - it returns nil if the sequence length is < 80.

ADD COMMENT
1
Entering edit mode

neat! I love looking at small code snippets

ADD REPLY
1
Entering edit mode

Cool. I dont use much ruby, and had never heard of each_cons. a decent summary: http://www.themomorohoax.com/2009/01/29/ruby-each-cons

Isn't that going to print the count for every possible 80 bp window even if it's zero?

ADD REPLY
0
Entering edit mode

Yes, this snippet would print the count for each window. You'd need to add a conditional (just another couple of lines), to print only if at least one of the window counts is >= 30.

ADD REPLY
6
Entering edit mode
14.7 years ago
brentp 24k

If you use numpy, it's easy and fast to take an 80bp window using convolve

The script below assumes you're giving it a protein fasta and prints out sequences that are "enriched", but if you need the exact position where the enrichment occurs within each sequence, you'll need to add a couple lines (lookup numpy.where). runs in a few seconds for a 4.5MB file.

from pyfasta import Fasta
import numpy as np

fasta = 'orf_trans.fasta'
window = 80
cutoff = 30

f = Fasta(fasta)
kern = np.ones((window,))

enriched = 0
for k in f.keys():
    seq = np.array(f[k], dtype='c')
    # get a boolean array with True where AA is one of 'Q', 'N'
    a = (seq == 'N') | (seq == 'Q')

    # take an 80bp moving window.
    mw = np.convolve(a, kern, mode='same')
    # if it doesn't have any windows > 30, skip it.
    if mw.max() < cutoff: continue
    enriched += 1
    # can also add something here to find the exact loc of the enrichement(s)
    print k.split()[0]

print "\n%s out of %i sequences are enriched" % (enriched , len(f.keys()))
ADD COMMENT
0
Entering edit mode

beautiful solution - numpy is amazing

ADD REPLY
4
Entering edit mode
14.7 years ago

Well, a very didactic python solution would be:

import re
seq = 'ACGGTCGIWYDEAJCFSHDKANNNNNQNQNQNQNQNDGASDGADJHDAGJDAQNNAQHKJHSHDDAAHHQNNQNQ'
score_by_position = {}
gln_or_asn_matcher = re.compile('[QN]')
step = 10 # replace it with 80 on the real seq
cutoff = 5 # replace it with 30

# note that this ignores all windows smaller than $step at the left and right of the sequence    
for pos in range(0, len(seq)-step):
    score = len(gln_or_asn_matcher.findall(seq[pos:pos+step]))
    score_by_position[pos] = score
    if score > cutoff:
        print 'possible enriched region at position %s; %s' % (pos, seq[pos:pos+step])

# requires matplotlib
from matplotlib.pylab import bar, show
bar(score_by_position.keys(), score_by_position.values())
xticks(range(len(seq)), seq)
show()

of course, there it should be other tools that do this and in a more secure way. you should test this script with other values before using it.

ADD COMMENT
0
Entering edit mode

Just a minor correction:

from matplotlib.pylab import bar, show, xticks

It is generally not good to import individual functions from a module.

You can do:

import matplotlib.pyplot as plt
plt.bar(score_by_position.keys(), score_by_position.values())
plt.xticks(range(len(seq)), seq)
plt.show()
ADD REPLY
0
Entering edit mode

I wonder whether getting all the match positions into a list first (for the whole sequence), then looking at 80 bp windows across these indices would be faster on large scale (handily these are now in sorted order), that would avoid repeatedly invoking the findall. I am on the road so I can't try this out, will see if I get to it.

ADD REPLY
3
Entering edit mode
14.7 years ago

Here is my java solution:

File 'SearchGlnAsn.java':

import java.util.LinkedList;

public class SearchGlnAsn
    {
    public static void main(String[] args) throws Exception
        {
        int c;
        int pos=0;
        int count_N_Q=0;
        LinkedList&lt;Boolean&gt; window=new LinkedList&lt;Boolean&gt;();
        StringBuilder name=new StringBuilder();
        while((c=System.in.read())!=-1)
            {
            if(c=='&gt;')
                {
                name.setLength(0);
                while((c=System.in.read())!='\n') name.append((char)c);
                pos=0;
                count_N_Q=0;
                window.clear();
                }
            else if(!Character.isWhitespace(c))
                {
                pos++;
                boolean isAA=(c=='N' || c=='Q' || c=='n' || c=='q');
                if(isAA) count_N_Q++;
                window.add(isAA);
                if(window.size()==80)
                    {
                    if(count_N_Q&gt;30)
                        {
                        System.out.println(name+"\t"+count_N_Q+"\t"+(pos-80));
                        }
                    if(window.remove()) count_N_Q--;
                    }

                }
            }

        }

    }

Compilation:

javac SearchGlnAsn.java

Execution:

curl -s ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz |\
gunzip -c |\
java SearchGlnAsn |\
awk -F '    ' '{if($2>75) print $0; }'

sp|Q156A1|ATX8_HUMAN Ataxin-8 OS=Homo sapiens GN=ATXN8 PE=1 SV=1    79  0
sp|Q54UG3|Y1095_DICDI UPF0746 protein DDB_G0281095 OS=Dictyostelium discoideum GN=DDB_G0281095 PE=3 SV=1    76  415

 curl "http://www.uniprot.org/uniprot/Q156A1.fasta"
>sp|Q156A1|ATX8_HUMAN Ataxin-8 OS=Homo sapiens GN=ATXN8 PE=1 SV=1
MQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
QQQQQQQQQQQQQQQQQQQQ
ADD COMMENT

Login before adding your answer.

Traffic: 926 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6