Searching promoter consensus in a gene
1
0
Entering edit mode
9.4 years ago
Ritvik ▴ 30

Hi,

Can anyone please tell me how to search for the following promoter consensus (published in a paper) in a set of genes:

-35 position                                         -10 position
   NGTGG                   N(14-18)                     NNGNNG

Should I just search for NGTGG and NNGNNG at -35 and -10 positions respectively or should I first search for NNGNNG at -10 then use a sliding window of N(14-18) and next search for the occurrence of NGTGG each time?

In short, are the consensus sequences found strictly at positions -35 and -10? If these positions are fixed, then how come -10 and -35 positions be spaced by a variable number of nucleotides?

promoter consensus • 2.8k views
ADD COMMENT
4
Entering edit mode
9.4 years ago

First, you could build an indexed set of FASTA files for all of your chromosomes using samtools.

Second, you can use tools in the BEDOPS suite (for example) along with awk to generate a BED file containing coordinates for regions upstream of genes ("promoters") and retrieve FASTA sequences of the upstream regions.

Third, you can use these sequences with a discovery tool like MEME or Homer to search for motifs, or WebLogo to generate sequence logos.

1) Say you are working with hg19. Create a work directory where your FASTA files and index files will go:

$ cd /foo/bar/baz
$ wget 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/*.fa.gz'
$ for fn in `ls *.fa.gz`; do gunzip $fn; done
$ for fn in `ls *.fa`; do samtools faidx $fn; done

2) Use tools like BEDOPS gff2bed and bedops to convert gene annotations to BED format, and retrieve regions up to 35 bases upstream of the gene start position. Your annotations file would come from GENCODE, RefSeq, etc.

For example:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \
    | gunzip --stdout - \
    | gff2bed - \
    | grep -w 'gene' - \
    | cut -f1-6 - \
    > genes.bed

You might even take a smaller subset of annotations, depending on your genes of interest, such as what is described in your publication of interest.

Then split the master list of gene elements by strand:

$ awk '$6=="+"' genes.bed > genes.for.bed
$ awk '$6=="-"' genes.bed > genes.rev.bed

Then do bedops --range operations in a stranded manner to get upstream promoter regions:

$ awk '{print $1"\t"$2"\t"($2+1)"\t"$4"\t"$5"\t"$6}' genes.for.bed | bedops --range -35:0 --everything - > genes.for.start.upstream.bed
$ awk '{print $1"\t"$3"\t"($3+1)"\t"$4"\t"$5"\t"$6}' genes.rev.bed | bedops --range 0:35 --everything - > genes.rev.start.upstream.bed

We need to do per-strand operations, because the start of a gene on a reverse-stranded annotation is usually at the stop position of a BED element. What is upstream of a reverse-stranded annotation is downstream of a forward-stranded element, and vice versa.

Now that we have the upstream regions, we can convert them to FASTA via a Perl script that calls samtools against the indexed reference sequence.

Here is an example of such a helper Perl script:

#!/usr/bin/env perl

use strict;
use warnings;

# fastaDir contains per-chromosome UCSC FASTA (.fa) files and samtools-indexed index (.fai) files
my $fastaDir = "/foo/bar/baz"; 

while (<STDIN>) {
    chomp;
    my ($chr, $start, $stop, $id, $score, $strand) = split("\t", $_);
    my $queryKey = "$chr:$start-$stop";
    my $queryResult = `samtools faidx $fastaDir/$chr.fa $queryKey`; chomp $queryResult;
    my @lines = split("\n", $queryResult);
    my @seqs = @lines[1..(scalar @lines - 1)];
    my $seq = join("", @seqs);
    if ($strand eq "-") { $seq = revdnacomp($seq); }
    my $header = ">".join(":",($chr, $start, $stop, $id, $score, $strand));
    print STDOUT $header."\n".uc($seq)."\n";
}

sub revdnacomp {
    my $dna = shift @_;
    my $revcomp = reverse($dna);
    $revcomp =~ tr/ACGTacgt/TGCAtgca/;
    return $revcomp;
}

You might run this script like so:

$ bed2fasta.pl < genes.for.start.upstream.bed > genes.for.start.upstream.bed.fa
$ bed2fasta.pl < genes.rev.start.upstream.bed > genes.rev.start.upstream.bed.fa

3) Now that you have stranded FASTA files for your sequences, you can run them through a motif discovery tool like MEME or Homer. Or you can quickly see if there are high-information consensus sequences with a tool like WebLogo.

ADD COMMENT
0
Entering edit mode

Thanks a lot for the in depth answer, i will sure try this. Actually, i just wanted to know theoretically, if it's possible to do so using a simple string search.

ADD REPLY
0
Entering edit mode

Sure. If you already have all the strings ("sequences") you can just run them through WebLogo or similar to build consensus strings.

ADD REPLY
0
Entering edit mode

Actually I already have the consensus as I mentioned in my question, which I just want to search in a gene using string search. The consensus I have is like this (from a paper):

-35 position                                                 -10 position
NGTGG                   N(14-18)                       NNGNNG

I am just confused as to how to read this consensus, the -35 consensus is of 5 letters then till -10 position, there should be 20 nucleotides in between so how come the distance between the two consensuses can vary from N(14-18), by my understanding it should be a fixed distance of 20 nucleotides only?

ADD REPLY
0
Entering edit mode

I suggest to use rsync for the download of the hg19 FASTA files and index files,instead of wget:
​rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ .

ADD REPLY

Login before adding your answer.

Traffic: 1641 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6