Question

Searching promoter consensus in a gene

0

Entering edit mode

9.4 years ago

Ritvik ▴ 30

Hi,

Can anyone please tell me how to search for the following promoter consensus (published in a paper) in a set of genes:

-35 position                                         -10 position
   NGTGG                   N(14-18)                     NNGNNG

Should I just search for NGTGG and NNGNNG at -35 and -10 positions respectively or should I first search for NNGNNG at -10 then use a sliding window of N(14-18) and next search for the occurrence of NGTGG each time?

In short, are the consensus sequences found strictly at positions -35 and -10? If these positions are fixed, then how come -10 and -35 positions be spaced by a variable number of nucleotides?

promoter consensus • 2.8k views

ADD COMMENT • link updated 23 months ago by Ram 44k • written 9.4 years ago by Ritvik ▴ 30

Ram · Answer 1 · 2015-07-01

First, you could build an indexed set of FASTA files for all of your chromosomes using samtools.

Second, you can use tools in the BEDOPS suite (for example) along with awk to generate a BED file containing coordinates for regions upstream of genes ("promoters") and retrieve FASTA sequences of the upstream regions.

Third, you can use these sequences with a discovery tool like MEME or Homer to search for motifs, or WebLogo to generate sequence logos.

1) Say you are working with hg19. Create a work directory where your FASTA files and index files will go:

$ cd /foo/bar/baz
$ wget 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/*.fa.gz'
$ for fn in `ls *.fa.gz`; do gunzip $fn; done
$ for fn in `ls *.fa`; do samtools faidx $fn; done

2) Use tools like BEDOPS gff2bed and bedops to convert gene annotations to BED format, and retrieve regions up to 35 bases upstream of the gene start position. Your annotations file would come from GENCODE, RefSeq, etc.

For example:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \
    | gunzip --stdout - \
    | gff2bed - \
    | grep -w 'gene' - \
    | cut -f1-6 - \
    > genes.bed

You might even take a smaller subset of annotations, depending on your genes of interest, such as what is described in your publication of interest.

Then split the master list of gene elements by strand:

$ awk '$6=="+"' genes.bed > genes.for.bed
$ awk '$6=="-"' genes.bed > genes.rev.bed

Then do bedops --range operations in a stranded manner to get upstream promoter regions:

$ awk '{print $1"\t"$2"\t"($2+1)"\t"$4"\t"$5"\t"$6}' genes.for.bed | bedops --range -35:0 --everything - > genes.for.start.upstream.bed
$ awk '{print $1"\t"$3"\t"($3+1)"\t"$4"\t"$5"\t"$6}' genes.rev.bed | bedops --range 0:35 --everything - > genes.rev.start.upstream.bed

We need to do per-strand operations, because the start of a gene on a reverse-stranded annotation is usually at the stop position of a BED element. What is upstream of a reverse-stranded annotation is downstream of a forward-stranded element, and vice versa.

Now that we have the upstream regions, we can convert them to FASTA via a Perl script that calls samtools against the indexed reference sequence.

Here is an example of such a helper Perl script:

#!/usr/bin/env perl

use strict;
use warnings;

# fastaDir contains per-chromosome UCSC FASTA (.fa) files and samtools-indexed index (.fai) files
my $fastaDir = "/foo/bar/baz"; 

while (<STDIN>) {
    chomp;
    my ($chr, $start, $stop, $id, $score, $strand) = split("\t", $_);
    my $queryKey = "$chr:$start-$stop";
    my $queryResult = `samtools faidx $fastaDir/$chr.fa $queryKey`; chomp $queryResult;
    my @lines = split("\n", $queryResult);
    my @seqs = @lines[1..(scalar @lines - 1)];
    my $seq = join("", @seqs);
    if ($strand eq "-") { $seq = revdnacomp($seq); }
    my $header = ">".join(":",($chr, $start, $stop, $id, $score, $strand));
    print STDOUT $header."\n".uc($seq)."\n";
}

sub revdnacomp {
    my $dna = shift @_;
    my $revcomp = reverse($dna);
    $revcomp =~ tr/ACGTacgt/TGCAtgca/;
    return $revcomp;
}

You might run this script like so:

$ bed2fasta.pl < genes.for.start.upstream.bed > genes.for.start.upstream.bed.fa
$ bed2fasta.pl < genes.rev.start.upstream.bed > genes.rev.start.upstream.bed.fa

3) Now that you have stranded FASTA files for your sequences, you can run them through a motif discovery tool like MEME or Homer. Or you can quickly see if there are high-information consensus sequences with a tool like WebLogo.