First, you could build an indexed set of FASTA files for all of your chromosomes using samtools.
Second, you can use tools in the BEDOPS suite (for example) along with awk to generate a BED file containing coordinates for regions upstream of genes ("promoters") and retrieve FASTA sequences of the upstream regions.
Third, you can use these sequences with a discovery tool like MEME or Homer to search for motifs, or WebLogo to generate sequence logos.
1) Say you are working with hg19
. Create a work directory where your FASTA files and index files will go:
$ cd /foo/bar/baz
$ wget 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/*.fa.gz'
$ for fn in `ls *.fa.gz`; do gunzip $fn; done
$ for fn in `ls *.fa`; do samtools faidx $fn; done
2) Use tools like BEDOPS gff2bed and bedops to convert gene annotations to BED format, and retrieve regions up to 35 bases upstream of the gene start position. Your annotations file would come from GENCODE, RefSeq, etc.
For example:
$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \
| gunzip --stdout - \
| gff2bed - \
| grep -w 'gene' - \
| cut -f1-6 - \
> genes.bed
You might even take a smaller subset of annotations, depending on your genes of interest, such as what is described in your publication of interest.
Then split the master list of gene elements by strand:
$ awk '$6=="+"' genes.bed > genes.for.bed
$ awk '$6=="-"' genes.bed > genes.rev.bed
Then do bedops --range operations in a stranded manner to get upstream promoter regions:
$ awk '{print $1"\t"$2"\t"($2+1)"\t"$4"\t"$5"\t"$6}' genes.for.bed | bedops --range -35:0 --everything - > genes.for.start.upstream.bed
$ awk '{print $1"\t"$3"\t"($3+1)"\t"$4"\t"$5"\t"$6}' genes.rev.bed | bedops --range 0:35 --everything - > genes.rev.start.upstream.bed
We need to do per-strand operations, because the start of a gene on a reverse-stranded annotation is usually at the stop position of a BED element. What is upstream of a reverse-stranded annotation is downstream of a forward-stranded element, and vice versa.
Now that we have the upstream regions, we can convert them to FASTA via a Perl script that calls samtools against the indexed reference sequence.
Here is an example of such a helper Perl script:
#!/usr/bin/env perl
use strict;
use warnings;
# fastaDir contains per-chromosome UCSC FASTA (.fa) files and samtools-indexed index (.fai) files
my $fastaDir = "/foo/bar/baz";
while (<STDIN>) {
chomp;
my ($chr, $start, $stop, $id, $score, $strand) = split("\t", $_);
my $queryKey = "$chr:$start-$stop";
my $queryResult = `samtools faidx $fastaDir/$chr.fa $queryKey`; chomp $queryResult;
my @lines = split("\n", $queryResult);
my @seqs = @lines[1..(scalar @lines - 1)];
my $seq = join("", @seqs);
if ($strand eq "-") { $seq = revdnacomp($seq); }
my $header = ">".join(":",($chr, $start, $stop, $id, $score, $strand));
print STDOUT $header."\n".uc($seq)."\n";
}
sub revdnacomp {
my $dna = shift @_;
my $revcomp = reverse($dna);
$revcomp =~ tr/ACGTacgt/TGCAtgca/;
return $revcomp;
}
You might run this script like so:
$ bed2fasta.pl < genes.for.start.upstream.bed > genes.for.start.upstream.bed.fa
$ bed2fasta.pl < genes.rev.start.upstream.bed > genes.rev.start.upstream.bed.fa
3) Now that you have stranded FASTA files for your sequences, you can run them through a motif discovery tool like MEME or Homer. Or you can quickly see if there are high-information consensus sequences with a tool like WebLogo.
Thanks a lot for the in depth answer, i will sure try this. Actually, i just wanted to know theoretically, if it's possible to do so using a simple string search.
Sure. If you already have all the strings ("sequences") you can just run them through WebLogo or similar to build consensus strings.
Actually I already have the consensus as I mentioned in my question, which I just want to search in a gene using string search. The consensus I have is like this (from a paper):
I am just confused as to how to read this consensus, the -35 consensus is of 5 letters then till -10 position, there should be 20 nucleotides in between so how come the distance between the two consensuses can vary from N(14-18), by my understanding it should be a fixed distance of 20 nucleotides only?
I suggest to use rsync for the download of the hg19 FASTA files and index files,instead of wget:
​rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ .