Question

How to get transcripts and TSS

0

Entering edit mode

10.4 years ago

Affan ▴ 310

I have TF binding sites in the format:

>chr1:6585537-6585547(-)
------ctatttatag-------
>chr1:228916512-228916522(+)
------ctatttatag-------
>chr12:50731227-50731237(+)
------ctatttatag-------
>chr13:49597702-49597712(+)
------ctatttatag-------
>chr14:103058980-103058990(+)
------ctatttatag-------

What I want to do is find the gene/promoter region for each one of these entries. The transcription factor is the MEF2 family.

Would the best way to do this is to just get a 1000bp neighbourhood around the transcription factor binding site?

Background: I am looking to scan these regions with my PWM to get its accuracy. It is not feasible to scan the entire chromosome (let alone, the genome because of the high number of false positives).

promoter • 2.5k views

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.4 years ago by Affan ▴ 310

Ram · Answer 1 · 2015-02-22

You could convert your TFBS data to BED:

#!/usr/bin/env perl

use strict;
use warnings;

my @line = map { [] } 1..6;
my $idx = 0;
while(<STDIN>) {
    chomp;
    if ($_ =~ '^>') {
        my @elems = split(/[>:\-\\(\\)]/, $_);
        if (!defined $elems[4]) { 
            $elems[4] = "-"; 
        }
        $line[0] = $elems[1];
        $line[1] = $elems[2];
        $line[2] = $elems[3];
        $line[3] = "id-".$idx++;
        $line[4] = 0;
        $line[5] = $elems[4];
        print join("\t", @line)."\n";
    }
}

Usage:

$ ./tfbs2bed.pl < tfbs.txt | sort-bed - > tfbs.bed

Then pad this BED file by whatever window you like, say a roughly 1000 base window that is centered on the binding site - in other words, 500 bases that are up- and downstream of the edges of each binding site:

$ bedops --range 500 --everything tfbs.bed > tfbs_1kpad.bed

Then do set operations to retrieve the genes associated with these windows.

First, depending on the genome you are working with, you might download a set of human (hg19) genes, using gtf2bed to make a BED file, e.g.:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gtf.gz \
    | gunzip -c - \
    | gtf2bed - \
    | grep -w gene - \
    > gencode.v21.genes.bed

Second, you could then look for any of these genes that overlap your TFBS windows:

$ bedops --element-of 1 gencode.v21.genes.bed tfbs_1kpad.bed > genes_overlapping_tfbs_1kpad.bed

As a rough TSS, you might grab the start position of the forward-stranded gene subset and then use bedops to pad the upstream range, say by 1000 bases (you can adjust padding to match the size of the promoter region you are interested in):

$ awk '$6=="+"' genes_overlapping_tfbs_1kpad.bed \
    | awk '{$3=$2+1; print $0}' - \
    | bedops --range 1000:0 --everything - \
    > genes_overlapping_tfbs_1kpad.for.tss.upstream1k.bed

Then repeat this for the reverse strand, using the end position as the transcriptional start site of the gene, padding "downstream" (which is really upstream of the TSS, relative to strand orientation):

$ awk '$6=="-"' genes_overlapping_tfbs_1kpad.bed \
    | awk '{$2=$3-1; print $0}' - \
    | bedops --range 0:1000 --everything - \
    > genes_overlapping_tfbs_1kpad.rev.tss.upstream1k.bed

Then union the two files into one BED file:

$ bedops --everything genes_overlapping_tfbs_1kpad.*.tss.upstream1k.bed > genes_overlapping_tfbs_1kpad.tss.upstream1k.bed

This final file contains 1k regions upstream of the TSS of any genes that overlap an approximately 1k window centered on each MEF2 binding site.