Question

Analyze Promoter Regions In A Vcf File

0

Entering edit mode

12.3 years ago

farah ▴ 30

I have a VCF file. My aim is to check promoter variants within this file. I have got a list of all the promoter regions in a fasta format (supposedly, this file has a chromosome number, end and start position of the prmoter region which i defined as 2000bp upstream). how do I proceed? any help?

promoter vcf • 4.0k views

ADD COMMENT • link updated 12.3 years ago by Alex Reynolds 36k • written 12.3 years ago by farah ▴ 30

Ram · Answer 1 · 2013-03-01

2

Entering edit mode

12.3 years ago

fo3c ▴ 450

Assuming the calls in the vcf file have been filtered and are high-confidence calls, extract the intervals from the fasta file (headers) and intersect them with the vcf file to see what mutations, if any, have been called in the promoter regions.

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 12.3 years ago by fo3c ▴ 450

Ram · Answer 2 · 2013-03-01

Use Bedtools intersect or intersectbed or windowBed depending on your input

bedtools intersect -a promotor.vcf -b input.VCF -f 1 -wao

(you can use -wa or -wb etc accordingly)

Info from Bedtools:

Note: When intersecting SNPs, make sure the coordinate conform to the UCSC format. That is, the start position for each SNP should be SNP position - 1 and the end position should be SNP position. E.g. chr7 10000001 10000002 rs123464

Report the base-pair overlap between sequence alignments and genes.

intersectBed -a reads.bed -b genes.bed

If you have Gene cordinates as bed file: to Report all SNPs that are within 5000 bp upstream (-l) or 1000 bp downstream (-r) of genes. Define upstream and downstream based on strand (-sw).

windowBed -a genes.bed –b snps.bed –l 5000 –r 1000 -sw

Ram · Answer 3 · 2013-03-01

Convert the FASTA-formatted promoter regions to a UCSC-formatted BED file and use a tool like BEDOPS bedmap to map promoter regions that overlap your VCF calls. A discussion follows which explains how you might approach this.

To start, let's say your promoter regions are formatted as follows:

>promoter1 chrN:A-B
ACGTACG...TACAGT
>promoter2 chrN:C-D
...

Convert this to a five-column BED file, where the chromosome, start, stop and ID columns are taken from the FASTA header, and a dummy score value is added for convenience. For example:

#!/usr/bin/env perl

use strict;
use warnings;

while (<>) {
  if ($_ =~ /^>/) {
    chomp;
    my $ln = $_;
    $ln =~ s/^>//s;
    my ($id, $chr, $start, $stop) = split(/[\s+|:|-]/, $ln);
    my $score = 0;
    print STDOUT join("\t", ($chr, $start, $stop, $id, $score))."\n";
  }
}

Convert the FASTA data to a sorted BED file with a command like:

$ ./fa2bed.pl < myPromoters.fa | sort-bed - > myPromoters.bed

This is a "map" file that contains promoter regions, IDs and (placeholder) scores that we can use as an input to bedmap:

$ more myPromoters.bed
chrN    A    B    promoter1    0
chrN    C    D    promoter2    0
...

We're now ready to answer the question. We'll first convert your VCF calls to 0-indexed BED using the vcf2bed conversion script. We pipe the conversion results into sort-bed to sort them, and then pipe that into bedmap (along with myPromoters.bed) to do the actual mapping:

$ vcf2bed < myCalls.vcf \
    | sort-bed - \
    | bedmap --echo --echo-map-id --delim '\t' - myPromoters.bed \
    > myAnswer.bed

The result (myAnswer.bed) will be a BED-formatted file. The first four columns will be the (0-indexed) positional data and variant ID of the VCF call. The last column (either the eleventh or twelfth column) will be the ID (or IDs) of promoter regions which overlap that call. (If you want the full promoter element — region and ID — use --echo-map in place of --echo-map-id.)

If you need to adjust the boundaries of your definition of a promoter region, you can use BEDOPS bedops to pre-process your myPromoters.bed file. As an example, here is how to use the --range operator to widen the boundaries of each promoter upstream by 10000 bases and downstream by 500 bases, along with the --everything operator to process all elements:

$ bedops --range 10000:500 --everything myPromoters.bed > myWidenedPromoters.bed

Then use the myWidenedPromoters.bed file with mapping operations, etc.