Convert the FASTA-formatted promoter regions to a UCSC-formatted BED file and use a tool like BEDOPS bedmap
to map promoter regions that overlap your VCF calls. A discussion follows which explains how you might approach this.
To start, let's say your promoter regions are formatted as follows:
>promoter1 chrN:A-B
ACGTACG...TACAGT
>promoter2 chrN:C-D
...
Convert this to a five-column BED file, where the chromosome, start, stop and ID columns are taken from the FASTA header, and a dummy score value is added for convenience. For example:
#!/usr/bin/env perl
use strict;
use warnings;
while (<>) {
if ($_ =~ /^>/) {
chomp;
my $ln = $_;
$ln =~ s/^>//s;
my ($id, $chr, $start, $stop) = split(/[\s+|:|-]/, $ln);
my $score = 0;
print STDOUT join("\t", ($chr, $start, $stop, $id, $score))."\n";
}
}
Convert the FASTA data to a sorted BED file with a command like:
$ ./fa2bed.pl < myPromoters.fa | sort-bed - > myPromoters.bed
This is a "map" file that contains promoter regions, IDs and (placeholder) scores that we can use as an input to bedmap
:
$ more myPromoters.bed
chrN A B promoter1 0
chrN C D promoter2 0
...
We're now ready to answer the question. We'll first convert your VCF calls to 0-indexed BED using the vcf2bed conversion script. We pipe the conversion results into sort-bed to sort them, and then pipe that into bedmap
(along with myPromoters.bed
) to do the actual mapping:
$ vcf2bed < myCalls.vcf \
| sort-bed - \
| bedmap --echo --echo-map-id --delim '\t' - myPromoters.bed \
> myAnswer.bed
The result (myAnswer.bed
) will be a BED-formatted file. The first four columns will be the (0-indexed) positional data and variant ID of the VCF call. The last column (either the eleventh or twelfth column) will be the ID (or IDs) of promoter regions which overlap that call. (If you want the full promoter element — region and ID — use --echo-map
in place of --echo-map-id
.)
If you need to adjust the boundaries of your definition of a promoter region, you can use BEDOPS bedops
to pre-process your myPromoters.bed
file. As an example, here is how to use the --range
operator to widen the boundaries of each promoter upstream by 10000 bases and downstream by 500 bases, along with the --everything
operator to process all elements:
$ bedops --range 10000:500 --everything myPromoters.bed > myWidenedPromoters.bed
Then use the myWidenedPromoters.bed
file with mapping operations, etc.
thank you very much for your help. well I did convert promoter reference file to a BED format but then I used Vcftools to intersect the VCF file of the patient's genome and the promoter reference file. I found vcftools very easy for beginners.