If you have a sorted BED file containing genes and their coordinates, you can use BEDOPS bedmap
to map positions in a VCF file to those genes.
The overall process is:
- Convert the VCF file to BED format with the
vcf2bed
script
- Get genes from somewhere and put them into BED format
- Map the converted file to the genes
Conversion
$ vcf2bed < positions.vcf > positions.bed
Obtaining Genes
Now grab some genes and turn them into a BED file. Let's say you are working with mouse (build mm10
). We'll take Ensembl gene annotations:
$ wget -O - ftp://ftp.ensembl.org/pub/mnt2/release-71/gtf/mus_musculus/Mus_musculus.GRCm38.71.gtf.gz \
| gunzip -c - \
| gtf2bed - \
| grep -w gene \
> mm10.genes.bed
You'd modify this step as needed for whichever organism and build you are working with. We use gtf2bed
here to convert the GTF-formatted Ensembl records to sorted BED, and we then filter the output specifically for genes.
Mapping
We have two sorted, BED-formatted inputs: the VCF-sourced positions and our gene annotations. We're now ready to use bedmap
to do mapping:
$ bedmap --echo --echo-map-id positions.bed mm10.genes.bed > answer.bed
The file answer.bed
is a sorted BED file that contains the mutation positions and a semi-colon-delimited list of gene IDs which overlap the position:
[ mutation-1 ] | [ gene-1-1 ] ; ... ; [ gene-1-i ]
[ mutation-2 ] | [ gene-2-1 ] ; ... ; [ gene-2-j ]
...
[ mutation-N ] | [ gene-N-1 ] ; ... ; [ gene-N-k ]
Where no genes overlap a mutation, nothing is printed. The default overlap parameter is one or more bases. You can make this setting more stringent by adjusting overlap criteria.
If you want the entire gene record — and not just the ID — use --echo-map
in place of --echo-map-id
. Other mapping --echo-*
options are available, depending on what you want bedmap
to report.
I keep running into your answers now that I'm working with variants again... hello! :D
Hey, congrats on the new job! Hope you're well.