In vcf2bed, we convert from 1-based, closed [start, end] Variant Call Format v4 (VCF) to sorted, 0-based, half-open [start-1, end) extended BED data (cite).
For your example, a single-base variant at position 89017961 would map to 89017960-89017961, by default.
(Other custom variant-processing options are available to handle the coordinates in a different fashion; see the documentation for information about the --snvs, --insertions and --deletions command-line options.)
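To make the default arithmetic concrete, here is a minimal sketch of the 1-based, closed to 0-based, half-open conversion described above. The function name to_bed_interval is hypothetical and is not part of BEDOPS; it only illustrates the start-1 rule and does not reproduce what the --snvs, --insertions or --deletions options do.

```shell
# Hypothetical helper, not part of BEDOPS: convert a 1-based, closed
# interval [start, end] into 0-based, half-open BED coordinates [start-1, end).
to_bed_interval() {
    start="$1"
    end="${2:-$1}"    # a single-base variant has end equal to start
    printf '%d\t%d\n' "$((start - 1))" "$end"
}

# The single-base variant from the question:
to_bed_interval 89017961
# prints: 89017960	89017961
```

Only the start coordinate shifts; the end stays the same because the half-open interval excludes its end position.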
If you plan to integrate your data with other UCSC-formatted BED datasets, consistently using the prefix chr for chromosome names is a good idea, especially if you plan to use toolkits like BEDOPS, GROK or Bedtools to process BED datasets. There are other approaches you can take, depending on the data or the lab or institution you're working with.
You can fix a lot of this with standard UNIX pipes, which are a powerful way to build processing pipelines.
For example, convert to BED and look at the first few lines with head:
$ vcf2bed < foo.vcf | head
...
Then use awk or other tools of choice to modify fields with prefixes, remove capitalization, etc.
To demonstrate, you can prefix chromosome numbers with chr very easily (note that this example keeps only the first three columns):
$ vcf2bed < foo.vcf | awk '{ print "chr"$1"\t"$2"\t"$3; }' - > foo.fixed.bed
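If you want to keep every column rather than just the first three, a variation along these lines may help. This is a sketch under a few assumptions: the input is tab-delimited, the wrapper name add_chr_prefix is made up for illustration, and the sample record is invented test data rather than real vcf2bed output.

```shell
# Hypothetical wrapper: prefix the first column with "chr" unless it is
# already there, leaving all remaining tab-delimited columns untouched.
add_chr_prefix() {
    awk 'BEGIN { FS = OFS = "\t" }
         $1 !~ /^chr/ { $1 = "chr" $1 }
         { print }'
}

# Invented sample record, for illustration only:
printf '7\t89017960\t89017961\tid1\n' | add_chr_prefix
# prints: chr7	89017960	89017961	id1
```

In a pipeline this would slot in as vcf2bed < foo.vcf | add_chr_prefix > foo.fixed.bed. The check against an existing prefix means rerunning it on already-fixed data does not produce chrchr7.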
In addition to the great answers below, you might also find the following tutorial useful: Cheat sheet for one-based vs zero-based coordinate systems