ScienceBuff, you probably have figured this out by now. But I just wanted to post an answer because I've been struggling with adding gene annotations to my Mtb vcf files as well. Here is what finally got it to work:
1) Download the H37Rv reference genome from https://www.ncbi.nlm.nih.gov/ (use the genome database). I used the tabular format, which has the following columns:
CHROM FROM TO feature_ID locus_tag strand na_length gene product
NC_000962.3 1 1524 51951 Rv0001 + 1524 dnaA chromosomal replication initiation protein
NC_000962.3 2052 3260 52294 Rv0002 + 1209 dnaN DNA polymerase III subunit beta
NC_000962.3 3280 4437 52224 Rv0003 + 1158 recF recombination protein F
2) Compress the file (if not in .gz format):
bgzip H37Rv-ref.tab
3) Index the compressed reference file. Since the reference is in tab format, make sure your parameters match the correct columns (e.g., for the reference file that I used, -s corresponds to CHROM (column 1), -b corresponds to FROM (column 2), -e corresponds to TO (column 3)):
tabix -s 1 -b 2 -e 3 H37Rv-ref.tab.gz
4) Create a header file by creating a new document and naming it something like "header.hdr". Populate the header file with information for each column of the reference/annotation file used (the CHROM, TO, and FROM headers are standard can be left out of this file):
##INFO=<ID=feature_ID,Number=1,Type=Integer,Description="Feature_ID">
##INFO=<ID=locus_tag,Number=1,Type=String,Description="Locus">
##INFO=<ID=strand,Number=1,Type=String,Description="Strand">
##INFO=<ID=na_length,Number=1,Type=Integer,Description="Length">
##INFO=<ID=gene,Number=1,Type=String,Description="Gene">
##INFO=<ID=product,Number=1,Type=String,Description="Product">
5) You are now ready to annotate your vcf file. The -c
parameters include the columns that you want to include in the annotated vcf file (CHROM, FROM, and TO are included by default). The dashes (-
) indicate columns to leave out. Other than the default parameters (e.g., CHROM, FROM, and TO), the names of other columns to include are preceded by "INFO/".
bcftools annotate \
-a H37Rv-ref.tab.gz \
-h header.hdr \
-c CHROM,FROM,TO,-,INFO/locus_tag,-,-,INFO/gene,INFO/product \
variants.vcf > annotated.vcf
Note: make sure that the string listed under the CHROM column of the vcf file matches what is listed under the CHROM column of the reference file (e.g., NC_000962.3, in this case).
I'm not sure if these are the issues with your process, but consider:
1) Making the following changes to the column headers of your annotation file:
#CHROM
to CHROM
chromStart
to FROM
chromEND
to TO
So your column headers should look like:
CHROM FROM TO name
chr 1 1524 Rv0001
chr 2052 3260 Rv0002
chr 3280 4437 Rv0003
2) Your header file can then only consist of the following line:
##INFO=<ID=name,Number=1,Type=String,Description="name">
3) Make sure you tabix the gz version of your reference/annotation file using the following parameters: tabix -s 1 -b 2 -e 3
(not -p
)
4) I did not have to index my variants (vcf) file using tabix.
5) Make sure the string under the CHROM column of your vcf file matches the name listed under your reference file (i.e., "chr").
6) Your annotation command will then be:
bcftools annotate \
-a annotation.bed.gz \
-h header.hdr \
-c CHROM,FROM,TO,INFO/name\
variants.vcf.gz> annotated.vcf.gz
Thanks for the responses. I think the problem is that my BED file is incorrectly formatted. I've tried bcftools annotate using a bgzipped, tabix index file like you mentioned and a tab delimited file indexed with
tabix -s1 -b2 -e3 $annotations
. Then I tried using a header file in conjunction. Could it be that when saving my tab limited file from excel it incorrectly saves it?