Addin gene information to VCF file
4
2
Entering edit mode
10.0 years ago
SemiQuant ▴ 80

Hi,

I've been attempting to add gene information to my VCF file. I have a vcf with my variants and a bed file with the gene names and their start and end position. I've tried to add this information to the vcf using GATK VariantAnnotator, vcftools annotate, bcftools annotate, bcftools insec but to no avail. I'm working on M. tuberculosis. Could anyone help me out?

Thanks

SNP next-gen annotation variant • 22k views
ADD COMMENT
0
Entering edit mode

Thanks for the responses. I think the problem is that my BED file is incorrectly formatted. I've tried bcftools annotate using a bgzipped, tabix index file like you mentioned and a tab delimited file indexed with tabix -s1 -b2 -e3 $annotations. Then I tried using a header file in conjunction. Could it be that when saving my tab limited file from excel it incorrectly saves it?

bgzip annotation.bed
tabix -p bed annototation.bed.gz
bgzip variants.vcf
tabix -s1 -b2 -e3 variants.vcf.gz
bcftools annotate -a annotation.bed.gz \
  -h Header_file.hdr \
  -c CHROM,FROM,TO,TAG,ID variants.vcf.gz
ADD REPLY
19
Entering edit mode
9.3 years ago
Sean ▴ 400

I had the exact same scenario, and BCFtools worked great for me. The following will take your gene names from a BED file and add them as an INFO tag in your VCF:

bcftools annotate \
  -a genes.bed.gz \
  -c CHROM,FROM,TO,GENE \
  -h <(echo '##INFO=<ID=GENE,Number=1,Type=String,Description="Gene name">') \
  variants.vcf.gz

A few things to note:

  • genes.bed.gz must be a standard BED file that has been zipped using bgzip (bgzip genes.bed)
  • genes.bed.gz must indexed using tabix (tabix -p bed genes.bed.gz)
  • genes.bed.gz does not need a header (we define this with the -c option instead)
  • 'CHROM', 'FROM', and 'TO' are required to be passed to the -c option, but these columns do not actually get added to your VCF (only the 'GENE' will get added)
  • Because we're not annotating from a VCF/BCF file, the -h option is required (it will not work otherwise)
  • Whatever we pass to the -h option will simply get added to the VCF meta-information
ADD COMMENT
4
Entering edit mode
7.9 years ago
duke ▴ 40

ScienceBuff, you probably have figured this out by now. But I just wanted to post an answer because I've been struggling with adding gene annotations to my Mtb vcf files as well. Here is what finally got it to work:

1) Download the H37Rv reference genome from https://www.ncbi.nlm.nih.gov/ (use the genome database). I used the tabular format, which has the following columns:

CHROM     FROM  TO  feature_ID  locus_tag   strand  na_length   gene    product    
NC_000962.3 1   1524    51951   Rv0001  +   1524    dnaA    chromosomal replication initiation protein  
NC_000962.3 2052    3260    52294   Rv0002  +   1209    dnaN    DNA polymerase III subunit beta  
NC_000962.3 3280    4437    52224   Rv0003  +   1158    recF    recombination protein F

2) Compress the file (if not in .gz format):

bgzip H37Rv-ref.tab

3) Index the compressed reference file. Since the reference is in tab format, make sure your parameters match the correct columns (e.g., for the reference file that I used, -s corresponds to CHROM (column 1), -b corresponds to FROM (column 2), -e corresponds to TO (column 3)):

tabix -s 1 -b 2 -e 3 H37Rv-ref.tab.gz

4) Create a header file by creating a new document and naming it something like "header.hdr". Populate the header file with information for each column of the reference/annotation file used (the CHROM, TO, and FROM headers are standard can be left out of this file):

##INFO=<ID=feature_ID,Number=1,Type=Integer,Description="Feature_ID">
##INFO=<ID=locus_tag,Number=1,Type=String,Description="Locus">
##INFO=<ID=strand,Number=1,Type=String,Description="Strand">
##INFO=<ID=na_length,Number=1,Type=Integer,Description="Length">
##INFO=<ID=gene,Number=1,Type=String,Description="Gene">
##INFO=<ID=product,Number=1,Type=String,Description="Product">

5) You are now ready to annotate your vcf file. The -c parameters include the columns that you want to include in the annotated vcf file (CHROM, FROM, and TO are included by default). The dashes (-) indicate columns to leave out. Other than the default parameters (e.g., CHROM, FROM, and TO), the names of other columns to include are preceded by "INFO/".

bcftools annotate \  
-a H37Rv-ref.tab.gz \  
-h header.hdr \  
-c CHROM,FROM,TO,-,INFO/locus_tag,-,-,INFO/gene,INFO/product \   
variants.vcf > annotated.vcf

Note: make sure that the string listed under the CHROM column of the vcf file matches what is listed under the CHROM column of the reference file (e.g., NC_000962.3, in this case).


I'm not sure if these are the issues with your process, but consider:

1) Making the following changes to the column headers of your annotation file:

  • #CHROM to CHROM
  • chromStart to FROM
  • chromEND to TO

So your column headers should look like:

CHROM    FROM   TO    name
chr    1    1524    Rv0001
chr    2052    3260    Rv0002
chr    3280    4437    Rv0003

2) Your header file can then only consist of the following line:

##INFO=<ID=name,Number=1,Type=String,Description="name">

3) Make sure you tabix the gz version of your reference/annotation file using the following parameters: tabix -s 1 -b 2 -e 3 (not -p)

4) I did not have to index my variants (vcf) file using tabix.

5) Make sure the string under the CHROM column of your vcf file matches the name listed under your reference file (i.e., "chr").

6) Your annotation command will then be:

bcftools annotate \  
-a annotation.bed.gz \   
-h header.hdr \   
-c CHROM,FROM,TO,INFO/name\   
variants.vcf.gz> annotated.vcf.gz
ADD COMMENT
0
Entering edit mode

Hi, Thanks for the detailed response. I have the same problem and still didn't solve it. This is my header file:

##INFO=<id=strand,number=1,type=String,description="strand">
##INFO=<id=ID,number=1,type=Integer,description="gene_id">
##INFO=<id=locus,number=1,type=String,description="locus">
##INFO=<id=locus_tag,number=1,type=String,description="locus_tag">
##INFO=<id=protein_product,number=1,type=String,description="protein_product"> 
##INFO=<id=length,number=1,type=Integer,description="length">
##INFO=<id=protein_name,number=1,type=String,description="protein_name">

This is the first two lines of my annotation file (sorted, bgzipped and tabix indexed with tabix -s 1 -b 2 -e 3 annotation_proteins_85_22812_chr_sorted.tab.gz ):

#CHROM  FROM    TO  strand  ID  locus   locus_tag   protein_product length  protein_name
chr1    252392  322063  -   100856150   ENPP1   -   XP_003638833.1  916 ectonucleotide pyrophosphatase/phosphodiesterase family member 1 isoform X2

This is my command line : bcftools annotate -a annotation_proteins_85_22812_chr_sorted.tab.gz -h header.hdr -c CHROM,FROM,TO,-,INFO/ID,INFO/locus,-,INFO/protein_product,-,- $f1 > $out

and still I can't see the gene id (ID) in the annotated vcf files

Could you please help me with this?

ADD REPLY
0
Entering edit mode

Hi I came across the same problem. Could you solve it?

ADD REPLY
0
Entering edit mode
10.0 years ago

I wrote a tool named VCFBed.

curl "https://raw.github.com/arq5x/gemini/master/test/test1.snpeff.vcf" |\
sed 's/^chr//' |\
java -jar  dist/vcfbed.jar TABIXFILE=~/ncbibiosystem.bed.gz TAG=NCBIBIOSYS  FMT='($4|$5|$6|$7)' |\
grep -E '(^#CHR|NCBI)'

##INFO=<ID=NCBIBIOSYS,Number=.,Type=String,Description="metadata added from /home/lindenb/ncbibiosystem.bed.gz . Format was ($4|$5|$6|$7)">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  1094PC0005  1094PC0009  1094PC0012  1094PC0013
1   69270   .   A   G   2694.18 .   AC=40;AF=1.000;AN=40;DP=83;Dels=0.00;EFF=SYNONYMOUS_CODING(LOW|SILENT|tcA/tcG|S60|305|OR4F5|protein_coding|CODING|ENST00000335137|exon_1_69091_70008);FS=0.000;HRun=0;HaplotypeScore=0.0000;InbreedingCoeff=-0.0598;MQ=31.06;MQ0=0;NCBIBIOSYS=(79501|119548|40|GPCR_downstream_signaling),(79501|106356|30|Signaling_by_GPCR),(79501|498|40|Olfactory_transduction),(79501|83087|60|Olfactory_transduction),(79501|477114|30|Signal_Transduction),(79501|106383|50|Olfactory_Signaling_Pathway);QD=32.86    GT:AD:DP:GQ:PL  ./. ./. 1/1:0,3:3:9.03:106,9,0  1/1:0,6:6:18.05:203,18,0
1   69511   .   A   G   77777.27    .   AC=49;AF=0.875;AN=56;BaseQRankSum=0.150;DP=2816;DS;Dels=0.00;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Aca/Gca|T141A|305|OR4F5|protein_coding|CODING|ENST00000335137|exon_1_69091_70008);FS=21.286;HRun=0;HaplotypeScore=3.8956;InbreedingCoeff=0.0604;MQ=32.32;MQ0=0;MQRankSum=1.653;NCBIBIOSYS=(79501|119548|40|GPCR_downstream_signaling),(79501|106356|30|Signaling_by_GPCR),(79501|498|40|Olfactory_transduction),(79501|83087|60|Olfactory_transduction),(79501|477114|30|Signal_Transduction),(79501|106383|50|Olfactory_Signaling_Pathway);QD=27.68;ReadPosRankSum=2.261  GT:AD:DP:GQ:PL  ./. ./. 0/1:2,4:6:15.70:16,0,40 0/1:2,2:4:21.59:22,0,40
ADD COMMENT
0
Entering edit mode
10.0 years ago

Either vcftools annotate or bcftools annotate should very easily do. The only thing you have to bear in mind is that the annotation file (vcf or bed for instance) must be bgzip-compressed and tabix-indexed. A worked example would be the following:

bgzip genes.bed
tabix -p bed genes.bed.gz
bcftools annotate -a genes.bed.gz variants.vcf.gz
ADD COMMENT
0
Entering edit mode

Thanks Jorge,

I'm still struggling with this, I have crated a BED file with only the genes in my VCF to test and it does still not work. I used the following code:

bgzip annotation.bed 
tabix -p bed annototation.bed.gz 
bgzip variants.vcf
tabix -s1 -b2 -e3 variants.vcf.gz

bcftools annotate -a annotation.bed.gz \
  -h Header_file.hdr \
  -c CHROM,FROM,TO,TAG,ID variants.vcf.gz

My VCF was generated by BWA alignment and GATK calling and the BED file is tab delimited looks like this:

#CHROM    chromStart    chromEnd    name
chr    1    1524    Rv0001
chr    2052    3260    Rv0002
chr    3280    4437    Rv0003
chr    4434    4997    Rv0004
chr    5123    7267    Rv0005
chr    7302    9818    Rv0006
chr    9914    10828    Rv0007

P.S my VCF CHROM is also called chr

Any help would be greatly appreciated.

ADD REPLY

Login before adding your answer.

Traffic: 1838 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6