I subset a vcf file using bed that only include exon region.
The bed file is created by the following command:
gtf_file="/home/user/ref/Ensembl/homo_sapiens_112/Homo_sapiens.GRCh38.112.gtf.gz"
agat_convert_sp_gff2bed.pl --gff ${gtf_file} -o ${exon_bed} --sub exon
cut -f 1-3 ${exon_bed} | sort -k1,1 -k2,2n > exon.bed
After that, I used vcftools to subset my vcf file only to exon region.
bed_file="/home/user/data/exon.bed"
vcftools --vcf ${vcf_dir}/all.vcf \
--out ${vcf_dir}/all_exon \
--bed ${bed_file} --recode
Then, I annotated this vcf file with Ensembl-VEP docker.
docker run -t -i --rm --user $(id -u) \
-v ${docker_workdir}:/data \
-v ${ref_dir}:/ref \
-v ${software_dir}:/software \
-v ${cache_dir}:/cache ensemblorg/ensembl-vep \
vep --assembly GRCh38 --species homo_sapiens --cache --offline \
--dir_cache /cache \
--input_file /data/all.vcf --format vcf \
--output_file /data/all.vep.vcf \
--buffer_size 50000 \
--vcf --force_overwrite \
--plugin CADD,snv=/software/CADD-1.7/data/prescored/GRCh38_v1.7/no_anno/whole_genome_SNVs.tsv.gz,indels=/software/CADD-1.7/data/prescored/GRCh38_v1.7/no_anno/gnomad.genomes.r4.0.indel.tsv.gz \
--plugin AlphaMissense,file=/ref/AlphaMissense/AlphaMissense_hg38.tsv.gz
Everything went fine. But when I check the statistics, I found something abnormal.
The total amount of input variants is 40142. Overlapped genes are 13303. Overlapped transcripts are 85208. It reports 24706 intronic variants (61.5%). But how could that be? I have already masked intronic regions.
At first I thought maybe it is something wrong with my annotation file. But I carefully checked every step, the annotation I used for creating BED, masking VCF etc. are correct. I don't know what's going on.
Do you have any suggestion? Thank you.
It's this line correct in the VEP command?
Yes. This is a normal argument in VEP docker.