Hi
I'm working with variant call format (VCF) files, and I am new to this topic. My goal is to check if there are any mutations of a specific gene (say ARIH1) within a VCF file. How can I do it?
I tried the following command, but I don't know if it's correct:
grep -E "ARIH1" file.vcf
Should I convert the VCF file to something else, first?
How can I check the presence of mutations of that specific gene?
Thanks!
Hi finswimmer
thanks for your help. I tried your command but it generated this error:
What should I do?
Oops, sorry my fault. I edited my answer a bit.
Alright, thanks. First of all, my VCF files are in the hg19 coordinates, therefore they are different from the GRCh38/hg38 of the instruction you suggested me. Can you please provide (or tell me where to find) the ARIH1 gene genomic interval in the hg19 coordinates?
Then, second question: after running the command, how should I read the output? The output of
bcftools view input.vcf.gz 15:72474326-72602985
(which does not make sense, by the way, because of the genomic coordinate issue, as we said) contains 6 genomic regions (rows in a table) and 28 fields (columns in a table). The first ten columns' have the following names: #CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, BS_02D1402.What does this file tell me about the presence of the gene ARIH1?
Thanks! :-)
-- Davide
EDIT: I checked the ENSEMBL website for hg19 and found out that the hg19 coordinates for the ARIH1 gene are chr15:72,766,667-72,879,692. Correct?
Hi Davide,
Yes those coordinates are correct.
The 6 rows in your table will be the individual variants found that fall within those genomic coordinates. More information on the VCF format here. If you have other columns after the INFO I assume that these are sample names, e.g. BS_02D1402. Therefore I suspect that you have genotype information in the file too (looks like this 0|0 or 1|0 for example), you can read about that in the VCF format specification linked to above. e.g.
All this tells you is the location (chromosome 15, base pair 72767002) and the alleles (on the reference genome A, in your sample C (a SNP variant). The rest is quality information.
If you want to find out what the consequence of this variation is on your gene of interest, you will need to run your VCF file through a effect predictor tool, such as Ensembl's Variant effect predictor (VEP). This will then tell you if it is a known (existing) variant, or a novel variant, and if it affects the protein structure of the gene, etc. Here's an example for the variant above.
Hi Erin_Ensembl
Thanks for your reply. I understand that there are 6 individual variants that fall within the genomic coordinates of gene ARIH1. My goal at this stage is to understand if each gene of a gene list I have is "contained" in the VCF file, meaning if there are mutations of each gene in the VCF file.
Let me see if I understand correctly:
If I run the
bcftools view
for the gene X on the VCF file, and the output contains some rows/variants, then it means Yes, that the gene X has mutations in the VCF file;Otherwise, if I run the
bcftools view
for the gene X on the VCF file, and the output is empty or contains no row/variant, then it means No, that the gene X has no mutation in the VCF file.Is it correct?
Yes and yes :)
The important thing to understand is, that a
vcf
doesn't know anything about genes, transcripts, exons, ... It only knows about genomic coordinates, what's the reference base there and how it is different in your sample. It is on you to find additionell information about these changes, e.g. with the help by VEP like Erin_Ensembl tell before.Yes indeed, thanks Finswimmer.
Depending on what you're doing in your analysis you might want to use VEP rather than BCF tools. In using the VEP you can apply filters based on the gene IDs or names, rather than specifying the genomic coordinate which could save you the extra step. You'd also gain other information about consequence (are these variants affecting protein coding regions or not) and other useful information like phenotype annotations and pathogenicity scores.
You can use the VEP through the web tool, REST API or Perl API if that's what you're wanting to do. Full info here.
Hi finswimmer and Erin_Ensembl,
Thanks for your help, all very clear. Using VEP might be an option for the future, but unfortunately not for now because for patients' privacy reasons I cannot upload VCF files anywhere.
Thanks again! :-)
No problem!
If you decide you want to use the VEP in the future, you can get around the uploading the VCF by using the VEP script/Perl API. If you install the cache (available for hg19/GRCh37) then you can run the analysis locally and offline, hence you don't need to worry about the data privacy.
Hi Erin_Ensembl and finswimmer
Thanks for your help. I have another question for you. Among the variants that I select in my VCF files through the
bcftools view
command, I need to choose only the non-synonymous variants.Do you know how I can modify my
bcftools view
accordingly? Or what bash command can I use to do that?Cheers
Hello,
You can only do this if you have annotated your VCF with consequence terms, which are listed in the INFO field. Looking at your previous example it looks like you don't. This is an example of what it might look like:
If you have this, great. You can use VCFtools to simply filter using the info field filtering (see -keep-INFO here). If not, then you can annotate your variants with VEP, or there are other tools available.
Also, non-synonymous typically = missense, but you might want to include other protein altering variants such as stop gained or frameshift variants.
Hi Erin_Ensembl Thanks for your reply, very useful.
Unfortunately, I don't have that information in the variants of my VCF files. I plan to use SnpEff to annotate it.
To select the non-synonymous variants, should I filter only the "missense_variant" variants then? Or something more?
Hi Erin_Ensembl
another question: I am exploring VEP; how are non-synonymous variants labeled in a typical VEP output? I see many biotypes (nonsense_mediated_decay, protein_coding, retained_intron, etc) but I cannot understand if one is suitable for me.
Thanks!
Hey, The full list of variant consequences annotated by the VEP are here. So missense variants will give you SNPs that change the amino acid within the protein, this may be serious for protein function and structure, or not, depending on the chemical properties and conservation of that amino acid at that position in the protein. Pathogenicity predictions are used to determine this.
Other SNPs may lead to a start gain/lost modification which is also going to affect the protein sequence. Aside from SNPs, you may have insertions or deletions which will affect the reading frame (frameshift variants) and also affect the protein sequence.
We have a group of annotations in the Ensembl browser which we refer to as 'missense and protein truncating variant (PTV)' which includes: - missense variant - splice acceptor variant - splice donor variant - stop gained - frameshift variant