How to check the presence of mutations of a specific gene in a specific VCF file?
1
0
Entering edit mode
5.8 years ago
Davide Chicco ▴ 120

Hi

I'm working with variant call format (VCF) files, and I am new to this topic. My goal is to check if there are any mutations of a specific gene (say ARIH1) within a VCF file. How can I do it?

I tried the following command, but I don't know if it's correct:

grep -E "ARIH1" file.vcf

Should I convert the VCF file to something else, first?

How can I check the presence of mutations of that specific gene?

Thanks!

vcf genes • 8.2k views
ADD COMMENT
3
Entering edit mode
5.8 years ago

Hello Davide Chicco ,

whether a simple grep for the gene name works, depends on if the vcf file is annotated with this information. The best way is to find chromosomal position of your gene of interest (for ARIH1 this is 15: 72,474,326-72,602,985 in case you are using GRCh38/hg38 as the reference genome) and filter your vcf file based on that position.

For filtering vcf files I strongly recommend tools that are specialized on working with vcf. My favorite is bcftools. To get all variants, that are located in the chromosomal range of interest, you can do this:

$ bcftools view input.vcf -t 15:72474326-72602985

Have a look into your file before if the chromosome names are prefixed with chr e.g. chr15 or not and adopt the command line.

If you have a large vcf file I recommend to compress your file with bgzip and index it. Doing this bcftools have random access to the region which is much faster:

$ bgzip -c input.vcf > input.vcf.gz
$ tabix input.vcf.gz
$ bcftools view input.vcf.gz 15:72474326-72602985
ADD COMMENT
0
Entering edit mode

Hi finswimmer

thanks for your help. I tried your command but it generated this error:

Failed to open fileName.gatk.snp.indel.vcf: not compressed with bgzip

What should I do?

ADD REPLY
0
Entering edit mode

Oops, sorry my fault. I edited my answer a bit.

ADD REPLY
0
Entering edit mode

Alright, thanks. First of all, my VCF files are in the hg19 coordinates, therefore they are different from the GRCh38/hg38 of the instruction you suggested me. Can you please provide (or tell me where to find) the ARIH1 gene genomic interval in the hg19 coordinates?

Then, second question: after running the command, how should I read the output? The output of bcftools view input.vcf.gz 15:72474326-72602985 (which does not make sense, by the way, because of the genomic coordinate issue, as we said) contains 6 genomic regions (rows in a table) and 28 fields (columns in a table). The first ten columns' have the following names: #CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, BS_02D1402.

What does this file tell me about the presence of the gene ARIH1?

Thanks! :-)

-- Davide

EDIT: I checked the ENSEMBL website for hg19 and found out that the hg19 coordinates for the ARIH1 gene are chr15:72,766,667-72,879,692. Correct?

ADD REPLY
0
Entering edit mode

Hi Davide,

Yes those coordinates are correct.

The 6 rows in your table will be the individual variants found that fall within those genomic coordinates. More information on the VCF format here. If you have other columns after the INFO I assume that these are sample names, e.g. BS_02D1402. Therefore I suspect that you have genotype information in the file too (looks like this 0|0 or 1|0 for example), you can read about that in the VCF format specification linked to above. e.g.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
15  72767002  .  A  C   100 PASS

All this tells you is the location (chromosome 15, base pair 72767002) and the alleles (on the reference genome A, in your sample C (a SNP variant). The rest is quality information.

If you want to find out what the consequence of this variation is on your gene of interest, you will need to run your VCF file through a effect predictor tool, such as Ensembl's Variant effect predictor (VEP). This will then tell you if it is a known (existing) variant, or a novel variant, and if it affects the protein structure of the gene, etc. Here's an example for the variant above.

ADD REPLY
0
Entering edit mode

Hi Erin_Ensembl

Thanks for your reply. I understand that there are 6 individual variants that fall within the genomic coordinates of gene ARIH1. My goal at this stage is to understand if each gene of a gene list I have is "contained" in the VCF file, meaning if there are mutations of each gene in the VCF file.

Let me see if I understand correctly:

  • If I run the bcftools view for the gene X on the VCF file, and the output contains some rows/variants, then it means Yes, that the gene X has mutations in the VCF file;

  • Otherwise, if I run the bcftools view for the gene X on the VCF file, and the output is empty or contains no row/variant, then it means No, that the gene X has no mutation in the VCF file.

Is it correct?

ADD REPLY
0
Entering edit mode

Yes and yes :)

The important thing to understand is, that a vcf doesn't know anything about genes, transcripts, exons, ... It only knows about genomic coordinates, what's the reference base there and how it is different in your sample. It is on you to find additionell information about these changes, e.g. with the help by VEP like Erin_Ensembl tell before.

ADD REPLY
1
Entering edit mode

Yes indeed, thanks Finswimmer.

Depending on what you're doing in your analysis you might want to use VEP rather than BCF tools. In using the VEP you can apply filters based on the gene IDs or names, rather than specifying the genomic coordinate which could save you the extra step. You'd also gain other information about consequence (are these variants affecting protein coding regions or not) and other useful information like phenotype annotations and pathogenicity scores.

You can use the VEP through the web tool, REST API or Perl API if that's what you're wanting to do. Full info here.

ADD REPLY
0
Entering edit mode

Hi finswimmer and Erin_Ensembl,

Thanks for your help, all very clear. Using VEP might be an option for the future, but unfortunately not for now because for patients' privacy reasons I cannot upload VCF files anywhere.

Thanks again! :-)

ADD REPLY
0
Entering edit mode

No problem!

If you decide you want to use the VEP in the future, you can get around the uploading the VCF by using the VEP script/Perl API. If you install the cache (available for hg19/GRCh37) then you can run the analysis locally and offline, hence you don't need to worry about the data privacy.

ADD REPLY
0
Entering edit mode

Hi Erin_Ensembl and finswimmer

Thanks for your help. I have another question for you. Among the variants that I select in my VCF files through the bcftools view command, I need to choose only the non-synonymous variants.

Do you know how I can modify my bcftools view accordingly? Or what bash command can I use to do that?

Cheers

ADD REPLY
0
Entering edit mode

Hello,

You can only do this if you have annotated your VCF with consequence terms, which are listed in the INFO field. Looking at your previous example it looks like you don't. This is an example of what it might look like:

8   38144352    rs551783234 A   G   .   .   CSQ=G|missense_variant|MODERATE|STAR|ENSG00000147465|Transcript|ENST00000276449|protein_coding|7/7||||1226|779|260|L/P|cTg/cCg|rs551783234&CM053426||-1||HGNC|HGNC:11359|1|P1|deleterious(0)|probably_damaging(0.994)|0.0002|likely_pathogenic||1&1|||||,G|missense_variant|MODERATE|STAR|ENSG00000147465|Transcript|ENST00000522050|protein_coding|5/5||||621|622|208|C/R|Tgt/Cgt|rs551783234&CM053426||-1|cds_start_NF&cds_end_NF|HGNC|HGNC:11359|5||deleterious(0)|benign(0)|0.0002|likely_pathogenic||1&1|||||

If you have this, great. You can use VCFtools to simply filter using the info field filtering (see -keep-INFO here). If not, then you can annotate your variants with VEP, or there are other tools available.

Also, non-synonymous typically = missense, but you might want to include other protein altering variants such as stop gained or frameshift variants.

ADD REPLY
0
Entering edit mode

Hi Erin_Ensembl Thanks for your reply, very useful.

Unfortunately, I don't have that information in the variants of my VCF files. I plan to use SnpEff to annotate it.

To select the non-synonymous variants, should I filter only the "missense_variant" variants then? Or something more?

ADD REPLY
0
Entering edit mode

Hi Erin_Ensembl

another question: I am exploring VEP; how are non-synonymous variants labeled in a typical VEP output? I see many biotypes (nonsense_mediated_decay, protein_coding, retained_intron, etc) but I cannot understand if one is suitable for me.

Thanks!

ADD REPLY
0
Entering edit mode

Hey, The full list of variant consequences annotated by the VEP are here. So missense variants will give you SNPs that change the amino acid within the protein, this may be serious for protein function and structure, or not, depending on the chemical properties and conservation of that amino acid at that position in the protein. Pathogenicity predictions are used to determine this.

Other SNPs may lead to a start gain/lost modification which is also going to affect the protein sequence. Aside from SNPs, you may have insertions or deletions which will affect the reading frame (frameshift variants) and also affect the protein sequence.

We have a group of annotations in the Ensembl browser which we refer to as 'missense and protein truncating variant (PTV)' which includes: - missense variant - splice acceptor variant - splice donor variant - stop gained - frameshift variant

ADD REPLY

Login before adding your answer.

Traffic: 1617 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6