Retrieve statistical information from a vcf file using VCF tools
2
0
Entering edit mode
9.3 years ago
niharraul • 0

Hello Everyone,

I have a vcf file containing variants for a 7-8 k patients (exome sequencing) which is not tabix indexed and I am trying to retrieve the following information:

  1. information on columns header
  2. number of monomorphic variants, number of multi-allelic variants
  3. distinguish number of SNPS and indels
  4. Depth of sequencing , human genome build each patient is
  5. exact position of monomorphic and multi-allelic variant
  6. frequency distribution of variants

I am looking for all this information for all the chromosome together and on per chromosome basis. I am using VCFtools on linux environment. But being new to this field I have no clue on how I will be proceeding. All the help is appreciated, if detailed it will be better.

Regards

Nihar

snp • 5.6k views
ADD COMMENT
1
Entering edit mode
9.3 years ago
DG 7.3k

Well, for some of what you are doing you'll probably need more than just vcftools. And it does probably help to bgzip and tabix the file.

A lot of the raw stats you can get with vcf-stat, although I would highly recommend using bcftools stat instead as it is much quicker (uses the fast htslib c-library). That will give you a whole bunch of aggregated statistics on your vcf file. I'll go through each of your points:

  1. What sort of information on column header? A VCF file has pretty standard columns with the only real variable being the columns for individual samples.
  2. bcftools stats will give you this
  3. bcftools stats will give you this
  4. All of the samples in the vcf file should be the same human genome build, and this info should be in the header. Since vcf file rows are laid out by coordinates, and it is the genomic coordinates that tend to change between builds, a file where samples were not aligned and called to the same reference build would be useless and lead to all sorts of downstream analysis problems. If you could even create such a file in the first place. So you shouldn't need a special tool for this, just check the vcf file header. For the depth of sequencing what do you want to measure? Just depth per sample at each position that is present in the vcf file? Depending on the process used to create the VCF file this COULD be every position in the genome, but more likely only includes positions where at least one sample had a variant allele. If so there is a DP field in the genotypes string per sample. It isn't the total reads but is a count of the reads covering that position that passed filters (as defined by whatever program was used for genotype/variant calling). Depending on what you want, if you have access to the BAM files you may want to look at samtools, GATK, bedtools, etc for calculating coverage statistics.
  5. For this I guess you would want to just write a small script, either in perl (so you can sue the VCF.pm library from vcftools) or python (using PyVCF) or whatever language you chose to go through all of the positions in the VCF file and output locations based on passing specified criteria.
  6. Frequency distribution of what specifically? I am assuming you mean something more like variant type or subtype? Transitions, transversions, multi-allelic? specific substitution patterns? For some of this bcftools stats will give you raw counts, you can take the total output and send it to a file and then pass it to plot-vcfstats to get some visualizations. You could also write a script to do some secondary work on the output stats if it isn't produced/covered by the plots.
ADD COMMENT
0
Entering edit mode

Responding to these followup questions:

Hi Dan,

Thanks a lot for your reply. Started with the first 3 point and will write again about the rest.

  1. For column header: General information about each column for example: first column say contains chromosome number , second contains Position, third contains what? And so on ...
  2. Which exact command provides the count of monomorphic and alternative variants
  3. I am trying the following command with the first command I try, the output is following error: bcftools plugin counts input.vcf.gz ..... displays an error : No functional bcftools plugin were found. The enviornment variable BCFTOOLS_PLUGIN is not set. How do I set the environment variable it will be great of you could please provide me how to set the environment variable.

Regards
Nihar

  1. Again, the VCF format is a specification, and the columns are defined by that specification. The most recent (4.2) spec can be found here. It is always Chrom, Pos, ID (usually dbsnp ID), Ref, Alt, Qual, Filter, Info, and Format. The following columns after that are sample columns with the genotype data.
  2. There is no specific command beyond bcftools stats <bgzipped vcf file name>. There are command line options that you can use to refine the stats but the base command outputs a bunch of info, if you look through the results you'll see the data you are looking for.
  3. I haven't been using any plugins and the exact location of your plugins directory would depend on how and where you installed bcftools. How you set environment variables depends on the OS you are using. In Linux you usually set it in .bashrc, .profile, or .bash_profile depending on the terminal you are using, and the distribution. OSX is usually in something like .profile I believe (all of these are usually found in your home directory) and the syntax again depends on the terminal you are using. For most with bash (or dash) it is something like export ENV_VARIABLE="Something" so for instance export BCFTOOLS_PLUGIN="/path/to/bcftools/plugin/directory" You can find detailed instructions for setting environment variables for your particular OS/distribution fairly easily with a few google searches if you need more information.
ADD REPLY
0
Entering edit mode
9.3 years ago
niharraul • 0

Hi Dan,

Thanks a lot for your reply. Started with the first 3 point and will write again about the rest.

  1. For column header: General information about each column for example: first column say contains chromosome number , second contains Position, third contains what? And so on...
  2. Which exact command provides the count of monomorphic and alternative variants
  3. I am trying the following command with the first command I try, the output is following error: bcftools plugin counts input.vcf.gz ..... displays an error : No functional bcftools plugin were found. The enviornment variable BCFTOOLS_PLUGIN is not set. How do I set the environment variable it will be great of you could please provide me how to set the environment variable.

Regards
Nihar

ADD COMMENT
0
Entering edit mode

This should really be posted as a comment to my answer as opposed to an answer on its own. I will respond with details as a comment.

ADD REPLY

Login before adding your answer.

Traffic: 2850 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6