VCF/BCF File format QS score in INFO header
8.6 years ago
ga32huv ▴ 10

Hi everyone,

After calling samtools mpileup my file header looks like this:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##samtoolsVersion=1.3.1+htslib-1.3.1
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is better)">
##INFO=<ID=MQB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias (bigger is better)">
##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##INFO=<ID=I16,Number=16,Type=Float,Description="Auxiliary tag used for calling, see description of bcf_callret1_t in bam2bcf.h">
##INFO=<ID=QS,Number=R,Type=Float,Description="Auxiliary tag used for calling">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">

I could not find documentation that comprehensively explains the INFO header. I would like to better understand the QS score (Auxiliary tag used for calling). Is there a correlation between QS and PL?

Example:

chrom   pos ref alt QUAL    PL:DP   DP  QS  VDB SGB RPB MQB MQSB    BQB

chr1    6579653 G   C,A,T   0   245,0,255,255,255,255,255,255,255,255:798   798 0.732312,0.264416,0.00255397,0.000718305    1,13E-08    -0,693147   0,161204    1   1   0,999199
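
A quick consistency check on the row above, as a minimal Python sketch. It assumes that G is the reference allele and C,A,T are the alternates (four alleles in total); the variable names are only illustrative. The header declares QS with Number=R and PL with Number=G, so for a diploid call the row should carry 4 QS values and 10 PL values.

```python
# Values copied from the example row; allele roles are an assumption.
alleles = ["G", "C", "A", "T"]
qs = [float(x) for x in "0.732312,0.264416,0.00255397,0.000718305".split(",")]
pl = [int(x) for x in "245,0,255,255,255,255,255,255,255,255".split(",")]

n = len(alleles)
expected_r = n                  # Number=R: one value per allele, REF included
expected_g = n * (n + 1) // 2   # Number=G: one value per diploid genotype

print(len(qs) == expected_r)    # True
print(len(pl) == expected_g)    # True
print(round(sum(qs), 4))        # 1.0
```

For this row the QS values sum to roughly 1, which is consistent with QS being a per-allele fraction, although the header description alone does not confirm that.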

I would be very grateful for any information I can get. Maybe I overlooked something, but these are the main links where I have searched for an answer.

http://samtools.sourceforge.net/mpileup.shtml

https://samtools.github.io/hts-specs/VCFv4.2.pdf

The VCF specification

Meta-information lines

File format

  • always required
  • must be the first line in the file
  • details VCF format version number
    • e.g. ##fileformat=VCFv4.2

Information field format

  • template: ##INFO=<ID=ID,Number=number,Type=type,Description="description",Source="source",Version="version">
    • e.g. ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
  • Possible Types for INFO fields
    • Integer
    • Float
    • Flag
    • Character
    • String
  • Number entry is an Integer that describes the number of values that can be included with the INFO field
    • 1 if the INFO field contains a single number, 2 if the field describes a pair of numbers, and so on
    • special characters for special cases
      • 'A' if the field has one value per alternate allele
      • 'R' if the field has one value for each possible allele (including the reference)
      • 'G' if the field has one value for each possible genotype (more relevant to the FORMAT tags)
      • '.' if the number of possible values varies, is unknown, or is unbounded
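
The INFO template and the Number rules above can be sketched in Python as follows. This is only an illustration, not how any particular tool parses headers: `parse_meta_line` and `expected_values` are hypothetical helper names, the `ploidy` parameter is my own addition for the 'G' case, and `math.comb` needs Python 3.8 or later.

```python
import re
from math import comb

def parse_meta_line(line):
    """Split a structured meta line such as ##INFO=<...> into (kind, fields)."""
    match = re.match(r"##(\w+)=<(.*)>$", line.strip())
    if match is None:
        raise ValueError("not a structured meta-information line")
    kind, body = match.groups()
    # Quoted values may contain commas, so match key=value pairs explicitly.
    pairs = re.findall(r'(\w+)=("[^"]*"|[^,]*)', body)
    return kind, {key: value.strip('"') for key, value in pairs}

def expected_values(number, n_alt, ploidy=2):
    """How many values a field should carry at a site with n_alt ALT alleles."""
    if number == "A":
        return n_alt
    if number == "R":
        return n_alt + 1
    if number == "G":
        # Unordered genotypes of `ploidy` alleles drawn from n_alt + 1 alleles.
        return comb(n_alt + ploidy, ploidy)
    if number == ".":
        return None  # varies, unknown, or unbounded
    return int(number)

kind, fields = parse_meta_line(
    '##INFO=<ID=QS,Number=R,Type=Float,Description="Auxiliary tag used for calling">'
)
print(kind, fields["ID"], fields["Number"])          # INFO QS R
print(expected_values(fields["Number"], n_alt=3))    # 4
print(expected_values("G", n_alt=3))                 # 10
```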

Filter field format

  • filters that have been applied to the data
  • template: ##FILTER=<ID=ID,Description="description">
  • e.g. ##FILTER=<ID=q10,Description="Quality below 10">

Individual format field format

  • template: ##FORMAT=<ID=ID,Number=number,Type=type,Description="description">
  • e.g. ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
  • Possible Types for FORMAT fields are
    • Integer
    • Float
    • Character
    • String

Alternative allele field format

  • Symbolic alternate alleles for imprecise structural variants
  • can be a colon-separated list of types and subtypes
  • ID values are case sensitive strings and may not contain whitespace or angle brackets
  • template: ##ALT=<ID=type,Description=description>
  • The first level type must be one of the following
    • DEL
      • Deletion relative to the reference
    • INS
      • Insertion of novel sequence relative to the reference
    • DUP
      • Region of elevated copy number relative to the reference
    • INV
      • Inversion of reference sequence
    • CNV
      • Copy number variation region (may be both deletion and duplication)
      • CNV category should not be used when a more specific category can be applied
      • Reserved subtypes include
        • DUP:TANDEM
          • Tandem duplication
        • DEL:ME
          • Deletion of mobile element relative to the reference
        • INS:ME
          • Insertion of a mobile element relative to the reference
  • For all of the ##INFO, ##FORMAT, ##FILTER, and ##ALT metainformation, extra fields can be included after the default fields
    • e.g. ##INFO=<ID=ID,Number=number,Type=type,Description="description",Source="description",Version="128">
    • Optional fields should be stored as strings even for numeric values

Assembly field format

  • Breakpoint assemblies for structural variations may use external file
  • template: ##assembly=url
  • The URL field specifies the location of a fasta file containing breakpoint assemblies referenced in the VCF records for structural variants via the BKPTID INFO key

Contig field format

  • highly recommended but not required
  • The contigs referred to in the VCF file
  • Allowing these contigs to come from different files
  • e.g. ##contig=<ID=ctg1,URL=ftp://somewhere.org/assembly.fa,...>

Sample field format

  • To define sample to genome mappings
  • e.g. ##SAMPLE=<ID=S_ID,Genomes=G1_ID;G2_ID; ...;GK_ID,Mixture=N1;N2; ...;NK,Description=S1;S2; ...;SK>

Pedigree field format

  • To record relationships between genomes
    • e.g. ##PEDIGREE=<Name_0=G0-ID,Name_1=G1-ID,...,Name_N=GN-ID>
  • Or a link to a database
    • e.g. ##pedigreeDB=<url>

Header line syntax

  • 8 fixed, mandatory columns
    1. #CHROM
    2. POS
    3. ID
    4. REF
    5. ALT
    6. QUAL
    7. FILTER
    8. INFO
  • tab-delimited
  • If genotype data is present in the file, these are followed by a FORMAT column header, then an arbitrary number of sample IDs
    • e.g. #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
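
A minimal Python sketch of checking the header line described above and extracting the sample IDs. The function name `parse_header_line` is hypothetical, and the example line is built with real tab characters because the header is tab-delimited.

```python
FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_header_line(line):
    """Validate the #CHROM header line and return the list of sample IDs."""
    columns = line.rstrip("\n").split("\t")
    if columns[:8] != FIXED_COLUMNS:
        raise ValueError("header line does not start with the 8 fixed columns")
    if len(columns) == 8:
        return []  # no genotype data in this file
    if columns[8] != "FORMAT":
        raise ValueError("ninth column must be FORMAT when genotype data is present")
    return columns[9:]

header = "\t".join(FIXED_COLUMNS + ["FORMAT", "NA00001", "NA00002", "NA00003"])
print(parse_header_line(header))  # ['NA00001', 'NA00002', 'NA00003']
```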

Data lines

Fixed fields

  1. CHROM
    • chromosome
    • String, no white-space permitted, required
    • An identifier from the reference genome or an angle-bracketed ID String pointing to a contig in the assembly file
      • cf. the ##assembly line in the header
    • All entries for a specific CHROM should form a contiguous block within the VCF file
    • The colon symbol (:) must be absent from all chromosome names
  2. POS
    • position
    • Integer, required
    • Positions are sorted numerically, in increasing order, within each reference sequence CHROM
    • Having multiple records with the same POS is permitted
    • Telomeres are indicated by using positions 0 or N+1 where N is the length of the corresponding chromosome or contig
  3. ID
    • identifier
    • String, no white-space or semi-colons permitted
    • Semi-colon separated list of unique identifiers where available
    • encouraged to use the rs number(s) if this is a dbSNP variant
    • No identifier should be present in more than one data record
    • missing value should be used if there is no identifier available
  4. REF
    • reference base(s)
    • String, required
    • Each base must be one of A,C,G,T,N (case sensitive)
    • Multiple bases are permitted
    • The value in the POS field refers to the position of the first base in the String
    • For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty
      • The REF and ALT Strings must include the base 'before' the event
        • this must be reflected in the POS field
      • unless the event occurs at position 1 on the contig
        • in which case they must include the base 'after' the event
      • This padding base is not required, although permitted, in other cases
        • e.g. complex substitutions or other events where all alleles have at least one base represented in their Strings
    • If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String "<ID>")
      • The padding base is required
      • POS denotes the coordinate of the base preceding the polymorphism
    • Tools processing VCF files are not required to preserve case in the allele String
  5. ALT
    • alternate base(s)
    • String; no whitespace, commas, or angle-brackets are permitted in the ID String itself
    • Comma separated list of alternate non-reference alleles called on at least one of the samples
    • A,C,G,T,N,* (case insensitive) or an angle-bracketed ID String ("<ID>")
    • or a breakend replacement string as described in the section on breakends
    • The '*' allele is reserved to indicate that the allele is missing due to an upstream deletion
    • If there are no alternative alleles
      • the missing value should be used
    • Tools processing VCF files are not required to preserve case in the allele String, except for IDs, which are case sensitive
  6. QUAL
    • quality
    • Numeric
    • Phred-scaled quality score for the assertion made in ALT
  7. FILTER
    • filter status
    • String, no white-space or semi-colons permitted
    • PASS if this position has passed all filters
      • i.e. a call is made at this position
    • If the site has not passed all filters
      • a semicolon-separated list of codes for filters that fail
        • e.g. "q10;s50" might indicate that at this site the quality is below 10 and the number of samples with data is below 50% of the total number of samples
    • '0' is reserved and should not be used as a filter String
    • If filters have not been applied
      • This field should be set to the missing value
  8. INFO
    • additional information
    • String, no white-space, semicolons, or equals-signs permitted
    • commas are permitted only as delimiters for lists of values
    • Encoded as a semicolon-separated series of short keys with optional values
      • format: <key>=<data>[,data]
      • Arbitrary keys are permitted
    • reserved subfields
      • AA
        • ancestral allele
      • AC
        • allele count in genotype, for each ALT allele, in the same order as listed
      • AF
        • allele frequency for each ALT allele in the same order as listed
        • use this when estimated from primary data, not called genotype
      • AN
        • total number of alleles in called genotypes
      • BQ
        • RMS base quality at this position
      • CIGAR
        • cigar string describing how to align an alternate allele to the reference allele
      • DB
        • dbSNP membership
      • DP
        • combined depth across samples, e.g. DP=154
      • END
        • end position of the variant described in this record
          • for use with symbolic alleles
      • H2
        • membership in hapmap2
      • H3
        • membership in hapmap3
      • MQ
        • RMS mapping quality, e.g. MQ=52
      • MQ0
        • Number of MAPQ == 0 reads covering this record
      • NS
        • Number of samples with data
      • SB
        • strand bias at this position
      • SOMATIC
        • indicates that the record is a somatic mutation
        • for cancer genomics
      • VALIDATED
        • validated by follow-up experiment
      • 1000G
        • membership in 1000 Genomes
    • The exact format of each INFO sub-field should be specified in the meta-information
      • e.g. DP=154;MQ=52;H2 for an INFO field
    • Keys without corresponding values are allowed in order to indicate group membership
      • e.g. H2 indicates the SNP is found in HapMap 2
    • Not necessary to list all the properties that a site does NOT have
      • e.g. H2=0
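
The INFO encoding described above (semicolon-separated keys, optional comma-separated values, bare keys as flags) can be illustrated with a short Python sketch. `parse_info` is a hypothetical helper; values are left as strings because their types are declared in the meta-information, which this sketch does not read.

```python
def parse_info(info):
    """Turn an INFO column into a dict; flags map to True, values to lists of strings."""
    if info == ".":
        return {}  # missing value: no INFO entries at all
    result = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        # A key without '=' is a flag that indicates group membership.
        result[key] = value.split(",") if sep else True
    return result

print(parse_info("DP=154;MQ=52;H2"))
# {'DP': ['154'], 'MQ': ['52'], 'H2': True}
```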

Genotype fields

  • If genotype information is present
    • The same type of data must be present for all samples
  • FORMAT field is given specifying the data types and order
    • colon-separated alphanumeric String
  • FORMAT field is followed by one field per sample corresponding to the types specified in the format
    • colon-separated
  • The first sub-field must always be the genotype (GT) if it is present
  • No required sub-fields
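
As an illustration of how the FORMAT column lines up with a sample column, here is a minimal Python sketch. `parse_sample` is a hypothetical helper, and the padding step reflects my understanding that trailing sub-fields may be dropped from a sample column. The example reuses the PL:DP pair from the row earlier in this post.

```python
def parse_sample(format_field, sample_field):
    """Pair the colon-separated FORMAT keys with one sample's values."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # Assumption: dropped trailing sub-fields are treated as missing ('.').
    values += ["."] * (len(keys) - len(values))
    return dict(zip(keys, values))

sample = parse_sample("PL:DP", "245,0,255,255,255,255,255,255,255,255:798")
print(sample["DP"])                  # 798
print(len(sample["PL"].split(",")))  # 10
```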

Reserved keywords (common and standard across the community)

  • GT
    • genotype
    • encoded as allele values separated by either of / or |
    • The allele values are
      • 0 for the reference allele (what is in the REF field)
      • 1 for the first allele listed in ALT
      • 2 for the second allele listed in ALT
      • and so on
    • For haploid calls
      • e.g. on Y, male non-pseudoautosomal X, or mitochondrion
      • only one allele value should be given
    • A triploid call
      • might look like: 0/0/1
    • If a call cannot be made for a sample at a given locus
      • '.' should be specified for each missing allele in the GT field
        • e.g. './.' for a diploid genotype and '.' for haploid genotype
    • The meanings of separators
      • /
        • genotype unphased
      • |
        • genotype phased
  • DP
    • read depth at this position for this sample
  • FT
    • sample genotype filter indicating if this genotype was "called"
    • similar in concept to the FILTER field
    • use PASS to indicate that all filters have been passed
    • a semi-colon separated list of codes for filters that fail
    • '.' to indicate that filters have not been applied
    • should be described in the meta-information in the same way as FILTERs
  • GL
    • genotype likelihoods
    • comprised of comma separated floating point log10-scaled likelihoods
      • for all possible genotypes given the set of alleles defined in the REF and ALT field
    • In presence of the GT field
      • the same ploidy is expected
      • the canonical order is used
    • Without GT field
      • diploidy is assumed
    • if A is the allele in REF and B,C, ... are the alleles as ordered in ALT
      • the ordering of genotypes for the likelihoods is given by: F(j/k) = (k*(k+1)/2)+j.
    • For biallelic sites
      • the ordering is: AA, AB, BB
    • For triallelic sites
      • the ordering is: AA,AB,BB,AC,BC,CC, etc.
    • e.g. GT:GL 0/1:-323.03,-99.29,-802.53
  • GLE
    • genotype likelihoods of heterogeneous ploidy
    • used in presence of uncertain copy number
    • e.g. GLE=0:-75.22,1:-223.42,0/0:-323.03,1/0:-99.29,1/1:-802.53
  • PL
    • the phred-scaled genotype likelihoods rounded to the closest integer
      • otherwise defined precisely as the GL field
  • GP
    • the phred-scaled genotype posterior probabilities
      • otherwise defined precisely as the GL field
      • intended to store imputed genotype probabilities
  • GQ
    • conditional genotype quality
    • encoded as a phred quality
    • -10log10 p(genotype call is wrong, conditioned on the site's being variant)
  • HQ
    • haplotype qualities
    • two comma separated phred qualities
  • PS
    • phase set
    • A phase set is defined as a set of phased genotypes to which this genotype belongs
    • Phased genotypes for an individual that are on the same chromosome and have the same PS value are in the same phased set
    • A phase set specifies multi-marker haplotypes for the phased genotypes in the set
    • All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set
    • If the genotype in the GT field is unphased
      • the corresponding PS field is ignored
    • The recommended convention is to use the position of the first variant in the set as the PS identifier
      • not required
  • PQ
    • phasing quality
    • the phred-scaled probability that alleles are ordered incorrectly in a heterozygote
      • against all other members in the phase set
    • no specific measure is included for precisely defining "phasing quality"
    • the PQ tag is only reserved for future use as a measure of phasing quality
  • EC
    • comma separated list of expected alternate allele counts for each alternate allele in the same order as listed in the ALT field
      • typically used in association analyses
  • MQ
    • RMS mapping quality
    • similar to the version in the INFO field
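
The genotype ordering F(j/k) = (k*(k+1)/2)+j and the GL-to-PL relation from the list above can be sketched in Python as follows. The function names are hypothetical, diploidy is assumed throughout, and `gl_to_pl` applies only the plain Phred scaling stated above; callers commonly also shift PL so that the best genotype is 0, which this sketch does not do.

```python
def genotype_index(j, k):
    """Position of diploid genotype j/k in a GL or PL list."""
    j, k = sorted((j, k))
    return k * (k + 1) // 2 + j

def genotype_order(n_alleles):
    """All diploid genotypes in canonical order, as (j, k) allele index pairs."""
    return [(j, k) for k in range(n_alleles) for j in range(k + 1)]

def gl_to_pl(gl):
    """Phred-scale log10 likelihoods and round to the closest integer."""
    return [round(-10 * g) for g in gl]

print(genotype_order(3))
# [(0, 0), (0, 1), (1, 1), (0, 2), (1, 2), (2, 2)]  i.e. AA, AB, BB, AC, BC, CC
print(gl_to_pl([-323.03, -99.29, -802.53]))
# [3230, 993, 8025]

# PL values from the example row earlier in this post (4 alleles, 10 genotypes).
pl = [245, 0, 255, 255, 255, 255, 255, 255, 255, 255]
print(genotype_order(4)[pl.index(min(pl))])
# (0, 1)
```

Read this way, the lowest PL in the example row sits at index 1, which corresponds to the 0/1 genotype, i.e. the reference allele together with the first ALT allele.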

Strict type of keywords

  • GT
    • encoded as allele values separated by either of / or |
    • The allele values are
      • 0 for the reference allele (what is in the REF field)
      • 1 for the first allele listed in ALT
      • 2 for the second allele listed in ALT
    • The meanings of separators
      • /
        • genotype unphased
      • |
        • genotype phased
  • DP
    • Integer
  • FT
    • String, no white-space or semi-colons permitted
  • GL
    • Floats
  • GLE
    • String
  • PL
    • Integers
  • GP
    • Floats
  • GQ
    • Integer
  • HQ
    • Integers
  • PS
    • Non-negative 32-bit Integer
  • PQ
    • Integer
  • EC
    • Integer
  • MQ
    • Integer
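
The type list above could be applied when reading a sample column roughly as in the following Python sketch. The table `FORMAT_TYPES` and the function `convert` are hypothetical names; GT, FT and GLE are left as strings, and any key not in the table is returned unchanged.

```python
# Types taken from the list above; String-typed keys are simply not listed.
FORMAT_TYPES = {
    "DP": int, "GL": float, "PL": int, "GP": float, "GQ": int,
    "HQ": int, "PS": int, "PQ": int, "EC": int, "MQ": int,
}

def convert(key, raw):
    """Convert one raw sub-field to typed values; '.' becomes None."""
    if raw == ".":
        return None
    caster = FORMAT_TYPES.get(key)
    if caster is None:
        return raw  # String-typed or unknown key: keep as-is
    return [None if v == "." else caster(v) for v in raw.split(",")]

print(convert("PL", "245,0,255"))       # [245, 0, 255]
print(convert("GL", "-323.03,-99.29"))  # [-323.03, -99.29]
print(convert("GT", "0/1"))             # 0/1
```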

https://cseweb.ucsd.edu/classes/sp16/cse182-a/notes/VCFv4.2.pdf

Thank you very much.

snp • 5.7k views

I have the exact same question. Can't find an answer anywhere and have no idea why this hasn't been answered!
