Understanding DiscoSNP++ output VCF file
7.0 years ago
achyR ▴ 10

Hello

I have a small question about the output of discoSNP++. While analyzing the VCF file generated by vcfcreator, I found several "genotypes", which are as follows:

.|. ./. 0|0 0/0 0|1 0/1 1|1 1/1

I was wondering if someone could help me understand what "./.", ".|.", "0/0" and "0|0" mean.

Thank you for your help.

SNP VCF discoSNP++ Genotype • 1.6k views
7.0 years ago

Hi Achal, thanks for your question.

Here is an explanation (not limited to discoSnp, and written for diploid species).

A genotype tells, for each variant, whether a sample carries the reference allele and/or the alternative allele.

  • with a / :
    • with a reference genome: the first value corresponds to the reference genome.
    • without a reference genome (discoSnp only): the choice of the reference versus alternative allele is random
  • with a | : the variant is phased with the previous one. The first value corresponds to the same allele as the first value of the previous genotype. This explains why the 1|0 genotype exists.

About the values:

  • ./. : the variant is not seen (missing data).
  • 0/0: homozygous, only the reference allele is present.
  • 1/1: homozygous, only the alternative allele is present.
  • 0/1: heterozygous, both alleles are present.
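
The rules above can be sketched as a small helper. This is only an illustration, assuming diploid genotypes with a single alternative allele; the function name `describe_genotype` is made up for this example and is not part of discoSnp.

```python
def describe_genotype(gt: str) -> str:
    """Describe a diploid VCF GT string such as '0/1', '1|0' or './.'."""
    # '|' marks a phased genotype, '/' an unphased one.
    phase = "phased" if "|" in gt else "unphased"
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2:
        raise ValueError(f"not a diploid genotype: {gt!r}")
    if "." in alleles:
        # '.' means the allele could not be called (missing data).
        return f"missing ({phase})"
    if alleles[0] != alleles[1]:
        return f"heterozygous ({phase})"
    if alleles[0] == "0":
        return f"homozygous reference ({phase})"
    return f"homozygous alternative ({phase})"


for gt in [".|.", "./.", "0|0", "0/0", "0|1", "0/1", "1|1", "1/1"]:
    print(gt, "->", describe_genotype(gt))
```

Note that 0|1 and 1|0 are both reported as heterozygous; only the phase information distinguishes them.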

Hope this helps, Pierre

7.0 years ago
achyR ▴ 10

Hello Pierre

Thank you for your reply. It was helpful. However, I am still confused about how to interpret "./."

I have 50 samples listed in the .fof file. Upon completion, discoSNP++ (followed by vcfcreator) outputs a contig FASTA file and a VCF file. The VCF file contains numerous rows, each corresponding to a single variant, and 9 + 50 columns. The last 50 columns hold the variant information for each of the 50 samples used. Now take an example row from the output VCF file:

SNP_higher_path_9480770 56 9480770 C T . . Ty=SNP;Rk=1;UL=6;UR=20;CL=.;CR=.;Genome=.;Sd=. GT:DP:PL:AD:HQ ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 0/0:11:5,37,224:11,0:66,0 ./.:1:.,.,.:1,0:68,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:5:.,.,.:0,5:0,63 ./.:0:.,.,.:0,0:0,0 1/1:1259:25184,3794,59:0,1259:0,66 ./.:0:.,.,.:0,0:0,0 1/1:43:864,134,6:0,43:0,65 1/1:38:764,119,6:0,38:0,64 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 1/1:34:684,107,6:0,34:0,66 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0

Here you see that most columns have "./." and some have "1/1".
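
To tally the genotypes in such a row programmatically, something like the following could be used. It is a minimal sketch, assuming a tab-separated VCF line with 9 fixed columns followed by one column per sample; the helper names `parse_sample` and `count_genotypes` are hypothetical, and the example line keeps only four of the 50 sample columns.

```python
from collections import Counter


def parse_sample(format_field: str, sample_field: str) -> dict:
    """Pair the FORMAT keys (e.g. GT:DP:PL:AD:HQ) with one sample's values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def count_genotypes(vcf_line: str) -> Counter:
    """Count the GT values over all sample columns of one VCF data line."""
    columns = vcf_line.rstrip("\n").split("\t")
    format_field, samples = columns[8], columns[9:]
    return Counter(parse_sample(format_field, s)["GT"] for s in samples)


# Shortened version of the row above (4 samples instead of 50).
line = "\t".join([
    "SNP_higher_path_9480770", "56", "9480770", "C", "T", ".", ".",
    "Ty=SNP;Rk=1;UL=6;UR=20;CL=.;CR=.;Genome=.;Sd=.",
    "GT:DP:PL:AD:HQ",
    "./.:0:.,.,.:0,0:0,0",
    "0/0:11:5,37,224:11,0:66,0",
    "./.:5:.,.,.:0,5:0,63",
    "1/1:1259:25184,3794,59:0,1259:0,66",
])
print(count_genotypes(line))
```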

Now my question is: how should I interpret samples with genotype "./."? Does it mean that the contig "SNP_higher_path_9480770" is missing in that particular sample, or that the contig is present but without any variation?

Hope you get my query. Thanks


./. (for read set i): the variant whose id is SNP_higher_path_9480770 does not have enough corresponding reads in read set i.

"Not enough" means that both alleles are not read-coherent (cf. the read-coherence definition in the publication).
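
In the row quoted in the question, two of the ./. samples still report a few reads (DP of 1 and 5), which is consistent with "some reads, but not enough". A rough way to list such samples is sketched below; it assumes the GT:DP:PL:AD:HQ layout shown in the question, and the function name `missing_with_reads` is invented for this example.

```python
def missing_with_reads(format_field: str, sample_fields: list) -> list:
    """Return (sample index, DP, AD) for samples with a missing GT but DP > 0."""
    keys = format_field.split(":")
    hits = []
    for index, field in enumerate(sample_fields):
        record = dict(zip(keys, field.split(":")))
        # A missing genotype with non-zero depth: reads map to the variant,
        # but the genotype was still not called.
        if record["GT"] in ("./.", ".|.") and int(record["DP"]) > 0:
            hits.append((index, int(record["DP"]), record["AD"]))
    return hits


# Four sample fields copied from the row in the question.
samples = [
    "./.:0:.,.,.:0,0:0,0",
    "./.:1:.,.,.:1,0:68,0",
    "./.:5:.,.,.:0,5:0,63",
    "1/1:43:864,134,6:0,43:0,65",
]
print(missing_with_reads("GT:DP:PL:AD:HQ", samples))
```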

Pierre
