18 months ago
kgwkk2
This is a normal VCF header structure:
## [1] "##fileformat=VCFv4.1"
## [1] "##source=\"GATK haplotype Caller, phased with beagle4\""
## [1] "##FILTER=<ID=LowQual,Description=\"Low quality\">"
## [1] "##FORMAT=<ID=AD,Number=.,Type=Integer,Description=\"Allelic depths fo [Truncated]"
## [1] "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Approximate read [Truncated]"
## [1] "##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=\"Genotype Quality\">"
## [1] "First 6 rows."
## [1]
## [1] "***** Fixed section *****"
## CHROM POS ID REF ALT QUAL FILTER
## [1,] "Supercontig_1.50" "2" NA "T" "A" "44.44" NA
## [2,] "Supercontig_1.50" "246" NA "C" "G" "144.21" NA
## [3,] "Supercontig_1.50" "549" NA "A" "C" "68.49" NA
## [4,] "Supercontig_1.50" "668" NA "G" "C" "108.07" NA
## [5,] "Supercontig_1.50" "765" NA "A" "C" "92.78" NA
## [6,] "Supercontig_1.50" "780" NA "G" "T" "58.38" NA
## [1]
## [1] "***** Genotype section *****"
## FORMAT BL2009P4_us23 DDR7602
## [1,] "GT:AD:DP:GQ:PL" "0|0:62,0:62:99:0,190,2835" "0|0:12,0:12:39:0,39,585"
## [2,] "GT:AD:DP:GQ:PL" "1|0:5,5:10:99:111,0,114" NA
## [3,] "GT:AD:DP:GQ:PL" NA NA
## [4,] "GT:AD:DP:GQ:PL" "0|0:1,0:1:3:0,3,44" NA
## [5,] "GT:AD:DP:GQ:PL" "0|0:2,0:2:6:0,6,49" "0|0:1,0:1:3:0,3,34"
## [6,] "GT:AD:DP:GQ:PL" "0|0:2,0:2:6:0,6,49" "0|0:1,0:1:3:0,3,34"
## IN2009T1_us22 LBUS5 NL07434
## [1,] "0|0:37,0:37:99:0,114,1709" "0|0:12,0:12:39:0,39,585" NA
## [2,] "0|1:2,1:3:16:16,0,48" NA NA
## [3,] "0|0:2,0:2:6:0,6,51" NA NA
## [4,] "1|1:0,1:1:3:25,3,0" NA "0|0:1,0:1:3:0,3,28"
## [5,] "0|0:1,0:1:3:0,3,31" "0|0:1,0:1:3:0,3,34" "0|0:1,0:1:3:0,3,26"
## [6,] "0|0:3,0:3:9:0,9,85" "0|0:1,0:1:3:0,3,34" NA
## [1] "First 6 columns only."
But this is my VCF file. It is a multi-sample VCF, but it still looks strange to me and carries far too much information. The INFO section and the genotype section do not look normal either.
I started from Illumina FASTQ data and used BWA, SAMtools, HaplotypeCaller, GenotypeGVCFs, SelectVariants, and VariantFiltration. Is this normal? Almost half of the tools that take a VCF as input fail with an error when I run them on this file. Please tell me what the problem is.
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele not already represented at this location by REF and ALT">
##FILTER=<ID=FILTER,Description="QD<2.0||((MQ<40.0||RankSum<-12.5||ReadPosRankSum<-8.0||FS>60.0||SOR>3.0)&&TYPE='snp')||((ReadPosRankSum<-20.0||FS>200.0||SOR>10.0)&&TYPE='indel')">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first variant in the set)">
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine=<ID=GenotypeGVCFs,CommandLine="GenotypeGVCFs --output 100-8-30h.geno.vcf --variant 100-8-30h.vcf --reference NRRL3357.fa --include-non-variant-sites false --merge-input-intervals false --input-is-somatic false --tumor-lod-to-emit 3.5 --allele-fraction-error 0.001 --keep-combined-raw-annotations false --use-posteriors-to-calculate-qual false --dont-use-dragstr-priors false --use-new-qual-calculator true --annotate-with-num-discovered-alleles false....
(omit....)
=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
##contig=<ID=NC_054691.1,length=6386556>
##contig=<ID=NC_054692.1,length=6246150>
##contig=<ID=NC_054693.1,length=5100955>
##contig=<ID=NC_054694.1,length=4658713>
##contig=<ID=NC_054695.1,length=4453722>
##contig=<ID=NC_054696.1,length=3936580>
##contig=<ID=NC_054697.1,length=3033036>
##contig=<ID=NC_054698.1,length=3179870>
##source=GenotypeGVCFs
##source=HaplotypeCaller
##source=VariantFiltration
##bcftools_mergeVersion=1.16-7-gf4dee4b+htslib-1.16-11-ga1dec95
##bcftools_mergeCommand=merge --no-index -o Merged.vcf.gz 100-8-30h.filtered.vcf.gz NRRL30797.filtered.vcf.gz 100-8-36h.filtered.vcf.gz NRRL35739.filtered.vcf.gz 100-8-42h.filtered.vcf.gz RIB537.filtered.vcf.gz 14160.filtered.vcf.gz ... (omit, multi calling samples name).... MWX2.filtered.vcf.gz Yazoo-S2.filtered.vcf.gz; Date=Wed May 3 16:57:10 2023
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 100-8-30h NRRL30797 100-8-36h NRRL35739 100-8-42h RIB537 14160 RIB949 2017-Washington-T2 SD016 2017-Washington-T5 SD022 3-042-30h SD035 3-042-36h SD039 3-042-42h SD061 A1 SD24 A9 SD45 AF36 SD59 AR018 SL005 AR028 SL01 Afla-Guard SL015 Aor-06 SL034 Aor-17 SL041 Aor-34 SL044 Aor-38 SL055 BP2-1 SL08 CA14 SL46 CF1 SU-16 CF2 SW1 CF3 TK-1 E1402 TK-10 E1404 TK-11 E1406 TK-12 E1445 TK-13 HK1 TK-14 K54A TK-15 K93210 TK-2 M2040 TK-20 MRI19 TK-24 MWA1 TK-26 MWA2 TK-4 MWA3 TK-5 MWB1 TK-59 MWB2 TK-60 MWB3 TK-7 MWC1 TK-9 MWC2 Tox4 MWC3 VCG1 MWX1 WRRL1519 MWX2 Yazoo-S2
NC_054691.1 59 . G T 34.64 PASS BaseQRankSum=0.088;ExcessHet=3.0103;FS=4.506;MQ=60;MQRankSum=0;QD=2.16;ReadPosRankSum=-1.512;SOR=0.16;DP=23;AF=0.5;MLEAC=1;MLEAF=0.5;AN=2;AC=1 GT:AD:DP:GQ:PGT:PID:PL:PS ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. 0|1:14,2:21:42:0|1:59_G_T:42,0,576:59 ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.
Which error?
It's hard to give an exact error name because most of them are syntax errors or value errors. For example, when I run a script that performs PCA on SNP data from a VCF file (https://rpubs.com/madisondougherty/980777), it fails, apparently because a value comes out that the calculation cannot handle. (Error in apply(x, 2, sd, na.rm = TRUE) : dim(X) must have a positive length)
I know it's hard to spot the faulty part, but do you see anything strange in the INFO fields (DP, AC, ExcessHet, and so on) or elsewhere?
That's not a VCF parsing error.
Run your VCF file through bcftools view. If it passes through that, then your VCF is likely valid. It may not contain information that some other tools want, but that is a different problem altogether.
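To see what information a record actually carries, a minimal Python sketch like the one below may help. It assumes a plain tab-separated VCF data line with a GT key in FORMAT; the helper names parse_info and missing_fraction are made up for illustration, and the record is a shortened three-sample stand-in for the line posted above, not the real file.

```python
def parse_info(info):
    """Turn 'DP=23;AF=0.5;...' into a dict; flag-style keys map to True."""
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def missing_fraction(record_line):
    """Return (info_dict, n_missing, n_samples) for one VCF data line."""
    fields = record_line.rstrip("\n").split("\t")
    info = parse_info(fields[7])           # column 8 is INFO
    fmt_keys = fields[8].split(":")        # column 9 is FORMAT
    gt_index = fmt_keys.index("GT")
    samples = fields[9:]                   # remaining columns are samples
    n_missing = 0
    for sample in samples:
        gt = sample.split(":")[gt_index]
        alleles = gt.replace("|", "/").split("/")
        if all(a == "." for a in alleles):  # './.' or '.' means no call
            n_missing += 1
    return info, n_missing, len(samples)


# Shortened, hypothetical stand-in for the posted record (3 samples only).
record = "\t".join([
    "NC_054691.1", "59", ".", "G", "T", "34.64", "PASS",
    "DP=23;AF=0.5;AN=2;AC=1",
    "GT:AD:DP:GQ:PGT:PID:PL:PS",
    "./.:.:.:.:.:.:.:.",
    "0|1:14,2:21:42:0|1:59_G_T:42,0,576:59",
    "./.:.:.:.:.:.:.:.",
])
info, n_missing, n_samples = missing_fraction(record)
print(info["AN"], n_missing, n_samples)
```

In the posted record, AN=2 and only one sample has a called genotype, so almost every sample is ./. at that site. A tool that drops sites or samples with missing data could then end up with an empty matrix, which might explain errors like the one from the PCA script. That is a guess, not a confirmed diagnosis.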