Hi
I am trying to validate my VCF using GATK ValidateVariants:
gatk ValidateVariants \
-R Hg19.fasta \
-V VCF.vcf \
--validation-type-to-exclude ALL
But I am receiving the following error message:
A USER ERROR has occurred: Input files reference and features have incompatible contigs: Found contigs with the same name but different lengths:
contig reference = 1 / 249250621
contig features = 1 / 249212879.
reference contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1, NC_007605, hs37d5]
features contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y]
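GATK only names the first mismatching contig, so it can help to list every contig whose header length differs from the reference. Below is a minimal sketch with awk, assuming an uncompressed VCF and a FASTA index (.fai) for the reference. The function name and the demo files are illustrative only; the two lengths for contig 1 are the ones quoted in the error message above.

```shell
# Sketch: report contigs whose length in a VCF header differs from a FASTA index.
# compare_contig_lengths and the demo files are illustrative names, not from the thread.
compare_contig_lengths() {
    # $1 = VCF file, $2 = FASTA index (.fai; column 1 = name, column 2 = length)
    awk -F'\t' '
        NR == FNR { ref[$1] = $2; next }          # first file: the .fai
        !/^#/ { exit }                            # VCF: stop at the first record
        /^##contig=<ID=.*length=/ {
            id = $0;  sub(/^##contig=<ID=/, "", id); sub(/[,>].*/, "", id)
            len = $0; sub(/.*length=/, "", len);     sub(/[,>].*/, "", len)
            if ((id in ref) && ref[id] != len)
                printf "%s\t%s\t%s\n", id, len, ref[id]
        }
    ' "$2" "$1"
}

# Tiny demo using the two lengths quoted in the error message for contig 1.
demo_dir=$(mktemp -d)
# .fai columns 3-5 (offset, line bases, line width) are dummy values here.
printf '1\t249250621\t52\t60\t61\n' > "$demo_dir/demo.fai"
printf '%s\n' '##fileformat=VCFv4.2' '##contig=<ID=1,length=249212879>' > "$demo_dir/demo.vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$demo_dir/demo.vcf"

# Prints: contig, VCF header length, reference length (tab-separated)
compare_contig_lengths "$demo_dir/demo.vcf" "$demo_dir/demo.fai"
```

Each output line is one mismatching contig; no output would mean every shared contig has the same length in both files.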
I am using Hg19 downloaded from GATK as a reference. The VCF comes from a test dataset I was using in plink; after performing some QC, I used the --recode flag to output VCF.
The workflow: export SNP array data from GenomeStudio in plink format --> perform QC within plink --> export to VCF format --> validate the VCF for further downstream analysis.
My plink code was as follows (note: this is just testing the commands):
make bed files
plink --file Plink \
--pheno Phenot --pheno-name Surgery \
--make-bed --out PlinkP
QC
plink --bfile PlinkP \
--geno 0.05 --mind 0.05 --maf 0.01 --hwe 0.001 \
--make-bed --out out/Pass_QC
recode to VCF
plink --bfile out/Pass_QC \
--snps-only just-acgt --recode vcf \
--make-bed --output-chr MT --out out/VCF
view
cat VCF.vcf | grep "^#"
gives this:
##fileformat=VCFv4.2
##fileDate=
##source=PLINKv1.90
##contig=<ID=1,length=249212879>
##contig=<ID=2,length=243041412>
... other chromosomes ...
##contig=<ID=22,length=51214797>
##contig=<ID=X,length=154854339>
##contig=<ID=Y,length=28650344>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
If I don't perform the QC and convert straight to VCF:
plink --bfile PlinkP \
--snps-only just-acgt --recode vcf \
--make-bed --output-chr MT --out NoQC/VCF
gives:
##fileformat=VCFv4.2
##fileDate=
##source=PLINKv1.90
##contig=<ID=1,length=249222528>
##contig=<ID=2,length=243041412>
... other chromosomes ...
##contig=<ID=22,length=51214797>
##contig=<ID=X,length=154854339>
##contig=<ID=Y,length=58856970>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
Even with the immediate VCF conversion, the contig lengths for chromosomes 1 and Y change depending on whether QC was run, and chromosome 1 is still shorter than in the reference genome used for the GATK validation (Hg19). Does anyone know the reason for this? Even if the reference weren't Hg19, the contig lengths still do not match those of other references.
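Since the header lengths change when QC removes variants, one thing worth checking is whether they simply track the last variant position on each chromosome rather than a reference. The sketch below prints both numbers side by side for an uncompressed VCF. The function name and the demo records are made up for illustration; the demo assumes the header length is one more than the last position, which is only a guess about what plink 1.9 writes and is not confirmed here.

```shell
# Sketch: for each ##contig line, show the header length next to the last variant position.
# contig_length_vs_last_pos and the demo data are illustrative, not from the thread.
contig_length_vs_last_pos() {
    # $1 = uncompressed VCF file
    awk -F'\t' '
        /^##contig=<ID=.*length=/ {
            id = $0;  sub(/^##contig=<ID=/, "", id); sub(/[,>].*/, "", id)
            len = $0; sub(/.*length=/, "", len);     sub(/[,>].*/, "", len)
            hdr[id] = len; order[++n] = id
            next
        }
        /^#/ { next }                                   # skip remaining header lines
        ($2 + 0) > (last[$1] + 0) { last[$1] = $2 + 0 } # track the highest POS per contig
        END {
            for (i = 1; i <= n; i++)
                printf "%s\t%d\t%d\n", order[i], hdr[order[i]], last[order[i]]
        }
    ' "$1"
}

# Invented demo VCF: header lengths are set to last position + 1 (an assumption).
demo_dir=$(mktemp -d)
{
    printf '%s\n' '##fileformat=VCFv4.2' '##contig=<ID=1,length=1001>' '##contig=<ID=2,length=801>'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '1\t500\trs_a\tA\tG\t.\t.\tPR\n'
    printf '1\t1000\trs_b\tC\tT\t.\t.\tPR\n'
    printf '2\t800\trs_c\tG\tA\t.\t.\tPR\n'
} > "$demo_dir/demo.vcf"

# Prints: contig, header length, highest POS seen (tab-separated)
contig_length_vs_last_pos "$demo_dir/demo.vcf"
```

If the two columns agree (or differ by a constant) on the real file, the header lengths are derived from the data and will never equal the true chromosome lengths in the reference.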
Also, is there a way to supply the reference genome during VCF conversion in plink, so that the real reference is used instead of this statement: "Provisional reference allele, may not be based on real reference genome"?
Any advice would be greatly appreciated.
Thanks!
Hi
Thanks for your reply.
Using --fa as you said:
gives:
This doesn't seem to change any of the output log information, and the contig lengths are very similar (1 less for each chromosome):
I found that if I use --ref-from-fa after --fa:
gives
but again no change to contig lengths from above:
Do you have any idea why the contigs are still not changing?
The files we are using don't appear to be corrupt or anything.
Again, thanks for your help!
Please post the full .log file, including the very first line with the version string.
Hi, here it is below:
Note that my first response said "April 2022 or newer".
Hi
Thanks for your help, this seems to work now.
I hadn't realised:
hadn't given me the most up-to-date version, so I took it straight from the cog-genomics site.
Thanks :)