I have been stuck on a problem for two days while uploading data to the Michigan Imputation Server. Any help is much appreciated!
I received .bgen files from 23andMe; a single .bgen file contains all the participants and their genotype data. Following the Michigan Imputation Server guidelines, I converted the .bgen file to a .vcf file with qctool, using the command:
$ qctool -g example.bgen -og example.vcf
Then I followed these steps (using bgzip and tabix) so that the data could be uploaded to the server:
# compress vcf to gz
bgzip -c ${1}.vcf > ${1}.vcf.gz
# make tabix index
tabix -p vcf ${1}.vcf.gz
# split into 22 separate chromosomes (output names are zero-padded: chr01 ... chr22)
for chr in $(seq 1 22); do
    tabix -h "${1}.vcf.gz" "${chr}" > "${1}.chr$(printf '%02d' "${chr}").vcf"
done
# create gz files for each chromosome
for chr in $(seq -w 1 22); do
    bgzip -c "${1}.chr${chr}.vcf" > "${1}.chr${chr}.vcf.gz"
done
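Before splitting, it may be worth confirming how the chromosomes are named in the VCF, because tabix only returns records when the region name matches the CHROM column exactly (for example `1` versus `01` or `chr1`). A minimal sketch using awk and sort; the function name and the demo records are made up for illustration:

```shell
# Hypothetical helper: count data records per CHROM value in an uncompressed VCF.
count_records_per_chrom() {
    awk -F '\t' '!/^#/ { n[$1]++ } END { for (c in n) print c "\t" n[c] }' "$1" | sort
}

# Made-up demo file with three records on two chromosomes.
demo=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$demo"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n' >> "$demo"
printf '01\t100\trs1\tA\tG\t.\t.\t.\tGT\t0/1\n' >> "$demo"
printf '01\t200\trs2\tC\tT\t.\t.\t.\tGT\t0/0\n' >> "$demo"
printf '02\t150\trs3\tG\tA\t.\t.\t.\tGT\t1/1\n' >> "$demo"

count_records_per_chrom "$demo"
```

If the names printed here differ from the region names passed to tabix, the per-chromosome files will contain only the header.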
Then I uploaded the bgzipped files to the server and got a malformed-header error:
Unable to parse header with error: Your input file has a malformed header: Unexpected tag Type in line , for input source: /data3/imputation-server/workspace/job-20190822-200703-201/input/files/64ba01fa-b382-4b48-80b7-fdced5a84e11.vcf (see Help).
I understand that the header is malformed. Is it due to the absence of a .sample file (which contains header information) while I was converting .bgen to .vcf format using qctool, or to something else?
It would be really appreciated if you could tell me a way around!
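One possible cause of an "Unexpected tag Type" message (an assumption, not confirmed anywhere in this thread) is an INFO or FORMAT header line whose keys are not in the order ID, Number, Type, Description, which some parsers reject. A rough sketch that lists such lines using only grep and awk; the function name and the demo header are hypothetical:

```shell
# Hypothetical helper: print ##INFO/##FORMAT header lines in which
# "Type=" appears before "Number=" (or "Number=" is absent).
find_misordered_header_lines() {
    grep -E '^##(INFO|FORMAT)=' "$1" | awk '
        {
            t = index($0, "Type=")
            n = index($0, "Number=")
            if (t > 0 && (n == 0 || t < n)) print
        }'
}

# Made-up demo header: the GT line has Type before Number, the GP line does not.
demo_header=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$demo_header"
printf '##FORMAT=<ID=GT,Type=String,Number=1,Description="Genotype">\n' >> "$demo_header"
printf '##FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype probabilities">\n' >> "$demo_header"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n' >> "$demo_header"

find_misordered_header_lines "$demo_header"
```

If the function prints nothing for the real file, the key order is probably not the problem and the header needs a closer look.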
Please show us the header and some examples of the variants within the VCF file. Otherwise we can only guess.
Thanks!
fin swimmer
I created a link where you can see the .vcf file opened in bash and notepad for your reference: .vcf file in bash and notepad
Any help is much appreciated. Thanks!
Using a vcf-validator may help you pinpoint exactly what could be causing the error.
I used checkVCF to pinpoint the source of error: https://github.com/zhanxw/checkVCF
It showed the following errors:
Does this mean that the .bgen file I converted to .vcf was not in the right format? Do you know of any way to fix it?
I have no idea what a .bgen file is, nor what its format is like. However, there are only 3 errors (each of which is explained quite plainly). You can correct them manually easily enough. One line is duplicated, one has the improper number of columns based on the headers, and one is missing GT in the format field.
I am new to this! It would be really helpful if you could help me out with the errors!
It is best to learn by doing. We don't have the ability to scroll through your file. Based on what you've found, you know there's likely an issue with the header and maybe with certain records. Look at the VCF specs and ensure your file meets them (particularly the metadata/header sections).
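The three kinds of problems mentioned above (a duplicated site, a record whose column count differs from the header, and a FORMAT field without GT) can be located with a short awk pass. This is only a sketch, not a substitute for checkVCF or a full validator; the function name, the messages, and the demo records are made up:

```shell
# Hypothetical helper: report three kinds of problems in an uncompressed VCF.
check_vcf_records() {
    awk -F '\t' '
        /^##/     { next }               # skip meta-information lines
        /^#CHROM/ { ncol = NF; next }    # remember the header column count
        {
            key = $1 ":" $2
            if (key in seen) print "duplicate site: " key
            seen[key] = 1
            if (NF != ncol)
                print "wrong column count at " key ": " NF " vs " ncol
            else if ($9 !~ /(^|:)GT(:|$)/)
                print "no GT in FORMAT at " key
        }' "$1"
}

# Made-up demo file containing one example of each problem.
demo_vcf=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$demo_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n' >> "$demo_vcf"
printf '1\t100\trs1\tA\tG\t.\t.\t.\tGT\t0/1\n' >> "$demo_vcf"
printf '1\t100\trs1\tA\tG\t.\t.\t.\tGT\t0/1\n' >> "$demo_vcf"
printf '1\t200\trs2\tC\tT\t.\t.\t.\tGT\n' >> "$demo_vcf"
printf '1\t300\trs3\tG\tA\t.\t.\t.\tGP\t0,1,0\n' >> "$demo_vcf"

check_vcf_records "$demo_vcf"
```

Each message includes the chromosome and position, so the offending record can be looked up directly in the file.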
I was able to rectify the duplicated site error using:
I figured out that the previous commands had messed up my VCF header.
But I am still not able to solve the error:
I have defined the FORMAT field clearly, and it does not include GT. I am attaching a snippet of the file for your reference.