Hello,
I started imputation with IMPUTE2. I know it is an old version, but I have come this far and would like to finish with it. Now, after converting the IMPUTE2 output to VCF files using SHAPEIT, I need to piece the chromosome chunks back together.
I am using bcftools to merge all the data (my goal is to merge all the VCF files, import the result into PLINK for QC, and then build an IBS matrix):
$BCFTOOLS merge --merge all vcf_chr1_chunk1.vcf.gz vcf_chr1_chunk2.vcf.gz vcf_chr1_chunk3.vcf.gz -O v > "$results_merged_vcf/Chromosome1.imputed.vcf"
However, bcftools asks for an index for each file, which I tried to create with
tabix -p vcf file.vcf.gz
I cannot index the files because they need to be sorted by chromosome. When I try to sort them, the program reports that it cannot parse the line starting with [--- 45000037].
After converting the IMPUTE2 output to VCF, the imputed variants in my VCF files look like this:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1610 2318 2421
--- 45000037 rs140328665:45000037:G:A G A . PASS . GT 0|0
--- 45002242 1:45002242:G:A G A . PASS . GT 0|0
So here are my questions:
- Should I write code to convert every '---' to the chromosome number for each chunk?
For cases like the one below:
--- 45002242 1:45002242:G:A G A . PASS . GT 0|0
Should I manually reformat the ID column so that it contains just the variant ID?
Should I loop over all the files and remove them, or is there a way to avoid them from the beginning?
Is there any software that does this step for me, to save time?
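To make the first two questions concrete, this is roughly the kind of script I have in mind. It is only a minimal sketch: `fix_vcf_chrom` is a name I made up, it assumes the real VCF is tab-delimited with header lines starting with '#', and it assumes each chunk contains a single chromosome whose number I pass in myself.

```shell
# Hypothetical helper: reads a VCF on stdin and writes the fixed VCF to stdout.
# $1 is the chromosome label to put in place of '---' (for example 1).
fix_vcf_chrom() {
  awk -v chr="$1" 'BEGIN { FS = OFS = "\t" }
    /^#/ { print; next }              # pass header lines through unchanged
    {
      if ($1 == "---") $1 = chr       # fill in the missing chromosome
      split($3, id, ":")              # ID looks like rs140328665:45000037:G:A
      if (id[1] ~ /^rs/) $3 = id[1]   # keep only the rsID when there is one
      print
    }'
}
```

It would be run per chunk, for example `zcat vcf_chr1_chunk1.vcf.gz | fix_vcf_chrom 1 | bgzip > vcf_chr1_chunk1.fixed.vcf.gz` (the output file name is just an example). IDs without an rs number, such as 1:45002242:G:A, are left as they are.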
Thank you for your help in advance
How did you convert the IMPUTE2 output to VCF? It seems odd that the converter would simply discard the chromosome identifier. My suggestion would be to use a different conversion tool (plink / qctool) that retains the chromosome identifier when writing the VCF. That way you don't have to write any custom scripts to fix it yourself.
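On piecing the chunks together: bcftools merge is meant for combining files that contain different samples, whereas chunks of one chromosome for the same samples need to be concatenated (bcftools concat is, as far as I know, the purpose-built command for that). As a rough illustration of what concatenation means, here is a minimal sketch. `concat_vcf_chunks` is a made-up name, and it assumes uncompressed VCF chunks that share the same header and are passed in position order.

```shell
# Hypothetical helper: concatenate plain-text VCF chunks given in position order.
# Prints the header of the first file only, then the records of every file.
concat_vcf_chunks() {
  awk 'FNR == 1 { nfile++ }                  # track which input file we are in
       /^#/ { if (nfile == 1) print; next }  # keep header lines from file 1 only
       { print }' "$@"
}
```

Usage would look like `concat_vcf_chunks chunk1.vcf chunk2.vcf chunk3.vcf > Chromosome1.imputed.vcf`, where the chunk names are placeholders for the decompressed files.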
I used SHAPEIT to convert from IMPUTE2 format to VCF. There was no chromosome number to begin with, because these variants are imputed. This is how they look in the IMPUTE2 output:
--- 1:35000209:G:C 35000209 G C 0 0 0
--- rs75886048:35000218:C:A 35000218 C A 1 0
--- 1:35000252:T:G 35000252 T G 0 0 0 0
and this is the output format in the VCF:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
--- 45000037 rs140328665:45000037:G:A G A . PASS . GT 0|0
--- 45002242 1:45002242:G:A G A . PASS . GT 0|0
QCTOOL cannot read the '---' entries, so it doesn't work. Do you mean that this VCF is not the correct output from SHAPEIT2?
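One thing I could try in the meantime is filling in the first column of the IMPUTE2 output before converting, so the '---' never reaches the VCF. A minimal sketch, assuming the file is space-delimited, that each chunk holds one chromosome, and that the converter copies the first column into CHROM (which is what the examples above suggest). `fix_impute2_chrom` is a name I made up.

```shell
# Hypothetical helper: reads IMPUTE2 output on stdin and writes it to stdout
# with '---' in the first column replaced by the chromosome given as $1.
fix_impute2_chrom() {
  awk -v chr="$1" '$1 == "---" { $1 = chr }   # only touch imputed rows
                   { print }'
}
```

For example, `fix_impute2_chrom 1 < chunk1.impute2 > chunk1.chr1.impute2`, where both file names are placeholders.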
Thanks