Hi,
I'm trying to submit a job on the TOPMed/Michigan Imputation Server, but it returns an error saying that I need to split my VCF by chromosome.
Is there an easy way to do this? Will bcftools help?
bcftools index -s in.vcf.gz | cut -f 1 | while read C; do bcftools view -O z -o split.${C}.vcf.gz in.vcf.gz "${C}" ; done
See the explanation below the code
vcf_in=/scratch/Databases/GTEx/downloaded/dbGAP/GTEx_Analysis_2017-06-05_v8_WholeExomeSeq_979Indiv_VEP_annot.vcf.gz
vcf_out_stem=/scratch/Databases/GTEx/downloaded/dbGAP/by_chrom
for i in {1..22}
do
bcftools view ${vcf_in} --regions ${i} -o ${vcf_out_stem}_${i}.vcf.gz -Oz
done
Set vcf_in to the path of your input file. Here, I assumed that your input file was /scratch/Databases/GTEx/downloaded/dbGAP/GTEx_Analysis_2017-06-05_v8_WholeExomeSeq_979Indiv_VEP_annot.vcf.gz.
Set vcf_out_stem to the path of the new files. vcf_out_stem should only include the portion of the path and file name leading up to the chromosome number, assuming you want the chromosome number in the file name. Here, I assumed that it would have the same path as the input VCF.
{1..22} corresponds to the chromosomes you want to subset. The loop runs once per chromosome; each time, it takes your input file and subsets the chromosome using --regions ${i}. If your chromosome naming convention is different, you need to change the region to match. For example, if the chromosomes in your input file are named chr1, chr2, and so on, use --regions chr${i}.
-o ${vcf_out_stem}_${i}.vcf.gz gives the new file name by taking whatever you defined for vcf_out_stem and adding the chromosome number from i.
To stick with the for loop, understand that {1..5} (for example) simply expands to 1 2 3 4 5, which means you can always add more space-separated values, like {1..22} X Y MT:
for i in {1..22} X Y MT
do
bcftools view ${vcf_in} --regions ${i} -o ${vcf_out_stem}_${i}.vcf.gz -Oz
done
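If your file uses a different naming convention, it can help to build the list of region names in one place rather than editing the loop body. Below is a minimal sketch; region_names is a hypothetical helper name, and it uses seq instead of {1..22} only so that it also runs in shells without brace expansion.

```shell
# Hypothetical helper: print one region name per line for chromosomes 1-22
# plus any extra names passed as arguments.
# $1 is the prefix: "" for 1, 2, ... or "chr" for chr1, chr2, ...
region_names() {
    prefix="$1"
    shift
    for i in $(seq 1 22) "$@"; do
        printf '%s%s\n' "$prefix" "$i"
    done
}

# Example: the last three names for a "chr"-prefixed file that also has X and Y.
region_names chr X Y | tail -n 3
# prints:
# chr22
# chrX
# chrY
```

You could then loop over $(region_names chr X Y) and pass each name to --regions in place of ${i}. Check the names your file actually uses first, since the mitochondrial contig in particular is named differently across references.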
The code works well. It created vcf.gz files by chromosome. I'm using the Michigan Imputation Server. I uploaded all the files and ran the imputation.
During the input validation phase, I got the following error message.
No valid chromosomes found!
Any idea why it gives the error message?
Thanks, again
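The thread does not say what caused this. One thing worth checking (an assumption, not a confirmed diagnosis) is whether the chromosome names in the split files match what the server expects for your reference build, for example 1 versus chr1. If a rename turns out to be needed, bcftools annotate --rename-chrs takes a two-column map of old and new names. A minimal sketch, where make_chr_map and the file names are hypothetical:

```shell
# Hypothetical helper: write an "old new" map that adds a "chr" prefix
# to chromosomes 1-22, X and Y. Swap the two columns to strip the prefix instead.
make_chr_map() {
    out="$1"
    : > "$out"    # create or truncate the map file
    for i in $(seq 1 22) X Y; do
        printf '%s chr%s\n' "$i" "$i" >> "$out"
    done
}

# Usage, assuming bcftools is installed and in.vcf.gz is your file:
#   make_chr_map chr_map.txt
#   bcftools annotate --rename-chrs chr_map.txt -O z -o renamed.vcf.gz in.vcf.gz
```

Which direction to rename in depends on the reference panel and build you selected, so check the server's documentation before changing anything.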
omg thanks!!
Hello,
I tried to do this
but it throws an error:
Of course it doesn't work. You should have a look at my command line and _understand_ how it works.
You should _explain_ your command
I tried to do this, but it's throwing an error:
This will work!
Why a screenshot when you can just copy and paste the text?
There are nicer ways of saying that
sorry for my French
This is just wrong. Why do you use two loops when all the chromosomes are available in bcftools index -s?
I also tried this, but I am getting the following error:
I saw elsewhere bcftools might be bugged (https://github.com/samtools/bcftools/issues/881)? Or am I doing this wrong? The workarounds seem a bit too involved for me to understand.
The command
do bcftools view -O z -o split.${C}.vcf.gz in.vcf.gz "${C}"
works great. I think it's just something with bcftools index.
For more context, I had some Illumina genotyping array data (.bed, .bim, .fam, .map, .ped) that I converted to .vcf using Plink (and then bgzip). Maybe that's why I'm losing the index somewhere?
Thanks for the help.
You need to run bcftools index on your vcf file before running the suggested command.
Or the more popular:
tabix -p vcf <vcf_file>
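Putting the indexing step and the split together, a rough sketch could look like this. It assumes bcftools is on your PATH and that the input is bgzip-compressed; split_by_contig is a hypothetical helper name, not part of bcftools.

```shell
# Hypothetical helper: index a bgzipped VCF, then write one file per contig.
# $1 = input VCF (e.g. in.vcf.gz), $2 = output prefix (e.g. split)
split_by_contig() {
    vcf="$1"
    prefix="$2"

    # Build (or rebuild, with -f) the index; "index -s" and region queries need it.
    bcftools index -f "$vcf" || return 1

    # "index -s" prints one line per contig; the first column is the contig name.
    bcftools index -s "$vcf" | cut -f 1 | while read -r C; do
        bcftools view -O z -o "${prefix}.${C}.vcf.gz" "$vcf" "$C"
    done
}

# Usage:
#   split_by_contig in.vcf.gz split
```

Because the contig names come from the index itself, this does not depend on whether the file uses 1 or chr1.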