Easy way to split VCF file by chromosome
2
4
Entering edit mode
2.9 years ago

Hi,

Im trying to submit a job on the TOPMED/Michigan imputation server, but it returns an error saying that I need to split my VCF by chromosome

Is there an easy way to do this? Will bcftools help?

sequence TOPMED chromosome Imputation SNP • 19k views
ADD COMMENT
10
Entering edit mode
2.9 years ago
bcftools index -s in.vcf.gz | cut -f 1 | while read C; do bcftools view -O z -o split.${C}.vcf.gz in.vcf.gz "${C}" ; done
ADD COMMENT
0
Entering edit mode

omg thanks!!

ADD REPLY
0
Entering edit mode

Hello,

I tried to do this

bcftools index -s /scratch/Databases/GTEx/downloaded/dbGAP/GTEx_Analysis_2017-06-05_v8_WholeExomeSeq_979Indiv_VEP_annot.vcf.gz | cut -f 1 | while read C; do bcftools view -O z -o split.${C}.vcf.gz in.vcf.gz "${C}" ; done

but it throws an error:

[E::hts_open_format] Failed to open file in.vcf.gz
Failed to open in.vcf.gz: No such file or directory
ADD REPLY
1
Entering edit mode

of course it doesn't work. You should have a look at my command line and _understand_ how it works.

ADD REPLY
1
Entering edit mode

You should _explain_ your command

ADD REPLY
0
Entering edit mode
#!/bin/bash
source ~/miniconda3/bin/activate
conda activate train
cd /scratch/downloaded/dbGAP/
for C in {1 .. 22}
do
  bcftools index -s in.vcf.gz | cut -f 1 | while read C
  do
    bcftools view -O z -o /home/data_download/split.${C}.vcf.gz in.vcf.gz "${C}"
  done

I tried to do this but its throwing error:

line 9: syntax error: unexpected end of file
ADD REPLY
0
Entering edit mode

This will Work!enter image description here

ADD REPLY
0
Entering edit mode

why a screenshot when you can just copy and paste the text ?

ADD REPLY
0
Entering edit mode

There are nicer ways of saying that

ADD REPLY
0
Entering edit mode

sorry for my French

ADD REPLY
0
Entering edit mode

This is just wrong. Why do you use two loops when all the chromosomes are available in bcftools index -s

ADD REPLY
0
Entering edit mode

I also tried this, but I am getting the following error:

[E::idx_find_and_load] Could not retrieve index file for 'in.vcf.gz'
Could not load index for VCF: in.vcf.gz

I saw elsewhere bcftools might be bugged (https://github.com/samtools/bcftools/issues/881)? Or am I doing this wrong? The workarounds seem a bit too involved for me to understand.

The command do bcftools view -O z -o split.${C}.vcf.gz in.vcf.gz "${C}" works great. I think it's just something with bcftools index.

For more context, I had some Illumina genotyping array data (.bed, .bim, .fam, .map, .ped) that I converted to .vcf using Plink (and then bgzip). Maybe that's why I'm losing the index somewhere?

Thanks for the help.

ADD REPLY
0
Entering edit mode

You need to run bcftools index on your vcf file before running the suggested command.

ADD REPLY
0
Entering edit mode

Or the more popular tabix -p vcf <vcf_file>.

ADD REPLY
3
Entering edit mode
2.4 years ago
LTDavid ▴ 50

See the explanation below the code

vcf_in=/scratch/Databases/GTEx/downloaded/dbGAP/GTEx_Analysis_2017-06-05_v8_WholeExomeSeq_979Indiv_VEP_annot.vcf.gz

vcf_out_stem=/scratch/Databases/GTEx/downloaded/dbGAP/by_chrom

for i in {1..22}
do
bcftools view ${vcf_in} --regions ${i} -o ${vcf_out_stem}_${i}.vcf.gz -Oz
done
  1. First, define vcf_in using your path to your input file. Here, I assumed that your input file was /scratch/Databases/GTEx/downloaded/dbGAP/GTEx_Analysis_2017-06-05_v8_WholeExomeSeq_979Indiv_VEP_annot.vcf.gz.
  2. Deine vcf_out_stem using the path of the new file. vcf_out_stem should only include the portion of the path and file name leading up to the chromosome number, assuming you want the chromosome number in the path name. Here, I assumed that it would have the same path as the input vcf.
  3. You don't need to make changes to the rest of the code. The for line means that it's going to repeat the code 22 times. The range {1..22} corresponds to the chromosomes you want to subset. This is going to loop through; each time, it's going to take your input file and subset the chromosome using --regions ${i}. If your naming convention for your chromosome is different, you may need to change it to that. For example, if your chromosome name in your input put is chr1, then you may need to use --regions chr${i}. And then -o ${vcf_out_stem}_${i}.vcf.gz is going to give you the new file name by using whatever you defined for vcf_out_stem and adding on the chromosome number using i.
ADD COMMENT
0
Entering edit mode

LTDavid:

My data have chromosomes 1..22, X, Y, and MT.

How do I modify the syntax for i in {1..22}

Thanks

ADD REPLY
2
Entering edit mode

To stick to the for, understand that {1..5} (for example) simply expands to 1 2 3 4 5, which means you can always add more space-separated values like {1..22} X Y MT.

for i in {1..22} X Y MT
do
  bcftools view ${vcf_in} --regions ${i} -o ${vcf_out_stem}_${i}.vcf.gz -Oz
done
ADD REPLY
0
Entering edit mode

Thank you! I will let you know.

ADD REPLY
0
Entering edit mode

The code works good. It created vcf.gz files by chromosome. I'm using Michigan Imputation Server. I uploaded all the files and run the imputation.

During the input validation phase, I got the following error message.

No valid chromosomes found!

Any idea why it gives the error message?

Thanks, again

ADD REPLY
1
Entering edit mode

it's unrelated to the original question.

ADD REPLY
0
Entering edit mode

Yes, it is but the next step after my earlier question. Shall I remove it and repost it on its own?

ADD REPLY
0
Entering edit mode

Hi Ram, could you please comment on this too? I want to try convert 1..22 to chr1..chr22 and run the imputation. How do I modify the above syntax? I really appreciate your help.

ADD REPLY
0
Entering edit mode

If you want to use chr1...chr22, change the command text like so:

for i in {1..22} X Y MT
do
  bcftools view ${vcf_in} --regions chr${i} -o ${vcf_out_stem}_chr${i}.vcf.gz -Oz
done

Just be aware that you'll probably need to use M and not MT as MT becomes chrM.

ADD REPLY
0
Entering edit mode

Thank you very much, Ram.

ADD REPLY
0
Entering edit mode

Ram, can you look at the syntax again? Something is not right. All the separate files have the same size which are probably close to empty. Thanks.

ADD REPLY
0
Entering edit mode

Run the bcftools command without the loop just for chr22 and see the results. If things look OK, add an echo before the bcftools command in the loop and check that the echo'd commands look fine. This is a great opportunity for you to get debugging.

ADD REPLY
0
Entering edit mode
(seq 1 22  && echo X Y MT) | while read i; do ....
ADD REPLY
0
Entering edit mode

Please do not add answers unless you're answering the top level question. Use Add Comment or Add Reply as appropriate. I've moved your post to a comment this time.

ADD REPLY

Login before adding your answer.

Traffic: 1321 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6