With a task like this, I think one will have to pay the piper in some way. Creating thousands of single-sample files is inherently I/O intensive.
BTW, because I was only interested in het sites for my analysis, I was able to use over 40,000 simultaneous open files in vcf-split without a problem on my workstation. If I were keeping every site I'd want to limit it to under 10k.
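The ceiling on simultaneous output files is the per-process file-descriptor limit, which is worth checking before deciding how many samples to split in one pass. A minimal shell check (the 45000 figure in the comment is just an illustrative value, not a recommendation):

```shell
# Each simultaneously open single-sample output file consumes one file
# descriptor, so inspect the per-process soft limit first.
ulimit -n
# To raise the soft limit for this shell (it cannot exceed the hard limit,
# shown by `ulimit -Hn`; the kernel-wide maximum may also need tuning):
# ulimit -n 45000
```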
For anyone unfamiliar, Unix sort commands limit memory use on large files by sorting small sections in memory, saving each sorted run to a temporary file, and merging the runs. If the input is huge, there are many temp files, and writing and merging them becomes very I/O intensive. You can mitigate this to some extent with sort --buffer-size, telling it to use larger in-memory runs and hence fewer temp files.
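For example, with a toy input file (--buffer-size and -T are GNU sort options; on FreeBSD the GNU version is typically installed as gsort, which is the process name in the top output below):

```shell
# Toy input; with real data this could be billions of lines.
printf '3\n1\n2\n' > demo.txt
# A larger --buffer-size means longer in-memory sorted runs and therefore
# fewer temporary files to merge; -T puts those temp files on fast scratch.
sort --buffer-size=64M -T /tmp demo.txt > demo-sorted.txt
cat demo-sorted.txt
```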
I'm not sure a better strategy exists for files with relatively few calls and more than a few thousand samples.
One other thing I noticed but forgot to mention: running on a VCF with 10,000 samples, the perl script was actually CPU-bound during the first step (separating the fixed fields and reshaping the samples into a linear, one-sample-per-line format). Perl CPU time is the bottleneck there, while disk I/O is minimal. I haven't profiled the split step, but I suspect it's fine, since it's simple and writes one file at a time.
FreeBSD barracuda.uits bacon ~/Barracuda/TOPMed/phased 1005: top
  PID USERNAME  THR PRI NICE  SIZE   RES STATE  C   TIME   WCPU COMMAND
12053 bacon       1 103    0   39M   28M CPU0   0 559:53 99.80% perl
12055 bacon       1  23    0   53M   40M pipdwt 1  43:25  7.62% bcftools
12054 bacon       1  20    0   27M   11M piperd 0   1:25  0.33% gsort
17792 bacon       1  20    0   13M 3356K CPU1   1   0:00  0.04% top
FreeBSD barracuda.uits bacon ~/Barracuda/TOPMed/phased 1006: zpool iostat 1
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
zroot       9.41T  1.47T     42     33   298K   347K
zroot       9.41T  1.47T      0      0      0      0
zroot       9.41T  1.47T      0      0      0      0
zroot       9.41T  1.47T      0    671      0  8.45M
zroot       9.41T  1.47T      0      0   103K      0
zroot       9.41T  1.47T      0      0      0      0
zroot       9.41T  1.47T      0      0      0      0
You might be able to speed it up by reimplementing it in C; biolibc and libxtend can help with this. I may add it as an alternative in vcf-split (with proper credit to Jorge) so users can pick the approach that best suits their data.
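To make the CPU-bound first step concrete, here is a toy sketch of the reshape in awk (an illustration only, not the actual perl code or the hypothetical C port): each wide VCF body line is fanned out to one line per sample, which can then be sorted or split per sample. Column positions assume a standard VCF body, with fixed fields in columns 1-9 and samples from column 10 on.

```shell
# Two-sample toy VCF body line (tab-separated, header omitted for brevity).
printf 'chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1\n' > body.txt
# Fan each site out to one line per sample:
# sample-index, CHROM, POS, REF, ALT, genotype.
awk -F'\t' '{ for (i = 10; i <= NF; i++) print i - 9, $1, $2, $4, $5, $i }' \
    OFS='\t' body.txt > linear.txt
cat linear.txt
```

In this long format, a single external sort on the sample-index column groups each sample's calls together, so the split step can write one output file at a time.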
The code I posted is from the suggested thread, and it does not work; also, my files were not generated using GATK.