While Pierrre's code is an efficient solution, it should be mentioned that it is not pipefail-safe because of the use of head
.
Therefore, this will fail if you want to use it in a strict bash setting or inside Snakemake without further error (return code 141).
To make it run, either set +o pipefail
immediately before the code or replace:
... | sort -t $'\t' -T . -k1,1g | head -n 200 ...
with
... | (sort -t $'\t' -T . -k1,1g || true) | head -n 200 ...
Or inside a snakemake rule:
rule subsample_vcf:
input: in.vcf
output: out.vcf
params:
nsnps=200
shell:
"""
bcftools view --header-only --samples-file samples.txt {input} > {output}.unsorted
bcftools view --samples-file samples.txt --no-header {input} |\
awk '{{ printf("%f\t%s\n",rand(),$0);}}' | ( sort -t $'\t' -T . -k1,1g || true) |\
head -n {params.nsnps} | cut -f 2- >> {output}.unsorted
bcftools sort -o {output} {output}.unsorted ## ensure output is sorted by position
## sorting requires header, we need the intermediate file
rm {output}.unsorted
"""
Edit: added the sorting step.
(note the extra space before print which I found is necessary or you get a syntax error)
This is very helpful, and exactly what I was looking for. Is there any chance the subsetting of variants could be done with a known tool such as
gatk SelectVariants
? so as to avoid using awk, sort, etc.? Or would you say that the awk+sort approach is always safe?They have been around for 50 years. I think they're safe.
Pierre Lindenbaum Thanks again for this excellent answer! I've tested this extensively and does exactly what I need. Though I'm still a bit struggling to understand some parts of it and the rationale behind it. For example, why does
awk '{printf("%f\t%s\n",rand(),$0);}'
print the same numbers in the first column every time I run it on the same file? I understandshuf
is a bad solution for choosing variants because it loads the entire file into memory, but couldsort -R
or usingecho $line $RANDOM
achieve the same result or be faster? Currently downsampling my 30GB+ VCF file to 50 variants takes about 1 hour.'{printf("%f\t%s\n",rand(),$0);}' appends a random number before each VCF line.
Look at tabix to (greatly) speed such queries
Without wanting to be picky, I noticed that the VCF will no longer be sorted by position after the subsampling. So another pipe through
bcftools sort
may be necessary.