Randomly select a number of variants and samples from a multi-sample VCF file (pVCF)
2
2
Entering edit mode
23 months ago
' ▴ 330

I have a set of pVCF files that I intend to slice.

For example, I want to take 100 samples and 200 variants from the file. How can I do this?

All the tutorials I have come across require using a "list" of sample names and list of variants. But can I just do this randomly?

vcf pvcf • 3.7k views
ADD COMMENT
1
Entering edit mode
23 months ago

All the tutorials I have come across require using a "list" of sample names and list of variants. But can I just do this randomly?

random list of 100 samples:

bcftools query -l in.vcf | shuf | head -n 100 > samples.txt

and 200 variants

bcftools view --header-only --samples-file samples.txt  in.vcf > out.vcf
bcftools view --no-header --samples-file samples.txt  in.vcf | awk '{printf("%f\t%s\n",rand(),$0);}' | sort -t $'\t'  -T . -k1,1g | head -n 200 | cut -f 2- >> out.vcf
ADD COMMENT
1
Entering edit mode

This is very helpful, and exactly what I was looking for. Is there any chance the subsetting of variants could be done with a known tool such as gatk SelectVariants? so as to avoid using awk, sort, etc.? Or would you say that the awk+sort approach is always safe?

ADD REPLY
1
Entering edit mode

They have been around for 50 years. I think they're safe.

ADD REPLY
0
Entering edit mode

Pierre Lindenbaum Thanks again for this excellent answer! I've tested this extensively and does exactly what I need. Though I'm still a bit struggling to understand some parts of it and the rationale behind it. For example, why does awk '{printf("%f\t%s\n",rand(),$0);}' print the same numbers in the first column every time I run it on the same file? I understand shuf is a bad solution for choosing variants because it loads the entire file into memory, but could sort -R or using echo $line $RANDOM achieve the same result or be faster? Currently downsampling my 30GB+ VCF file to 50 variants takes about 1 hour.

ADD REPLY
0
Entering edit mode

'{printf("%f\t%s\n",rand(),$0);}' appends a random number before each VCF line.

ADD REPLY
0
Entering edit mode

Look at tabix to (greatly) speed such queries

ADD REPLY
0
Entering edit mode

Without wanting to be picky, I noticed that the VCF will no longer be sorted by position after the subsampling. So another pipe through bcftools sort may be necessary.

ADD REPLY
1
Entering edit mode
4 months ago
Michael 55k

While Pierrre's code is an efficient solution, it should be mentioned that it is not pipefail-safe because of the use of head. Therefore, this will fail if you want to use it in a strict bash setting or inside Snakemake without further error (return code 141).

To make it run, either set +o pipefail immediately before the code or replace:

  ... | sort -t $'\t'  -T . -k1,1g | head -n 200 ...

with

  ... | (sort -t $'\t'  -T . -k1,1g || true) | head -n 200 ...

Or inside a snakemake rule:

rule subsample_vcf:
  input: in.vcf
  output: out.vcf
  params: 
      nsnps=200
  shell:
      """
          bcftools view --header-only --samples-file samples.txt  {input} > {output}.unsorted
          bcftools view --samples-file samples.txt --no-header {input} |\
            awk '{{ printf("%f\t%s\n",rand(),$0);}}' | ( sort -t $'\t'  -T . -k1,1g || true) |\
            head -n {params.nsnps} | cut -f 2-  >> {output}.unsorted
            bcftools sort -o {output}  {output}.unsorted  ## ensure output is sorted by position
            ## sorting requires header, we need the intermediate file
            rm {output}.unsorted
      """

Edit: added the sorting step.

(note the extra space before print which I found is necessary or you get a syntax error)

ADD COMMENT
0
Entering edit mode

Brief elaboration in case this comment is too technical:

The exit code 141 is issued when the SIGPIPE process attempts to send output to a pipe, but that process (for instance, the process of reading from the other end of the pipe, as here) has already finished.

To avoid, user could either handle SIGPIPE differently or at minimum check if the pipe is still open before attempting to write, then coding some workaround if it isn't (or ignore it, I suppose).

ADD REPLY
0
Entering edit mode

It's really only an issue in Snakemake, and maybe Nexflow, otherwise the user could just turn pipefail off. As one user commented on SO, we're using sharp tools, so we cannot complain if we hurt ourselves using them properly. It may be good to know that the || true prevents it, but of course also masks any other error that arises in sort. The signal is coming from the upstream process of head, where it is head that is causing the 'error' or better works as designed.

ADD REPLY
0
Entering edit mode

Btw: the workflow I am using this in is now public: https://github.com/mdondrup/admixture_simulation Feel free to comment or reuse.

ADD REPLY

Login before adding your answer.

Traffic: 3262 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6