Question

Extract samples from a bcf files based on genotype?

0

Entering edit mode

5.6 years ago

ste.lu ▴ 80

Hi All,

I am looking for a way (possibly using bcftools) to extract samples based on the variant, but I haven't found it yet.

For instance:

CHROM        POS            REF           ALT         SAMPLE.A              SAMPLE.B               SAMPLE.C
chr1         10             A             TAA         0/0                   0/1                    1/1
chr1         10             C             GGA         0/0                   0/0                    0/1

I want to have samples that are Het or Hom for the SNP 1-10-A-TAA. so the Ideal output should be something similar to:

chr1         10             A             TAA         SAMPLE.B               SAMPLE.C

The file I am working with are huge so I would need the best computational approach.

So far I came out with this:

bcftools query -r chr1:1-15 -i "GT=="AA" & GT=="AR"' -f '%CHR %POS %REF %ALT [/t%SAMPLE=%GT]\n' file_name.bcf

Any suggestions which can improve the speed?

Thanks to whoever spend some time to help me.

sequencing next-gen genome snp • 2.6k views

ADD COMMENT • link updated 5.6 years ago by chrchang523 11k • written 5.6 years ago by ste.lu ▴ 80

score 1 · Answer 1 · 2019-11-30

What's the speed bottleneck here? Are you performing thousands of queries of this type? Or are you performing a smaller number of very large queries?

If you're performing lots of queries, but the number of variants and samples is not that large, bcftools query should be fine, as long as your BCF is indexed.

If you're performing a smaller number of very large queries, and you only have biallelic variants, plink 1.9 is likely to outperform bcftools. You can convert your data to plink-format with

plink --bcf <full BCF filename> --const-fid 0 --out converted ;

your sample query can then be executed with

plink --bfile converted --keep-allele-order --chr 1 --from-bp 1 --to-bp 15 --recode rlist --out result .

See https://www.cog-genomics.org/plink/1.9/formats#rlist for a description of the format of this command's main output file (result.rlist). Positional information for the variants can be looked up in converted.bim, or if you use a plink 1.9 build from 30 Nov 2019 or later, result.map (--recode rlist forgot to generate the .map file in older plink 1.9 builds).

score 0 · Answer 2 · 2019-11-30

0

Entering edit mode

5.6 years ago

Pierre Lindenbaum 166k

using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

 java -jar dist/bioalcidaejdk.jar -e 'stream().forEach(V->println(V.getContig()+ " "+V.getStart()+ " "+V.getReference().getDisplayString()+" "+V.getAlternateAlleles().stream().map(A->A.getDisplayString()).collect(Collectors.joining(","))+" "+V.getGenotypes().stream().filter(G->!(G.isNoCall() || G.isHomRef())).map(G->G.getSampleName()).collect(Collectors.joining("\t"))));' src/test/resources/rotavirus_rf.vcf.gz


RF01 970 A C S5
RF02 251 A T S2 S3
RF02 578 G A S4
RF02 877 T A S1
RF02 1726 T G S2    S3
RF02 1962 TACA TA S1
RF03 1221 C G S2    S3
RF03 1242 C A S4
RF03 1688 T G S5
RF03 1708 G T S5
(...)

ADD COMMENT • link 5.6 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Hi Pierre,

Thanks a lot for your quick answer. Unfortunately, is note that easy to try it, I am working on HPC and they are quite strict about installing new packages. I'll try to git clone it.

Thanks again.

ADD REPLY • link 5.6 years ago by ste.lu ▴ 80

1

Entering edit mode

compile on your side and 'scp' the tool on your cluser...

ADD REPLY • link 5.6 years ago by Pierre Lindenbaum 166k