I wanted to extract a VCF for a given sample from gnomAD (https://gnomad.broadinstitute.org/downloads) and do downstream analysis on it. My question is: why do they provide both a VCF and a .tbi file? If the .tbi is a tabix index file, why do I get an error when I use it in the following command?
bcftools view -s HGDP00076 gnomad.genomes.v3.1.2.sites.chr1.vcf.bgz.tbi
[E::hts_hopen] Failed to open file gnomad.genomes.v3.1.2.sites.chr1.vcf.bgz.tbi
[E::hts_open_format] Failed to open file "gnomad.genomes.v3.1.2.sites.chr1.vcf.bgz.tbi" : Exec format error
Failed to read from gnomad.genomes.v3.1.2.sites.chr1.vcf.bgz.tbi: Exec format error
I'm using the gnomAD v3 variants/genomes data from that link (gnomad.genomes.v3.1.2.hgdp_tgp.chr1.vcf.bgz). The VCF contains multiple samples. This command takes a very long time to process. Is there any workaround for that?
Well, the file for chr1 is 261G, so it should take anywhere from a few minutes to a few hours to complete. You can always try the
--threads
option of bcftools to speed things up.
Yes, but is there any workaround to extract a particular sample? It takes a really long time! That's why I wanted to know whether indexing helps, or whether threads are the only option.
Indexing only helps if you only need a genomic interval.
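To illustrate the point about intervals: the .tbi is never passed to bcftools directly. You give it the .vcf.bgz, and bcftools looks for the index next to it when a region is requested with -r. The sketch below assumes both files are in the current directory; the interval and the output file name are made-up examples, not values from this thread.

```shell
#!/bin/sh
# Sketch: extract one sample for one genomic interval using the tabix index.
vcf="gnomad.genomes.v3.1.2.hgdp_tgp.chr1.vcf.bgz"
sample="HGDP00076"
region="chr1:1000000-2000000"             # assumed example interval
out="${sample}.chr1.region.vcf.gz"        # hypothetical output name

# Run only if bcftools, the VCF and its .tbi index are all available.
if command -v bcftools >/dev/null 2>&1 && [ -f "$vcf" ] && [ -f "$vcf.tbi" ]; then
    # -r uses the index to jump to the interval instead of scanning the whole file.
    bcftools view -r "$region" -s "$sample" -Oz -o "$out" "$vcf"
else
    echo "skipped: need bcftools, $vcf and $vcf.tbi in the current directory"
fi
```

Without -r (or -R), the whole file still has to be read, which is why the index does not help when the sample is wanted across the full chromosome.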
OK, got it! Thank you so much. So I'm concluding that this command:
bcftools view -s HGDP00076 gnomad.genomes.v3.1.2.hgdp_tgp.chr1.vcf.bgz --threads
is the only way to do this quickly.
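Note that --threads expects an integer argument. As far as I can tell from the bcftools manual, the extra threads are used for compressing the output stream, so they should only matter when the output is written compressed (-Oz or -Ob). A minimal sketch, where the thread count and the output file name are assumptions:

```shell
#!/bin/sh
# Sketch: extract one sample from the whole chromosome, writing compressed output.
vcf="gnomad.genomes.v3.1.2.hgdp_tgp.chr1.vcf.bgz"
sample="HGDP00076"
threads=4                          # assumed value; adjust to the machine
out="${sample}.chr1.vcf.gz"        # hypothetical output name

if command -v bcftools >/dev/null 2>&1 && [ -f "$vcf" ]; then
    # -Oz writes bgzip-compressed VCF; then index the result with tabix (-t)
    # so that later region queries on the single-sample file are fast.
    bcftools view -s "$sample" --threads "$threads" -Oz -o "$out" "$vcf" \
        && bcftools index -t "$out"
else
    echo "skipped: bcftools or $vcf not found"
fi
```

Once the single-sample file exists, downstream analysis can work on it instead of rereading the 261G original.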
Don't use -s HGDP00076, as there is no genotype/sample in this file. (copy/pasted from Pierre's message)
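Whether a file carries a given sample can be checked before running -s by listing the sample names in the header with bcftools query -l, which is quick because only the header is needed. The sketch below uses the sites file from the first command as an assumed example; a sites-only VCF prints no sample names at all.

```shell
#!/bin/sh
# Sketch: check whether a sample is present in a VCF before subsetting with -s.
vcf="gnomad.genomes.v3.1.2.sites.chr1.vcf.bgz"   # assumed: file to check
sample="HGDP00076"

if command -v bcftools >/dev/null 2>&1 && [ -f "$vcf" ]; then
    # query -l prints one sample name per line; grep -x matches the whole line.
    if bcftools query -l "$vcf" | grep -qx "$sample"; then
        status="present"
    else
        status="absent"
    fi
else
    status="unchecked"             # bcftools or the file is missing
fi
echo "$sample in $vcf: $status"
```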