Question

Isolating and extracting regions in VCF file

1

14 months ago

iarmir ▴ 10

I want to extract two specific positions, chr17:63480412 (rs3730025) and chr17:63481680 (rs56394458), from each of my several hundred VCF files into a separate VCF file containing only those SNPs.

I only want to annotate these regions. Isolating them will let me annotate them many times faster.

How do I do this?

ANNOVAR vcftools bcftools GATK VCF • 1.6k views

ADD COMMENT • link updated 14 months ago by Jeremy Leipzig 23k • written 14 months ago by iarmir ▴ 10

0

Prior threads that can help:

Extracting regions around sites in VCF file and writing them as fasta sequences/files
Isolate a Region in a Vcf File to make a Smaller Vcf File

ADD REPLY • link 14 months ago by GenoMax 150k

0

I see you tagged bcftools. Have you looked at the documentation?

ADD REPLY • link 14 months ago by WouterDeCoster 47k

score 2 · Answer 1 · 2024-01-19

2

14 months ago

Jeremy Leipzig 23k

Once you really have hundreds of files, it's time to consider a variant warehouse. Then you will be spending your time querying instead of manipulating VCF files.

One open-source warehouse is TileDB-VCF (repo):

import tiledbvcf

vcf_uri = "my_vcf_dataset"

# Create the dataset and ingest samples (write mode).
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="w")
ds.create_dataset(enable_variant_stats=True, enable_allele_count=True)
# sample_uris expects a list of bgzipped, indexed VCF paths.
ds.ingest_samples(sample_uris=["input.vcf.gz"])

# Reopen the same dataset in read mode to query it.
ds = tiledbvcf.Dataset(uri=vcf_uri, mode="r")
df = ds.read(
    regions=["chr17:63480412-63480412", "chr17:63481680-63481680"],
    attrs=["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
)
df

We do have some scripts that can help with VEP annotation on TileDB arrays.

More info here (blog) and here (docs)

ADD COMMENT • link 14 months ago by Jeremy Leipzig 23k

0

Thanks a lot! I had never heard of TileDB.

Can I write it so that all my VCF files are passed as input, like this?

ds.ingest_samples(sample_uris = "*.DeepVariant.vcf.gz")

Will it work correctly?

ADD REPLY • link 14 months ago by iarmir ▴ 10

0

For wildcards on a local disk, you can use the CLI:

tiledbvcf store --uri my_vcf_dataset *.DeepVariant.vcf.gz

Or just use Python's glob module:

import glob

ds.ingest_samples(sample_uris = glob.glob("*.DeepVariant.vcf.gz"))

ADD REPLY • link 14 months ago by Jeremy Leipzig 23k

GenoMax · Answer 2 · 2024-01-19

1

14 months ago

Istvan Albert 102k

As the manual states, you can use the -r (--regions) or the -R (--regions-file) parameters:

https://samtools.github.io/bcftools/bcftools.html

It would look like:

bcftools view -r chr17:63480412,chr17:63481680 input.vcf -o output.vcf

or

bcftools view -R regions.bed input.vcf -o output.vcf
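
For the two positions in the question, the regions file passed to -R could be generated as sketched below. This is an illustration, not part of the original answer: the file name regions.bed matches the command above, while the helper names bed_lines and write_bed are hypothetical. It assumes BED semantics, since bcftools treats files ending in .bed as 0-based and half-open, so the 1-based position 63480412 becomes the interval 63480411 to 63480412.

```python
from pathlib import Path

# 1-based positions taken from the question (rs3730025 and rs56394458).
POSITIONS = [("chr17", 63480412), ("chr17", 63481680)]


def bed_lines(positions):
    """Convert 1-based single-base positions to 0-based, half-open BED rows."""
    return [f"{chrom}\t{pos - 1}\t{pos}" for chrom, pos in positions]


def write_bed(positions, path="regions.bed"):
    """Write one tab-separated BED row per position and return the output path."""
    out = Path(path)
    out.write_text("\n".join(bed_lines(positions)) + "\n")
    return out
```

Calling write_bed(POSITIONS) writes regions.bed in the current directory. If a plain tab-delimited file (without the .bed suffix) is used instead, bcftools reads the coordinates as 1-based and inclusive, so no subtraction would be needed.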

Now, to operate on all files, use GNU parallel. Assuming your filenames are listed in a file, it would look something like:

cat filenames.txt | parallel bcftools view -R regions.bed {} -o {}.small.vcf

ADD COMMENT • link 14 months ago by Istvan Albert 102k

0

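Use input.vcf.gz or input.bcf: for -r/-R the file must be bgzipped and indexed, so it cannot be a plain .vcf.

Alternatively, use bcftools view -T (--targets-file), which streams through the file and does not need an index.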
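
Putting the points in this thread together, one way to process every bgzipped file is sketched below in Python. It is an illustration under stated assumptions rather than a tested pipeline: bcftools is on PATH, the inputs match *.vcf.gz in the current directory, regions.bed already exists, and the function names and the .small.vcf.gz output suffix are hypothetical. Each file is indexed if no .tbi or .csi is found next to it, then subset with -R.

```python
import glob
import os
import shutil
import subprocess

SUFFIX = ".vcf.gz"
OUT_SUFFIX = ".small.vcf.gz"  # hypothetical naming for the subset files


def build_commands(vcf_path, bed_path="regions.bed"):
    """Return the bcftools commands (as argument lists) for one bgzipped VCF."""
    out_path = vcf_path[: -len(SUFFIX)] + OUT_SUFFIX
    cmds = []
    # -R needs an index; build a .tbi only when neither .tbi nor .csi exists.
    if not (os.path.exists(vcf_path + ".tbi") or os.path.exists(vcf_path + ".csi")):
        cmds.append(["bcftools", "index", "--tbi", vcf_path])
    # -O z writes bgzipped VCF output to the path given with -o.
    cmds.append(["bcftools", "view", "-R", bed_path, "-O", "z", "-o", out_path, vcf_path])
    return cmds


def subset_all(pattern="*" + SUFFIX, bed_path="regions.bed"):
    """Run the index and subset commands for every file matching the pattern."""
    if shutil.which("bcftools") is None:
        raise RuntimeError("bcftools not found on PATH")
    for vcf_path in sorted(glob.glob(pattern)):
        if vcf_path.endswith(OUT_SUFFIX):
            continue  # skip outputs left over from a previous run
        for cmd in build_commands(vcf_path, bed_path):
            subprocess.run(cmd, check=True)
```

Calling subset_all() processes the files one at a time. The file names are passed to bcftools as arguments, so nothing is decompressed with zcat first; the same per-file command could equally be driven by GNU parallel as in the answer above.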

ADD REPLY • link 14 months ago by Pierre Lindenbaum 166k

0

Thanks a lot! Can I use this?

zcat *.vcf.gz | parallel bcftools view -R regions.bed {} -o {}.small.vcf

My VCF files are in .vcf.gz format.

Should the brackets after regions.bed be left empty?

ADD REPLY • link updated 14 months ago by GenoMax 150k • written 14 months ago by iarmir ▴ 10