I have a very large multisample (N=240) VCF file from a reduced-representation sequencing experiment (similar to RADseq). I want to apply some basic filtering options and I cannot find an easy, straightforward way to do it. I will write down what I am doing, and I would appreciate any ideas that are less time/resource consuming.
Filter low coverage: I want to filter genotypes (not samples, not sites) that have a DP below 5. That is, remove the genotype information for a particular site AND a particular sample that did not reach enough read depth.
Filter high coverage: I want to filter genotypes (not samples, not sites) that have a DP above the 99th percentile within the sample. That is, remove the genotype information for a particular site AND a particular sample that has very high coverage, likely indicating a mapping problem or a repetitive region.
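To make it concrete, this is the kind of change I am after at the genotype level (a toy VCF line with FORMAT GT:DP and two made-up samples; the first genotype has DP=3 and should become a no-call, the second one is kept):
#CHROM  POS   ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  SAMPLE_A  SAMPLE_B
chr1    1234  .   A    G    .     PASS    .     GT:DP   0/1:3     1/1:20    (before)
chr1    1234  .   A    G    .     PASS    .     GT:DP   ./.:3     1/1:20    (after the low-coverage filter)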
I have read a lot of documentation (GATK, Picard FilterVcf, bcftools...) and I am not able to find an easy solution. For the first filter, I can use GATK VariantFiltration in this way:
gatk VariantFiltration -V input.vcf.gz -O output.vcf.gz --genotype-filter-name "LOWDP" --genotype-filter-expression "DP < 5" --set-filtered-genotype-to-no-call true
But how could I do the same for the second filter? Any ideas?
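For what it is worth, the closest bcftools equivalent I could find for this kind of genotype masking is the +setGT plugin. A sketch for the fixed low-DP threshold, assuming the plugin ships with your bcftools installation (-t q selects the genotypes matched by the -i expression, -n . sets them to missing):
bcftools +setGT input.vcf.gz -Oz -o lowdp_masked.vcf.gz -- -t q -n . -i 'FMT/DP<5'
But that still leaves me with the same problem for the second filter, because the threshold is different for every sample.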
In my desperation, I found another ‘solution’, which is more a problem than a solution: split the VCF file per sample using bcftools:
for sample in $(bcftools query -l input.vcf.gz); do
    bcftools view -Oz -s "$sample" input.vcf.gz -o "${sample}.vcf.gz"
done
Then, now that each VCF file contains only one sample, I can easily use bcftools to apply both filters, keeping in each filtered file only the sites that pass the thresholds:
LIST=$(ls *.vcf.gz)
for vcf_file in $LIST; do
    # Extract sample name from file name
    sample_name=$(basename "$vcf_file" .vcf.gz)
    #echo $sample_name
    # First filtering step: exclude sites with coverage below 5
    bcftools view -Oz -i 'FORMAT/DP>=5' "$vcf_file" -o "filter1_${sample_name}.vcf.gz"
    # Second filtering step: exclude sites with coverage above the 99th percentile
    # Calculate the per-sample 99th percentile of DP, ignoring missing values
    percentile_99=$(bcftools query -f '[%DP]\n' "$vcf_file" | \
        grep -v '^\.$' | \
        sort -n | \
        awk '{a[NR]=$1} END{print a[int(NR*0.99)]}')
    #echo $percentile_99
    # Apply the filter with bcftools:
    bcftools view -Oz -i "FORMAT/DP<=${percentile_99}" "filter1_${sample_name}.vcf.gz" -o "filter2_${sample_name}.vcf.gz"
done
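The only shortcut I can think of is at least getting the per-sample 99th percentiles in a single pass over the big file instead of per split file. A rough, untested sketch (per_sample_DP99.txt is just a placeholder name, and the last awk step keeps all non-missing DP values in memory):
bcftools query -f '[%SAMPLE\t%DP\n]' input.vcf.gz | \
    awk -F'\t' '$2 != "."' | \
    sort -k1,1 -k2,2n | \
    awk -F'\t' '{dp[$1, ++n[$1]] = $2} END {for (s in n) print s "\t" dp[s, int(n[s] * 0.99)]}' > per_sample_DP99.txt
But I would still have to split the file anyway to apply a different DP cut-off to each sample, so it does not really solve the problem.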
The problem with this approach is that splitting the VCF file takes a very long time (5 days and counting, so far). Then I will end up with 240 ‘small’ VCF files (1.2 GB each) that still have to go through the filtering loop, which will take who knows how long.
Since this is a very basic filtering step, has anyone encountered the same problem and applied a better solution? As an added note, I have been advised not to use custom scripts to parse VCF files, and not to use vcftools since it is no longer maintained. Any thoughts on this too?
Thanks a lot!
You're trying to treat VCF, a text format designed for transmitting information, as an entire analysis platform. You need to get this file into a typical data-science-capable environment such as Python or R, where these row-based analyses will be much easier.
I was trying to avoid this since I am not sure I will be capable of scripting something myself, but it makes more sense for sure! Thanks!
Use a workflow manager to parallelize.
Furthermore, as Jeremy Leipzig said, you should use a custom program (Python, etc.) to run this kind of job.
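For example, even plain GNU parallel (not a full workflow manager, just a sketch assuming it is installed; -j 16 is an arbitrary job count) would turn the per-sample split into concurrent jobs, and the filtering loop can be fanned out the same way:
bcftools query -l input.vcf.gz | \
    parallel -j 16 'bcftools view -Oz -s {} input.vcf.gz -o {}.vcf.gz'
A proper workflow manager such as Snakemake or Nextflow additionally gives you retries, resume and resource control.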
I could have parallelized, good idea. It won't solve the space issue but it will be faster. Thanks!
You could also take a look at Hail; however, it has quite a steep learning curve IMO.