I have a number of single-sample-VCF-files and also got for each of them a bed file, consisting of the regions with DP > 10.
What I now want to get is a new bed file, which consists only of regions where 90% of the samples have DP > 10 based on those single bed files. So basically I want a bed file consisting of regions where at least 90% of the primary bed files overlap.
I already know how to work with Granges in R, however I do not know, how to solve the "90%"-thing.
If you have GRanges objects representing regions for each sample, you could call coverage on each of these and then since coverage objects can be added together (cov3 = cov1 + cov2) you could simply add the coverages (you could do this in a loop). After adding all the coverages, you can then simply slice() the coverage object at a depth equal to 90% of your samples. For instance, if you have 10 samples, slice at a depth of 9 to find regions covered by at least 9 samples (90% of your samples). In case individual samples have overlapping regions, you could first reduce them prior to taking coverage(), so that any given sample will only produce a depth of 1 on the genome.
I'm not sure if I fully understood your situation correctly, but you may want to check out the pyvcfsubmodule I wrote if you are familiar with Python:
from fuc import pyvcf
vcf_files =['sample1.vcf', 'sample2.vcf', ..., 'sampleN.vcf']
bed_files =['sample1.bed', 'sample2.bed', ..., 'sampleN.bed']
vf_list =[]for i, vcf_file in enumerate(vcf_files):
vf = pyvcf.VcfFrame.from_file(vcf_file)# Create a VcfFrame object for each sample.
filtered_vf = vf.filter_bed(bed_files[i])# Filter each VcfFrame by the sample's BED file.
vf_list.append(filtered_vf)
merged_vf = pyvcf.merge(vf_list, how='outer')# Merge all the VcfFrame objects.
final_vf = merged_vf.filter_sampnum(0.9)# Select rows where 90% of the samples have the variant.
bf = final_vf.to_bed()# This creates a BedFrame object.
bf.to_file('final.bed')
You may want to explore the filter options in
bcftools view
- that would mean filtering the VCF files on the command line and not in R.