filtering a VCF file based on genotype
1
0
Entering edit mode
6.7 years ago
Bogdan ★ 1.4k

Dear all, I would appreciate having a piece of advice :

how shall I filter a VCF file in order to exclude the GERMLINE and TUMOR samples that have a GENOTYPE of "./".

in R/BioC, for example, if we aim to see the GENOTYPES in the vcf file, we receive the following messages :

> geno(vcf)$GT[,"TUMOR"]
Error in geno(vcf)$GT[, "TUMOR"] : subscript out of bounds
> geno(vcf)$GT[,"NORMAL"]
Error in geno(vcf)$GT[, "NORMAL"] : subscript out of bounds

thank you very much !

VCF • 3.1k views
ADD COMMENT
1
Entering edit mode

What tools are you using? "R/BioC" have multiple tools. Can you please give us the exact commands/scripts you're using?

ADD REPLY
1
Entering edit mode

Dear gentlemen, thank you for your replies and help. Here I am posting the R code that I am using to do the filtering of a VCF file that is generated by DELLY and contains INVERSIONs only .. Any suggestions on how to post/link the input VCF file would be welcome. Thank you very much !

library(devtools)
library(stringr)
library(dplyr)
library(ggplot2)

library(VCFWrenchR)
library(VariantAnnotation)
library(VariantTools)

library(GenomicRanges)
library(GenomicFeatures)
library(rtracklayer)
library(org.Hs.eg.db)

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(StructuralVariantAnnotation) 


vcf <- readVcf("vcf.INV3.vcf", 
                genome="hg38")

name <- "vcf.INV3.vcf"

TUMOR  <-  samples(header(vcf))[1] 
NORMAL <-  samples(header(vcf))[2]


DEPTH_threshold <- 8
FRACTION_threshold <- 0.2 



PASS_IMPRECISE_filters = function(x) { DV_germline <- ( geno(vcf)$DV[,NORMAL] < 1 )
                                       RV_germline <- ( geno(vcf)$RV[,NORMAL] < 1 )

                                       DV_tumor <- geno(vcf)$DV[,TUMOR]
                                       RV_tumor <- geno(vcf)$RV[,TUMOR]  

                                       DR_tumor <- geno(vcf)$DR[,TUMOR]
                                       RR_tumor <- geno(vcf)$RR[,TUMOR] 

                                       AD <-  (DV_tumor > DEPTH_threshold)      
                                       AF <- ((DV_tumor / (DV_tumor + DR_tumor)) > FRACTION_threshold)

                                       DV_germline &
                                       RV_germline & 
                                       AD &
                                       AF & 
                                       (filt(vcf) == "PASS") & 
                                       (info(vcf)$IMPRECISE) 
                                      }

vcf_PASS_IMPRECISE_FILTERS <- vcf[PASS_IMPRECISE_filters(vcf)]

writeVcf(vcf_PASS_IMPRECISE_FILTERS,
         file=paste(name, ".filter.PASS.IMPRECISE.vcf", sep=""))



PASS_PRECISE_filters = function(x) {   DV_germline <- ( geno(vcf)$DV[,NORMAL] < 1 )
                                       RV_germline <- ( geno(vcf)$RV[,NORMAL] < 1 )

                                       DV_tumor <- geno(vcf)$DV[,TUMOR]
                                       RV_tumor <- geno(vcf)$RV[,TUMOR]  

                                       DR_tumor <- geno(vcf)$DR[,TUMOR]
                                       RR_tumor <- geno(vcf)$RR[,TUMOR] 

                                       AD <- ((DV_tumor + RV_tumor) > DEPTH_threshold)      
                                       AF <- ((RV_tumor / (RV_tumor + RR_tumor)) > FRACTION_threshold) 

                                       DV_germline &
                                       RV_germline & 
                                       AD &
                                       AF & 
                                       (filt(vcf) == "PASS") & 
                                       (info(vcf)$PRECISE) 
                                    }


vcf_PASS_PRECISE_FILTERS <- vcf[PASS_PRECISE_filters(vcf)]

writeVcf(vcf_PASS_PRECISE_FILTERS, 
         file=paste(name, ".filter.PASS.PRECISE.vcf", sep=""))


vcf_PASS_COMBINED_FILTERS <- rbind(vcf_PASS_IMPRECISE_FILTERS, vcf_PASS_PRECISE_FILTERS) 

writeVcf(vcf_PASS_COMBINED_FILTERS, 
         file=paste(name, ".filter.PASS.PRECISE.and.IMPRECISE.and.AD.and.AF.vcf", sep=""))
ADD REPLY
0
Entering edit mode

Can you give some background on how the VCF was generated?

ADD REPLY
0
Entering edit mode

Please post the program you are using and the sample vcf file.

ADD REPLY
0
Entering edit mode

I will add to the comments from the 3 other great people. R would not be my 'go to' environment for filtering a VCF. To do what you want, just use bcftools view with the following parameter:

-m/M, --min-alleles/--max-alleles <int>     minimum/maximum number of alleles listed in REF and ALT (e.g. -m2 -M2 for biallelic sites)
ADD REPLY
0
Entering edit mode
6.7 years ago

from your question I understand that you have 1 single vcf file with 2 samples (germline and tumor) in it, and that you want to filter out all variants where 1 or the 2 samples have an empty genotype (./.)

if this is the case, a simple grep should be enough:

grep -vP '\t\./\.' file.vcf > file.non-empty-gt.vcf
ADD COMMENT
0
Entering edit mode

Thank you Jorge. Yes, at the end, I have done it with BCFTOOlS:

bcftools view --genotype ^miss $FILE  > "${FILE%.vcf}.fGT.vcf"

I was hoping to do everything in R, as it would be easier to assemble the entire pipeline in R (than to assemble all the scripts I have in a master bash shell .sh script ) .

ADD REPLY
2
Entering edit mode

Unfortunately, R is not the best for all tasks. You're better off using an automation/pipelining tool that is built for interoperability. People recommend snakemake, but I am not sure how well it would fit your use case, as I'm not conversant with snakemake. For the moment, you can either use system() calls from within R or Rscript/R commands within a shell script.

ADD REPLY
0
Entering edit mode

Thank you Ram for your help. Yes, i've used the "Rscript" calls from a shell script.

ADD REPLY
0
Entering edit mode

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

ADD REPLY

Login before adding your answer.

Traffic: 3589 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6