Taking the difference of two VCFs (or removing singletons)
4
1
Entering edit mode
10.3 years ago
hermathena ▴ 40

Dear All,

Is there a way to take a difference of two VCF files? GATK can be used to take a Union or an Intersection, but I need the difference. There are two applications:

  1. remove singletons. I have a VCF of all the SNPs and a VCF of the private ones. I need a VCF with the non-private SNPs.
  2. get the non-CDS sequence. I can make a VCF of the exome SNPs by filtering against a gff file. I would also like to get the SNPs from non-CDS regions - which could be the difference of all SNPs and exome SNPs.

Any ideas, please?

Many thanks,

Krzysztof Kozak
Zoology
University of Cambridge

genome sequencing singleton SNP filter • 4.2k views
ADD COMMENT
0
Entering edit mode

Hello,

Thank you all for the suggestions, this looks promising!

Best,

Chris

ADD REPLY
2
Entering edit mode
10.3 years ago

You can use either vcftools or bcftools. You'll just use the isec command with the -C (complement) option. Note that this is position based rather than exact variant based.

ADD COMMENT
1
Entering edit mode
10.3 years ago
Kizuna ▴ 880

Regarding point 1.

I think you can do it with R.

try to transform your 2 vcf files into Dataframes (DF1 and DF2) and then subset the content of your chromosomic position of the DF2 containing the private variants from the one having all variants (DF1)

this is an example:

DF1<-read.delim("....\allSNPs.vcf",header=T,sep="")
DF2<-read.delim("....\private.SNPs.vcf",header=T,sep="")
singletons.DF1<-DF1[!(DF1$chromosomic.position %in% DF2$chromosomic.positon),]
ADD COMMENT
0
Entering edit mode

The VariantAnnotation package contains a VRanges class that extends GRanges and would be convenient in this instance.

ADD REPLY
0
Entering edit mode
10.3 years ago

I wrote a tool to include/exclude the variants in a VCF file: https://github.com/lindenb/jvarkit/wiki/VcfIn

ADD COMMENT
0
Entering edit mode
7 months ago
Andres ▴ 20

I know this is an old question but when searching for filtering out singletons(and doubletons too) i found this post in the first results. My solution to this problem for a diploid organism, purely in bcftools is:

bcftools view -i 'COUNT(GT="alt")>1' samples.vcf

This is assuming that all singletons will have the alternate allele. If singletons can be found with the ref allele i would add the following:

bcftools view -i 'COUNT(GT="alt")>1 | COUNT(GT="ref")>1' samples.vcf

This logic can be extended further to only heterozygous singletons and such

ADD COMMENT

Login before adding your answer.

Traffic: 2318 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6