How to remove shared SNP from a VCF with multiple individuals
2.2 years ago
cuamatzi • 0

Hi!

I have a VCF produced by MafFilter with 29 samples, in the following format (trimmed to 5 strains for easier reading):

##fileformat=VCFv4.0 
##fileDate=202291
##source=Bio++
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=gap,Description="At least one sequence contains a gap">
##FILTER=<ID=unk,Description="At least one sequence contains an unresolved character">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Reference   Strain01    Strain02    Strain03    Strain04    Strain05 
chr09   191 .   G   A   .   PASS    AC=1    GT  0   1   0   0   0   0
chr09   1229    .   T   C   .   PASS    AC=1    GT  0   0   0   1   0   0
chr09   1233    .   T   G   .   PASS    AC=1    GT  0   0   0   1   0   0
chr03   121013  .   G   T   .   PASS    AC=29   GT  0   1   1   1   1   1
chr03   121017  .   G   A   .   PASS    AC=29   GT  0   1   1   1   1   1
chr16   551745  .   T   A   .   PASS    AC=28   GT  0   0   1   1   1   1
chr16   552420  .   A   G   .   PASS    AC=26   GT  0   1   1   0   1   1

This VCF derives from a multiple genome alignment. Reference is my reference genome and Strain01 is a collection strain; Strain02-29 are clones derived from Strain01 that were exposed to mutagens.

I'd like to remove all the SNPs present in Strain01 from the rest of my strains.

I used the following bcftools command:

bcftools view -e'AC=29' input.vcf.gz | bgzip -c > output.vcf.gz

This excludes all variants with AC=29 (that is, variants present in all 29 strains). However, in some cases one or more strains lack a SNP from Strain01 while the rest carry it (e.g. AC=26 or AC=28). I can set a threshold (e.g. 20) and use:

bcftools view -e'AC>20' input.vcf.gz | bgzip -c > output.vcf.gz

But some strains could still carry SNPs present in Strain01 below that threshold. I was thinking of splitting the VCF into individual VCF files for each strain and then using bcftools isec or vcf-isec, but I'd prefer to work with the full VCF.

Is there a tool or command where I can indicate Strain01 as my background and remove its contribution from all my strains?

Thank you in advance!

VCF background filtering remove • 833 views
2.2 years ago

I'm not sure I understand clearly, but you could use jvarkit vcffilterjdk (http://lindenb.github.io/jvarkit/VcfFilterJdk.html). The following command returns all variants where ANY sample other than Strain01 has a genotype different from Strain01:

java -jar ${JVARKIT_DIST}/vcffilterjdk.jar -e 'final Genotype g=variant.getGenotype("Strain01"); return variant.getGenotypes().stream().filter(G->!G.getSampleName().equals(g.getSampleName())).anyMatch(G->!G.sameGenotype(g));' in.vcf
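
If the goal is instead to drop every site where Strain01 itself carries a non-reference allele, regardless of what the clones show, a plain awk filter is one possible sketch. It assumes an uncompressed, tab-delimited VCF on stdin and that GT is the first FORMAT field; filter_background is a hypothetical helper name, not part of bcftools or jvarkit. Note that this is stricter than the command above: a site such as chr16:552420, where Strain01 is ALT and one clone is REF, is removed rather than kept.

```shell
# Sketch only: remove sites where the background sample has an ALT allele.
# filter_background is a hypothetical helper, not an existing tool.
# $1 = background sample name (e.g. Strain01); VCF text is read from stdin.
filter_background() {
    awk -F'\t' -v bg="$1" '
        /^##/ { print; next }                  # meta lines pass through
        /^#CHROM/ {
            # locate the column of the background sample (samples start at 10)
            for (i = 10; i <= NF; i++) if ($i == bg) col = i
            if (!col) { print "sample not found: " bg > "/dev/stderr"; exit 1 }
            print; next
        }
        {
            split($col, f, ":")                # assumption: GT is the first FORMAT field
            if (f[1] ~ /[1-9]/) next           # background carries an ALT allele: drop site
            print                              # background is REF or missing: keep site
        }'
}
```

It could be used as `zcat input.vcf.gz | filter_background Strain01 | bgzip -c > output.vcf.gz`. Sites where the Strain01 genotype is missing (`.`) are kept, which may or may not be what you want.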