find homozygous SNPs between two samples's VCF files.
2
0
Entering edit mode
6.7 years ago
pramodkp97 ▴ 20

I have done variant calling by bcftools and got vcf files for 18 different samples. Now I want to get the number of SNPs between the samples. Ref or ALT allele should be in the homozygous condition. if one samples has homozygous Ref allele and other sample has homozygous allele then take them into consideration but if both the samples has homozygous ALT allele then ignore that becouse this would not be SNP between the samples. I want to ignore the hetero condition of alleles.

I tried vcf-compare that does not seems answering what i need.

snp genome vcffile assembly • 4.7k views
ADD COMMENT
0
Entering edit mode

do you have 18 VCF or one VCF with 18 samples ? Also , do you have searched biostars.org for similar posts ?

ADD REPLY
0
Entering edit mode

I have 18 vcf files and i also single vcf file for 18 samples made by vcf-merge command. I am okay to choose any one if the solution is there. I have tried searching in biostars.

ADD REPLY
1
Entering edit mode
6.7 years ago

single vcf file for 18 sample

using bioalcidaejdk and a multi-sample vcf: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

loop over each genotype (gx) vs another genotype(gy) , check ( gx.isHomRef && gy.isHomVar) OR (gy.isHomRef && gx.isHomVar)

 gunzip -c input.vcf.gz  |\
java -jar dist/bioalcidaejdk.jar -e 'final Map<String,Long> count = new HashMap<>(); stream().forEach(V->{for(int x=0;x+1< V.getNSamples();x++) {Genotype gx=V.getGenotype(x);if(!gx.isHom()) continue; for(int y=x+1;y< V.getNSamples();++y) {Genotype gy=V.getGenotype(y);if(!gy.isHom()) continue; if(!((gx.isHomVar() && gy.isHomRef()) || (gx.isHomRef() && gy.isHomVar()))) continue; final String pair=gx.getSampleName()+"\t"+gy.getSampleName();long n=count.getOrDefault(pair,0L);count.put(pair,n+1L);}}});count.forEach((K,V)->println(K+"\t"+V));' -F VCF


SAMPLEHOOA  SAMPLEHOZI  10
SAMPLEHOQL  SAMPLE0PU4  2
SAMPLE0Q1Q  SAMPLE0Q3Z  2
SAMPLEHOST  SAMPLE0Q09  2
SAMPLEHONL  SAMPLE0Q1N  2
(...)
ADD COMMENT
0
Entering edit mode

Thank you for the help but there is problem whicle running this command.

java -jar bioalcidaejdk.jar -e 'final Map<String,Long> count = new HashMap<>(); stream().forEach(V->{for(int x=0;x+1< V.getNSamples();x++) {Genotype gx=V.getGenotype(x);if(!gx.isHom()) continue; for(int y=x+1;y< V.getNSamples();++y) {Genotype gy=V.getGenotype(y);if(!gy.isHom()) continue; if(!((gx.isHomVar() && gy.isHomRef()) || (gx.isHomRef() && gy.isHomVar()))) continue; final String pair=gx.getSampleName()+"\t"+gy.getSampleName();long n=count.getOrDefault(pair,0L);count.put(pair,n+1L);}}});count.forEach((K,V)->println(K+"\t"+V));' -F VCF all18_samples_merged.vcf

It generated following:

[DEBUG][BioAlcidaeJdk] Compiling : 1 import java.util.; 2 import java.util.stream.; 3 import java.util.regex.; 4 import java.util.function.; 5 import htsjdk.samtools.; 6 import htsjdk.samtools.util.; 7 import htsjdk.variant.variantcontext.; 8 import htsjdk.variant.vcf.; 9 import com.github.lindenb.jvarkit.util.bio.fasta.FastaSequence; 10 import javax.annotation.Generated; 11 import htsjdk.variant.vcf.; 12 /* begin user's packages / 13 /* end user's packages */ 14 @Generated(value="BioAlcidaeJdk",date="2018-03-27T15:38:35+0530") 15 public class BioAlcidaeJdkCustom461705101 extends com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk.VcfHandler { 16 public BioAlcidaeJdkCustom461705101() { 17 } 18 @Override 19 public void execute() throws Exception { 20 // user's code starts here 21 final Map<string,long> count = new HashMap<>(); stream().forEach(V->{for(int x=0;x+1< V.getNSamples();x++) {Genotype gx=V.getGenotype(x);if(!gx.isHom()) continue; for(int y=x+1;y< V.getNSamples();++y) {Genotype gy=V.getGenotype(y);if(!gy.isHom()) continue; if(!((gx.isHomVar() && gy.isHomRef()) || (gx.isHomRef() && gy.isHomVar()))) continue; final String pair=gx.getSampleName()+"\t"+gy.getSampleName();long n=count.getOrDefault(pair,0L);count.put(pair,n+1L);}}});count.forEach((K,V)->println(K+"\t"+V)); 22 //user's code ends here 23 } 24 }

[WARN][SnpEffPredictionParser]no INFO[EFF] or no description. This VCF was probably NOT annotated with SnpEff (old version). But it's not a problem if this tool doesn't need to access SnpEff Annotations.

[WARN][VepPredictionParser]NO INFO[CSQ] found in header. This VCF was probably NOT annotated with VEP. But it's not a problem if this tool doesn't need to access VEP Annotations.

[WARN][AnnPredictionParser]no INFO[ANN] or no description This VCF was probably NOT annotated with SnpEff(ANN version) . But it's not a problem if this tool doesn't need to access SnpEff Annotations.

PLEASE give your suggestion what i am doing wrong.

Do i need to compress into .gz format and run exactly what you have suggested.

ADD REPLY
0
Entering edit mode

there is no problem, these are just some basic messages. You can hide them by redirecting stderr

java -jar bioalcidaejdk.jar -e '(...)' -F VCF all18_samples_merged.vcf 2> /dev/null

if there is no output, it means that there is no genotype matching your criteria.

ADD REPLY
0
Entering edit mode

after

... count.forEach((K,V)->println(K+"\t"+V));'

you can print the number of items in 'count'

... count.forEach((K,V)->println(K+"\t"+V)); System.out.println("SIZE:"+count.size());'
ADD REPLY
0
Entering edit mode

yes It is not generating any output but in reality there should be output of some SNPs between the samples.

ADD REPLY
0
Entering edit mode

show me a VCF line containing a positive test please.

ADD REPLY
0
Entering edit mode
----------
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  filtered_16S_1-1_default_sort.bam       filtered_16S_2-1_default_sort.bam       filtered_16S_3-1_default_sort.bam       filtered_16S_4-1_default_sort.bam       filtered_16S_5-1_default_sort.bam       filtered_16S_6-1_default_sort.bam       filtered_16S_7-1_default_sort.bam       filtered_16S_8-1_default_sort.bam       filtered_16S_9-1_default_sort.bam       filtered_16S_10-1_default_sort.bam      filtered_16S_11-1_default_sort.bam      filtered_16S_12-1_default_sort.bam      filtered_16S_13-1_default_sort.bam      filtered_16S_14-1_default_sort.bam      filtered_16S_15-1_default_sort.bam      filtered_16S_16-1_default_sort.bam      filtered_16S_17-1_default_sort.bam      filtered_16S_18-1_default_sort.bam     
----------
#chr01   28836   .       C       T       71.60   .       AC1=2;AC=6;AF1=1;AN=6;DP4=0,0,12,0;DP=12;FQ=-33;MQ=42;SF=0,5,13;VDB=5.960000e-02        GT:PL:GQ        1/1:67,6,0:10   .       .       .       .       1/1:152,21,0:39 .       .       .       .       .       .       .       1/1:92,9,0:16   .       .       .       .

Note: here .(dot) represent that the genotype is homozygous reference allele i.e. dot==CC allele. here 3 samples having ALTERNATIVE homozygous allele (1/1) and remaining 15 samples are .(dot). So what i want is that it should be SNP between between sample # 1 and aall 15 samples which are dot. sample # 6 and samples which are having dot and also for sample # 14 and all 15 samples which are dott.

but it should not be considered as SNP between samples 1, 6 and 14 becoz all are 1/1 means homozygous ALT alllele.

ADD REPLY
0
Entering edit mode

here .(dot) represent that the genotype is homozygous reference

how am I supposed to know that ? in the VCF spec this is missing data . You didn't mention it anywhere.

The API handling the genotypes is here: https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/variantcontext/Genotype.html : I leave it as an exercice to modify my script.

ADD REPLY
0
Entering edit mode
6.7 years ago
William ★ 5.3k

This can be done 2 two bcftools commands piped together.

The first command selects the two samples of interest out of a VCF that can have multiple samples The second commands filters on that 4 alleles have to be called in total, of which 2 alternative and excludes variants were any of the two samples is heterozygous.

bcftools view -s sample1,sample2 | bcftools view -i 'AN==4 & AC=2'  --genotype ^het
ADD COMMENT
0
Entering edit mode

I have VCF file named all18_sample.vcf . for first command i tried bcftools view -s sample1.bam,sample2.bam all18_sample.vcf

This is what it shows : <bcf_hdr_subsam> 2 samples in the list but not in BCF

There is no output .

ADD REPLY

Login before adding your answer.

Traffic: 2310 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6