Hello,
I have a vcf file and want to find the counts for how many variants are occurring for each of the 3 samples independently. I have a vcf that shows the different variants between the samples (ex. 0/1 0/0 1/1 vs 0/1 0/1 0/1). I have tried some of the code and help from other posts but nothing is properly working. I have tried the codes below and they did not work or were incorrectly counting.
Any help would be greatly appreciated. Thank you!
$ java -jar dist/bioalcidaejdk.jar -e 'stream().flatMap(V->V.getGenotypes().stream().filter(G->G.isCalled()&&!G.isHomRef()).map(G->V.getContig()+"\t"+G.getSampleName())).collect(Collectors.groupingBy(Function.identity(),Collectors.counting())).forEach((K,C)->println(K+":"+C));' example.vcf
$ java -jar dist/bioalcidaejdk.jar -e 'final Map<String,Long> count=new HashMap<>(); stream().forEach(V->{for(int i=0;i +1 < V.getNSamples();i++) for(int j=i+1;j < V.getNSamples();j++) {final Genotype gi= V.getGenotype(i);final Genotype gj= V.getGenotype(j); if(gi.sameGenotype(gj)) {String key=gi.getSampleName()+"_"+gj.getSampleName(); long n = 1L + (count.containsKey(key)?count.get(key):0L); count.put(key,n); };} }); count.forEach((K,V)->println(K+" "+V)); ' example.vcf
Below are 2 examples from the codes used above. The first code didn't seem to work as I went through a .hmp file of the variants on excel and manually went through a couple of chromosomes counting for a sample and the numbers didn't add up for that chromosome. The second code gave back counts that were the same for all 3 samples and I know that isn't correct.
[DEBUG][BioAlcidaeJdk] Compiling :
1 import java.util.*;
2 import java.util.stream.*;
3 import java.util.regex.*;
4 import java.util.function.*;
5 import htsjdk.samtools.*;
6 import htsjdk.samtools.util.*;
7 import htsjdk.variant.variantcontext.*;
8 import htsjdk.variant.vcf.*;
9 import com.github.lindenb.jvarkit.pedigree.*;
10 import com.github.lindenb.jvarkit.util.bio.fasta.FastaSequence;
11 import com.github.lindenb.jvarkit.math.RangeOfIntegers;
12 import com.github.lindenb.jvarkit.math.RangeOfDoubles;
13 import com.github.lindenb.jvarkit.util.Counter;
14 /** begin user's packages */
15 /** end user's packages */
16 @javax.annotation.processing.Generated(value="BioAlcidaeJdk",date="2020-07-13T19:35:26-0400")
17 public class BioAlcidaeJdkCustom190253546 extends com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk.VcfHandler {
18 public BioAlcidaeJdkCustom190253546() {
19 }
20 @Override
21 public void execute() throws Exception {
22 // user's code starts here
23 stream().flatMap(V->V.getGenotypes().stream().filter(G->G.isCalled()&&!G.isHomRef()).map(G->V.getContig()+"\t"+G.getSampleName())).collect(Collectors.groupingBy(Function.identity(),Collectors.counting())).forEach((K,C)->println(K+":"+C));
24 //user's code ends here
25 }
26 }
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by htsjdk.samtools.util.CloserUtil (file:/home/kris/jvarkit/dist/bioalcidaejdk.jar) to method com.sun.xml.internal.stream.XMLEventReaderImpl.close()
WARNING: Please consider reporting this to the maintainers of htsjdk.samtools.util.CloserUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
[WARN][SnpEffPredictionParser]no INFO[EFF] or no description. This VCF was probably NOT annotated with SnpEff (old version). But it's not a problem if this tool doesn't need to access SnpEff Annotations.
[WARN][VepPredictionParser]NO INFO[CSQ] found in header. This VCF was probably NOT annotated with VEP. But it's not a problem if this tool doesn't need to access VEP Annotations.
[WARN][AnnPredictionParser]no INFO[ANN] or no description This VCF was probably NOT annotated with SnpEff(ANN version) . But it's not a problem if this tool doesn't need to access SnpEff Annotations.
NC_044378.1 Can_Top.sort:280
NC_044374.1 Can_Bottom.sort:532
NC_044375.1 Can_Mid.sort:406
NW_022060378.1 Can_Top.sort:2
NW_022060402.1 Can_Bottom.sort:1
NW_022060485.1 Can_Top.sort:1
NC_044375.1 Can_Bottom.sort:410
NW_022060494.1 Can_Top.sort:1
NW_022060455.1 Can_Bottom.sort:1
NC_044378.1 Can_Mid.sort:307
Second code example:
[DEBUG][BioAlcidaeJdk] Compiling :
1 import java.util.*;
2 import java.util.stream.*;
3 import java.util.regex.*;
4 import java.util.function.*;
5 import htsjdk.samtools.*;
6 import htsjdk.samtools.util.*;
7 import htsjdk.variant.variantcontext.*;
8 import htsjdk.variant.vcf.*;
9 import com.github.lindenb.jvarkit.pedigree.*;
10 import com.github.lindenb.jvarkit.util.bio.fasta.FastaSequence;
11 import com.github.lindenb.jvarkit.math.RangeOfIntegers;
12 import com.github.lindenb.jvarkit.math.RangeOfDoubles;
13 import com.github.lindenb.jvarkit.util.Counter;
14 /** begin user's packages */
15 /** end user's packages */
16 @javax.annotation.processing.Generated(value="BioAlcidaeJdk",date="2020-07-13T19:16:03-0400")
17 public class BioAlcidaeJdkCustom2039591783 extends com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk.VcfHandler {
18 public BioAlcidaeJdkCustom2039591783() {
19 }
20 @Override
21 public void execute() throws Exception {
22 // user's code starts here
23 final Map<String,Long> count=new HashMap<>(); stream().forEach(V->{for(int i=0;i +1 < V.getNSamples();i++) for(int j=i+1;j < V.getNSamples();j++) {final Genotype gi= V.getGenotype(i);final Genotype gj= V.getGenotype(j); if(gi.sameGenotype(gj)) {String key=gi.getSampleName()+"_"+gj.getSampleName(); long n = 1L + (count.containsKey(key)?count.get(key):0L); count.put(key,n); };} }); count.forEach((K,V)->println(K+" "+V));
24 //user's code ends here
25 }
26 }
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by htsjdk.samtools.util.CloserUtil (file:/home/kris/jvarkit/dist/bioalcidaejdk.jar) to method com.sun.xml.internal.stream.XMLEventReaderImpl.close()
WARNING: Please consider reporting this to the maintainers of htsjdk.samtools.util.CloserUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
[WARN][SnpEffPredictionParser]no INFO[EFF] or no description. This VCF was probably NOT annotated with SnpEff (old version). But it's not a problem if this tool doesn't need to access SnpEff Annotations.
[WARN][VepPredictionParser]NO INFO[CSQ] found in header. This VCF was probably NOT annotated with VEP. But it's not a problem if this tool doesn't need to access VEP Annotations.
[WARN][AnnPredictionParser]no INFO[ANN] or no description This VCF was probably NOT annotated with SnpEff(ANN version) . But it's not a problem if this tool doesn't need to access SnpEff Annotations.
Can_Bottom.sort_Can_Mid.sort 2563
Can_Mid.sort_Can_Top.sort 2563
Can_Bottom.sort_Can_Top.sort 2563
the first code counts the genotypes carrying a ALT variant per chromosome per sample
the second code count the nomber of time the genotype for a sample is the same that for another sample.