How to get sample names and genotype for SNP in multi-sample VCF file
3
12
Entering edit mode
6.8 years ago
hellbio ▴ 520

Dear all,

I have a multi-sample (>100) VCF file and a list of SNPs with CHR and POS for which the sample names and their genotypes need to be extracted.

For example, i have a SNP chr1:23455. I would like to know how many and which samples have homozygous alternate allele, heterozygous allele and homozygous reference allele.

Is there any existing tool that provides the above summary?

Many thanks!!!

SNP genotype GATK • 16k views
ADD COMMENT
34
Entering edit mode
6.8 years ago

Last update: May 9, 2021

You must normalise your VCF / BCF first; otherwise, this script will not work as expected. You can do this with:

bcftools norm -m-any MyVariants.vcf -Ov > MyVariants.Norm.vcf

I probably should explain what's going on here, too: It is divided into 4 parts (each part indicated by the starting <(bcftools on each line):

  1. First awk command: outputs the first few columns from the VCF/BCF

  2. Three BCFtools query commands: tabulate the number of samples having each variant type (as you requested). This will work for phased and/or un-phased variants. The output is the 3 columns named nHet, nHomAlt, nHomRef

  3. Three BCFtools view commands: look through the file again, saving sample names into an array called 'header', and then printing the indices of the array where a particular field (i.e. sample) has a particular type of variant. This is repeated for: a, heterozygous (HetSamples), b, homozygous (alt) (HomSamplesAlt), and c, homozygouse (ref) (HomSamplesRef).

I've tested it and verified results on a handful of 1000 Genome variants. I strongly encourage you to do some rigorous testing.

paste <(bcftools view MyVariants.Norm.vcf |\
    awk -F"\t" 'BEGIN {print "CHR\tPOS\tID\tREF\tALT"} \
      !/^#/ {print $1"\t"$2"\t"$3"\t"$4"\t"$5}') \
    \
  <(bcftools query -f '[\t%SAMPLE=%GT]\n' MyVariants.Norm.vcf |\
    awk 'BEGIN {print "nHet"} {print gsub(/0\|1|1\|0|0\/1|1\/0/, "")}') \
    \
  <(bcftools query -f '[\t%SAMPLE=%GT]\n' MyVariants.Norm.vcf |\
    awk 'BEGIN {print "nHomAlt"} {print gsub(/1\|1|1\/1/, "")}') \
    \
  <(bcftools query -f '[\t%SAMPLE=%GT]\n' MyVariants.Norm.vcf |\
    awk 'BEGIN {print "nHomRef"} {print gsub(/0\|0|0\/0/, "")}') \
    \
  <(bcftools view MyVariants.Norm.vcf | awk -F"\t" '/^#CHROM/ {split($0, header, "\t"); print "HetSamples"} \
    !/^#CHROM/ {for (i=10; i<=NF; i++) {if (gsub(/0\|1|1\|0|0\/1|1\/0/, "", $(i))==1) {printf header[i]","}; if (i==NF) {printf "\n"}}}') \
    \
  <(bcftools view MyVariants.Norm.vcf | awk -F"\t" '/^#CHROM/ {split($0, header, "\t"); print "HomSamplesAlt"} \
    !/^#CHROM/ {for (i=10; i<=NF; i++) {if (gsub(/1\|1|1\/1/, "", $(i))==1) {printf header[i]","}; if (i==NF) {printf "\n"}}}') \
    \
  <(bcftools view MyVariants.Norm.vcf | awk -F"\t" '/^#CHROM/ {split($0, header, "\t"); print "HomSamplesRef"} \
    !/^#CHROM/ {for (i=10; i<=NF; i++) {if (gsub(/0\|0|0\/0/,"", $(i))==1) {printf header[i]","}; if (i==NF) {printf "\n"}}}') \
    \
  | sed 's/,\t/\t/g' | sed 's/,$//g'


CHR POS     ID                  REF ALT nHet nHomAlt nHomRef    HetSamples  HomSamplesAlt   HomSamplesRef
1   10177   1:10177:10177:A:AC  A   AC  4   0   1   HG00096,HG00097,HG00099,HG00100     HG00101
1   10235   1:10235:10235:T:TA  T   TA  0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   10352   1:10352:10352:T:TA  T   TA  5   0   0   HG00096,HG00097,HG00099,HG00100,HG00101     
1   10616   1:10616:10616:CCGCCGTTGCAAAGGCGCGCCG:C  CCGCCGTTGCAAAGGCGCGCCG  C   0   5   0       HG00096,HG00097,HG00099,HG00100,HG00101 
1   10642   1:10642:10642:G:A   G   A   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   11008   1:11008:11008:C:G   C   G   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   11012   1:11012:11012:C:G   C   G   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   11063   1:11063:11063:T:G   T   G   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   13110   1:13110:13110:G:A   G   A   1   0   4   HG00097     HG00096,HG00099,HG00100,HG00101
1   13116   1:13116:13116:T:G   T   G   2   0   3   HG00097,HG00101     HG00096,HG00099,HG00100
1   13118   1:13118:13118:A:G   A   G   2   0   3   HG00097,HG00101     HG00096,HG00099,HG00100
1   13273   1:13273:13273:G:C   G   C   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   13284   1:13284:13284:G:A   G   A   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   13380   1:13380:13380:C:G   C   G   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   13483   1:13483:13483:G:C   G   C   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   13494   1:13494:13494:A:G   A   G   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   13550   1:13550:13550:G:A   G   A   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   14464   1:14464:14464:A:T   A   T   1   1   3   HG00099 HG00096 HG00097,HG00100,HG00101
1   14599   1:14599:14599:T:A   T   A   2   0   3   HG00097,HG00099     HG00096,HG00100,HG00101
1   14604   1:14604:14604:A:G   A   G   2   0   3   HG00097,HG00099     HG00096,HG00100,HG00101
1   14930   1:14930:14930:A:G   A   G   5   0   0   HG00096,HG00097,HG00099,HG00100,HG00101     
1   14933   1:14933:14933:G:A   G   A   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   15211   1:15211:15211:T:G   T   G   5   0   0   HG00096,HG00097,HG00099,HG00100,HG00101     
1   15245   1:15245:15245:C:T   C   T   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   15274   1:15274:15274:A:G   A   G   3   0   2   HG00096,HG00100,HG00101     HG00097,HG00099
1   15274   1:15274:15274:A:T   A   T   3   2   0   HG00096,HG00100,HG00101 HG00097,HG00099 
ADD COMMENT
0
Entering edit mode

Dear Kevin/Sam, these are exactly what I need. However, I am wondering how to output the results into a file (e.g. a text file) instead of printing all of them in the screen. I am sorry that I am a true beginner of this field.

ADD REPLY
0
Entering edit mode

Hey, you just need to use the '>' character, which instructs the shell to re-direct the output of the command. The default is to send the command output to STDOUT, i.e., the screen.

For example, look at these examples:

echo "test"
echo "test" > out.out

What is contained in out.out? Now do the same to the other commands that you are using.

ADD REPLY
1
Entering edit mode

Wonderful! This works out perfectly! I am able to save it into a text file by adding '>' at the end of the code. Thanks a million for your help! I have been bothered by the this task (get variant counts (0/1 or 1/1) per SNP and print the sample ID out) for the whole weekend. Finally I found this poster which saved my life!

ADD REPLY
0
Entering edit mode

Okay - great. If you have multi-allelic records in your VCF/BCF, though, then you should split these before running the above script. Multi-allelic records may look like this:

5  166777  rs3811360  C  A,T  .  PASS  .  GT  1/2

These should be split with bcftools norm -m-any and would then become:

5  166777  rs3811360  C  A  .  PASS  .  GT  0/1
5  166777  rs3811360  C  T  .  PASS  .  GT  0/1

Also, please note that the script works for both phased and / or un-phased genotypes. Please check some of your output to ensure that it does exactly what you need.

ADD REPLY
0
Entering edit mode

Hi Kevin,

I used your code with some modification to add AC,AN, and AF as I change (bcftools query -f '[\t%SAMPLE=%GT]\n MyVariants.Norm.vcf to bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%AC\t%AN\t%AF[\t%SAMPLE=%GT]\n' MyVariants.Norm.vcf | and also removing the first part

paste <(bcftools view MyVariants.Norm.vcf |\
    awk -F"\t" 'BEGIN {print "CHR\tPOS\tID\tREF\tALT"} \
      !/^#/ {print $1"\t"$2"\t"$3"\t"$4"\t"$5}') \)

But it neither give me an error nor output. Could you please let me know is the modification right?

Thanks

ADD REPLY
0
Entering edit mode

Hi, I have noticed this code gives some non corrisponding sample numbers and names...

ADD REPLY
0
Entering edit mode

Hi Kevin, this is great. It helped me with a previous analysis. How can we get each sample with their respective genotype nHet, nHomAlt, nHomRef (or Aa, aa, AA), for each SNP?

ADD REPLY
0
Entering edit mode

Hi! I'm trying to parallelize your code (to run on a bunch of files), but it is not working because of multiple processes spawning. Do you have any suggestions on how to parallelize? I tried with for loops and parallel without success

ADD REPLY
1
Entering edit mode

It is difficult for me to comment, as I am not too informed on your IT infrastructure. What I can say is that it would be better to run these piped commands on a single CPU core and to just be patient.

ADD REPLY
0
Entering edit mode

Do you have any suggestions on how to run the commands on a single core for all files in a folder? A simple for loop (like for i in /folder/*vcf) is spawing all processes in a row for many cores, even if I add sleep. Im currently using a centos server without slurm. I usually just use parallel, but this is also spawning hundreds of processes even if I use parallel -j 1

ADD REPLY
0
Entering edit mode

I ended up using a function to run parallel. I also made a small change to reduce the number of child processes (each < creates a new process. In theory, your code could be further optimized to make each unique bcftools call unified with only one awk. I did this with the first part of counting the genotypes)

myfunc() {
  parallelinput="$1"
  #strip prefix
  filenames1=( "${parallelinput##*/}" )



   paste <(bcftools view "$1" | awk -F"\t" 'BEGIN {print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"} !/^#/ {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8}') <(bcftools query -f '[\t%SAMPLE=%GT]\n' "$1"| awk 'BEGIN {OFS="\t"; print "nHomAlt\tnHet\tnHomRef"} {nHet=gsub(/0\|1|1\|0|0\/1|1\/0/, ""); nHomAlt=sub(/1\|1|1\/1/, ""); nHomRef=gsub(/0\|0|0\/0/, ""); print nHomAlt,nHet,nHomRef}') <(bcftools view "$1" | awk -F"\t" '/^#CHROM/ {split($0, header, "\t"); print "HetSamples"} !/^#CHROM/ {for (i=10; i<=NF; i++) {if (gsub(/0\|1|1\|0|0\/1|1\/0/, "", $(i))==1) {printf header[i]","}; if (i==NF) {printf "\n"}}}') <(bcftools view "$1" | awk -F"\t" '/^#CHROM/ {split($0, header, "\t"); print "HomSamplesAlt"} !/^#CHROM/ {for (i=10; i<=NF; i++) {if (gsub(/1\|1|1\/1/, "", $(i))==1) {printf header[i]","}; if (i==NF) {printf "\n"}}}') <(bcftools view "$1" | awk -F"\t" '/^#CHROM/ {split($0, header, "\t"); print "HomSamplesRef"} !/^#CHROM/ {for (i=10; i<=NF; i++) {if (gsub(/0\|0|0\/0/,"", $(i))==1) {printf header[i]","}; if (i==NF) {printf "\n"}}}') | sed 's/,\t/\t/g' | sed 's/,$//g' >> /home/count/"$filenames1"_count.vcf
    }
    export -f myfunc

    parallel -j 1 myfunc  {}  ::: /home/vcf/*vcf
ADD REPLY
0
Entering edit mode

Great work - thank you for posting the follow-up.

ADD REPLY
4
Entering edit mode
6.4 years ago

use bioalcidaejdk http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html and the following script:

// all the types of Genotypes see https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/variantcontext/GenotypeType.html
final List<GenotypeType> all_types = Arrays.asList(GenotypeType.values());
//print the header
println("CHROM\tPOS\tREF\tALT\t"+all_types.stream().map(T->"count."+T.name()+"\tsample."+T.name()).collect(Collectors.joining("\t")));
//for each variant
stream().forEach(V->println(
    //print the contig
    V.getContig()+"\t"+
    // print the position
    V.getStart()+"\t"+
    // print the REF alleles
    V.getReference().getDisplayString()+"\t"+
    // print the ALT alleles
    V.getAlternateAlleles().stream().map(A->A.getDisplayString()).collect(Collectors.joining(","))+"\t"+
    // loop over each type
    all_types.stream().map(T->

        String.valueOf(
            // collect all genotypes
            V.getGenotypes().stream().
            //genotype must have type T
            filter(G->G.getType().equals(T)).
            count()) +
        "\t"+
        // collect all genotypes
        V.getGenotypes().stream().
            //genotype must have type T
            filter(G->G.getType().equals(T)).
            //convert to sample name
            map(G->G.getSampleName()).
            // join string
            collect(Collectors.joining(","))
        ).
        collect(Collectors.joining("\t")))
    );

a one-liner:

$ java -jar dist/bioalcidaejdk.jar -e 'final List<GenotypeType> all_types = Arrays.asList(GenotypeType.values());println("CHROM\tPOS\tREF\tALT\t"+all_types.stream().map(T->"count."+T.name()+"\tsample."+T.name()).collect(Collectors.joining("\t")));stream().forEach(V->println( V.getContig()+"\t"+ V.getStart()+"\t"+ V.getReference().getDisplayString()+"\t"+ V.getAlternateAlleles().stream().map(A->A.getDisplayString()).collect(Collectors.joining(","))+"\t"+ all_types.stream().map(T-> String.valueOf( V.getGenotypes().stream(). filter(G->G.getType().equals(T)). count()) + "\t"+ V.getGenotypes().stream(). filter(G->G.getType().equals(T)). map(G->G.getSampleName()). collect(Collectors.joining(",")) ). collect(Collectors.joining("\t"))) );'  src/test/resources/rotavirus_rf.vcf.gz 


CHROM   POS REF ALT count.NO_CALL   sample.NO_CALL  count.HOM_REF   sample.HOM_REF  count.HET   sample.HET  count.HOM_VAR   sample.HOM_VAR  count.UNAVAILABLEsample.UNAVAILABLE count.MIXED sample.MIXED
RF01    970 A   C   0       4   S1,S2,S3,S4 0       1   S5  0       0   
RF02    251 A   T   0       3   S1,S4,S5    2   S2,S3   0       0       0   
RF02    578 G   A   0       4   S1,S2,S3,S5 0       1   S4  0       0   
RF02    877 T   A   0       4   S2,S3,S4,S5 1   S1  0       0       0   
RF02    1726    T   G   0       3   S1,S4,S5    2   S2,S3   0       0       0   
RF02    1962    TACA    TA  0       4   S2,S3,S4,S5 1   S1  0       0       0   
RF03    1221    C   G   0       3   S1,S4,S5    0       2   S2,S3   0       0   
RF03    1242    C   A   0       4   S1,S2,S3,S5 0       1   S4  0       0   
RF03    1688    T   G   0       4   S1,S2,S3,S4 0       1   S5  0       0   
ADD COMMENT
0
Entering edit mode

@Pierre, could this also get INFO column in the VCF along with CHROM,POS,REF and ALT?

ADD REPLY
0
Entering edit mode

@pierre I have played around the command V.getAttribute and figured out that it needs a key which is to be printed from the INFO column. For example it prints the value associated with "INTERNAL" in the INFO filed by V.getAttribute("INTERNAL"). However, i would like to print all the keys and its values from the INFO column, while each row may have different key's sometimes.

ADD REPLY
0
Entering edit mode

in other words, simply print the INFO column to the output along with CHR, POS, ID, REF, ALT, INFO, count.NOCALL.... as it is in the VCF file.

ADD REPLY
0
Entering edit mode
variant.getAttributes().entrySet().stream().forEach(KV->print(KV.getKey()+" "+KV.getValue()))
ADD REPLY
0
Entering edit mode

java -jar bioalcidaejdk.jar -e 'final List<genotypetype> all_types = Arrays.asList(GenotypeType.values()); println("CHROM\tPOS\tREF\tALT\tINFO\t"+all_types.stream().map(T->"count."+T.name()+"\tsample."+T.name()).collect(Collectors.joining("\t"))); stream().forEach(V->println( V.getContig()+"\t"+ V.getStart()+"\t"+ V.getReference().getDisplayString()+"\t"+ V.getAlternateAlleles().stream().map(A->A.getDisplayString()).collect(Collectors.joining(","))+"\t"+ V.getAttributes().entrySet().stream().forEach(KV->print(KV.getKey()+" "+KV.getValue())).collect(Collectors.joining(","))+"\t" +all_types.stream().map(T-> String.valueOf( V.getGenotypes().stream(). filter(G->G.getType().equals(T)). count()) + "\t"+ V.getGenotypes().stream(). filter(G->G.getType().equals(T)). map(G->G.getSampleName()). collect(Collectors.joining(",")) ). collect(Collectors.joining("\t"))) );' input.vcf

This gives the error:

1 error [SEVERE][BioAlcidaeJdk]java.lang.RuntimeException: java.lang.RuntimeException: Cannot compile java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: Cannot compile at com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk$AbstractHandlerFactory.getConstructor(BioAlcidaeJdk.java:906) at com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk$VcfHandlerFactory.execute(BioAlcidaeJdk.java:937) at com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk.doWork(BioAlcidaeJdk.java:1377) at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMain(Launcher.java:760) at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMainWithExit(Launcher.java:923) at com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk.main(BioAlcidaeJdk.java:1402) Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Cannot compile at com.github.lindenb.jvarkit.lang.OpenJdkCompiler$DefaultOpenJdkCompiler.compileClass(OpenJdkCompiler.java:266) at com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk$AbstractHandlerFactory.getConstructor(BioAlcidaeJdk.java:898) ... 5 more Caused by: java.lang.RuntimeException: Cannot compile at com.github.lindenb.jvarkit.lang.OpenJdkCompiler$DefaultOpenJdkCompiler.exec(OpenJdkCompiler.java:180) at com.github.lindenb.jvarkit.lang.OpenJdkCompiler$DefaultOpenJdkCompiler.compileClass(OpenJdkCompiler.java:243) ... 6 more [INFO][Launcher]bioalcidaejdk Exited with failure (-1)

is the syntax somewhere incorrect.

ADD REPLY
0
Entering edit mode

@_r_am it does give the same error.

ADD REPLY
0
Entering edit mode

Completely got blind to find the source of error. How could we add this code to print INFO to the original code in a correct manner without the error.

ADD REPLY
0
Entering edit mode

The above command/script works well but fails when there is spanning deletions marked with * in the ALT field.

[SEVERE][BioAlcidaeJdk]The provided VCF file is malformed at approximately line number 11883: unparsable vcf record with allele *TGTGTGTGTGTGTGTGTGTGTGTGTGTGT

Could this allow such calls?

ADD REPLY
0
Entering edit mode

Could this allow such calls?

no, it's not supported by the hsjdk library. https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/variant/variantcontext/Allele.java#L385

ADD REPLY
4
Entering edit mode
6.8 years ago
Ram 44k

I'd use bcftools query to output a file containing %GT, and then process that file using R or python or even awk

ADD COMMENT
0
Entering edit mode

Yes, take a look at my previous answer here: A: calculate Per variant Heterozygosity from VCF file

ADD REPLY
0
Entering edit mode

So,bcftools norm -m - -Ou $VCF | your_one_liner?

ADD REPLY
0
Entering edit mode

Something along those lines - yes. It is critical to normalise the VCF first.

Also, this does not output sample names that have each genotype. That would require extra coding

ADD REPLY
1
Entering edit mode

Sample names with each GT would be bcftools query -f "... %GT"

ADD REPLY
0
Entering edit mode

do you mean that it should be done in steps? to get counts and sample names?

ADD REPLY
0
Entering edit mode

From what i can see, it would be better to break it up into different steps and then merge the output. I know how to do it but don't have time right now. I may try later today. Of course, somebody else may have a solution by then.

ADD REPLY
0
Entering edit mode

Yep, you'll need at least 2 - a "map" step to get genotypes for each individual at each site and a "reduce" step, to get counts or a CSV-list of per-genotype individuals.

ADD REPLY
0
Entering edit mode

I've posted an answer below, in part because you'll appreciate the greater room in viewing the code. Thanks to Ram for input (upvote)

ADD REPLY

Login before adding your answer.

Traffic: 2463 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6