Question

How to get sample names based on genotype from multi-sample vcf file

0

Entering edit mode

9.5 years ago

hellbio ▴ 520

Hi,

I have multi-sample vcf file and an example variant is shown below:

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    03-071    04-051    04-071    06-044    07-085    10-009
chr1    6526093    .    T    C    197.77    .    AC1=1;AC=1;AF1=0.5    GT:GQ:DP:PL:AD    0/1    1/1    0/0    1/1    1/1    0/1

For each variant I would need to retrieve the sample names based on genotype. If the genotype is 0/1 it should output the first 5 columns and the sample names in the 6th column.

chr1    6526093    .    T    C    03-071,10-009

If the genotype is 1/1 it should output:

chr1    6526093    .    T    C    04-051,06-044,07-085

The original file has >500 sample and I would need to get the output in the above format. Are there any tools which can do this to some extent and further tweaking to get the desire output format?

vcf • 4.2k views

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 9.5 years ago by hellbio ▴ 520

Ram · Answer 1 · 2015-11-25

0

Entering edit mode

9.5 years ago

Pierre Lindenbaum 166k

using my tool BioAlcidae

the script:

while(iter.hasNext())
    {
    var ctx = iter.next();
    for(var i = 0;i< ctx.getNSamples();++i)
        {
        var g = ctx.getGenotype(i);        
        out.print(ctx.contig);
        out.print("\t");
        out.print(ctx.start);
         out.print("\t");
        out.print(ctx.getID());
        out.print("\t"+g.getSampleName());
        var alleles = g.getAlleles();
        for(var j=0;j< alleles.size();++j)
            {
            out.print("\t"+alleles.get(j).getDisplayString());
            }
         out.println();
        }
    }

the vcf:

##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr1,length=249250621,assembly=hg19>
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    S1    S2    S3    S4    S5    S6    S7    S8
chr1    1    .    T    A,C    .    .    .    GT    0/0    0/1    1/1    1/2    2/2    0/2    ./.    2/1

run

java -jar dist-1.139/bioalcidae.jar -f script.js input.vcf

chr1    1    .    S1    T    T
chr1    1    .    S2    T    A
chr1    1    .    S3    A    A
chr1    1    .    S4    A    C
chr1    1    .    S5    C    C
chr1    1    .    S6    T    C
chr1    1    .    S7        
chr1    1    .    S8    C    A

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 9.5 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Thanks. This looks like printing each variant for each sample which would be very large output especially for WGS vcf file with 500 samples as it will print 500 rows for 1 variant. Instead, it would be helpful if it could print only the variant in single row and add additional columns fwith sample names to represent hom/het variants.

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.5 years ago by hellbio ▴ 520

0

Entering edit mode

so the script:

while(iter.hasNext())
    {
    var ctx = iter.next();
    var samples={};
    for(var i = 0;i< ctx.getNSamples();++i)
        {

        var gi = ctx.getGenotype(i);
        if( gi.getSampleName() in samples ) continue;
        out.print(ctx.contig);
        out.print("\t");
        out.print(ctx.start);
         out.print("\t");
        out.print(ctx.getID());
        out.print("\t");
        out.print(gi.getSampleName());
        samples[gi.getSampleName()]=1;
        for(var j=i+1;j<ctx.getNSamples();++j)
            {
            var gj = ctx.getGenotype(j);
            if( !(gi.sameGenotype(gj))) continue;
            samples[gj.getSampleName()]=1;
            out.print(";"+gj.getSampleName());
            }
        var alleles = gi.getAlleles();
        for(var j=0;j< alleles.size();++j)
            {
            out.print("\t"+alleles.get(j).getDisplayString());
            }
         out.println();
        }

    }

output:

chr1    1    .    S1    T    T
chr1    1    .    S2    T    A
chr1    1    .    S3    A    A
chr1    1    .    S4;S8    A    C
chr1    1    .    S5    C    C
chr1    1    .    S6    T    C
chr1    1    .    S7

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.5 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Thanks for you time. Could we achieve it this way.

chr1    1    .    S2,S4,S6,S8   S3,S5

The script need not bother about the multiple alternate alleles. It only has to consider 0/1 or 1/1 or 1/2 or 2/2. The above output has heterozygote sample names in 4th column and homozygote sample names in 5th column.

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.5 years ago by hellbio ▴ 520

0

Entering edit mode

Could we just save the updated script and run or should we download the tool and install?

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.5 years ago by hellbio ▴ 520

0

Entering edit mode

uh ??

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.5 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

this is just a simple loop. I let this as an exercise...

ADD REPLY • link 9.5 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

I tried to download and install but unsuccessful.

$ git clone "https://github.com/lindenb/jvarkit.git"
cd jvarkit/
make bioalcidae

Terminated with the below error:

(cd /Users/aru/Downloads/jvarkit/htsjdk-1.139 && ant )
/bin/bash: ant: command not found
make: *** [/Users/aru/Downloads/jvarkit/htsjdk-1.139/dist/htsjdk-1.139.jar] Error 127

Could you help to work through the tool.

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.5 years ago by hellbio ▴ 520

0

Entering edit mode

ant is required https://github.com/lindenb/jvarkit/wiki/Compilation#requirements--dependencies

ADD REPLY • link 9.5 years ago by Pierre Lindenbaum 166k