Question

how to get the genotypes from vcf file.

2

Entering edit mode

8.8 years ago

fufuyou ▴ 110

Hi,

I have a vcf files with multiple samples. I want get the genotype data from the vcf file. I do not know how to do it. Could you tell me how I can get it.

Thanks,
Fuyou

Thank you everybody.

It is my fault I did not say clearly what I want to do.

For example, I have get my vcf file is

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample1   sample 2      sample3
Chr01   1353    .       T       C       2385.77 PASS    FS=0;MQ=51.64;MQ0=0;QD=29.02;SOR=0.818;BaseQRankSum=1.468;ClippingRankSum=-0.886;MQRankSum=-2.172;ReadPosRankSum=-1.03;DP=666;AF=1;MLEAC=2;MLEAF=1;AN=70;AC=67  GT:AD:DP:GQ:PL  1/1:0,15:15:48:707,48,0 0/1:36,28:64:99:1147,0,11882   1/1:0,14:14:42:630,42,0 1/1:0,29:29:87:1299,87,0

I want to get a file like as:

#CHROM  POS     ID      REF     ALT    sample1   sample 2      sample3
Chr01        1353    .            T       C         CT              CC              CC

Thanks,
Fuyou

SNP VCF • 13k views

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 8.8 years ago by fufuyou ▴ 110

1

Entering edit mode

You're not giving us much to work with. What genotypes? Why do you want the genotypes?

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by Zev.Kronenberg 12k

0

Entering edit mode

Thanks Zev,

It is my fault I did not say clearly what I want to do.

For example, I have get my vcf file is

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample1   sample 2      sample3
Chr01   1353    .       T       C       2385.77 PASS    FS=0;MQ=51.64;MQ0=0;QD=29.02;SOR=0.818;BaseQRankSum=1.468;ClippingRankSum=-0.886;MQRankSum=-2.172;ReadPosRankSum=-1.03;DP=666;AF=1;MLEAC=2;MLEAF=1;AN=70;AC=67  GT:AD:DP:GQ:PL  1/1:0,15:15:48:707,48,0 0/1:36,28:64:99:1147,0,11882   1/1:0,14:14:42:630,42,0 1/1:0,29:29:87:1299,87,0

I want to get a file like as:

#CHROM  POS     ID      REF     ALT    sample1   sample 2      sample3
Chr01        1353    .            T       C         CT              CC              CC

Thanks,
Fuyou

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by fufuyou ▴ 110

0

Entering edit mode

Thank you everybody.

It is my fault I did not say clearly what I want to do.

For example, I have get my vcf file is

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample1   sample 2      sample3
Chr01   1353    .       T       C       2385.77 PASS    FS=0;MQ=51.64;MQ0=0;QD=29.02;SOR=0.818;BaseQRankSum=1.468;ClippingRankSum=-0.886;MQRankSum=-2.172;ReadPosRankSum=-1.03;DP=666;AF=1;MLEAC=2;MLEAF=1;AN=70;AC=67  GT:AD:DP:GQ:PL  1/1:0,15:15:48:707,48,0 0/1:36,28:64:99:1147,0,11882   1/1:0,14:14:42:630,42,0 1/1:0,29:29:87:1299,87,0

I want to get a file like as:

#CHROM  POS     ID      REF     ALT    sample1   sample 2      sample3
Chr01        1353    .            T       C         CT              CC              CC

Thanks,
Fuyou

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by fufuyou ▴ 110

1

Entering edit mode

It is better if you include this as an Update to your original post and not add it as an answer (you can hit the edit button to edit your original post and add to the bottom, typically after putting in that it is an Update that follows). All you are doing is stripping out some fields. You can use linux tools to cut out columns from tabular data. Any of the programming options I mentioned in my post will also let you do this by iterating over the VCF records and only outputting the desired fields to a new output file.

ADD REPLY • link 8.8 years ago by DG 7.3k

0

Entering edit mode

Thank you very much!

ADD REPLY • link 8.8 years ago by fufuyou ▴ 110

0

Entering edit mode

If my answer solves your problem please upvote and accept as the answer. Much appreciated

ADD REPLY • link 8.8 years ago by DG 7.3k

Ram · Answer 1 · 2016-02-09

2

Entering edit mode

8.8 years ago

DG 7.3k

There are probably about a 1001 ways to do this, so which approach works best for you depends on what you want to do with the genotypes. You can use bcftools as suggested by Manuel (or vcftools, but bcftools is much faster. However output format can differ so sometimes one is preferred over the other depending on the desired format). If you want to do complex analyses with the genotypes you may want to write a script in python for instance, in which case PyVCF or the much faster cython implementation cyvcf would be really useful. There is also cyvcf2, which is even faster and written around htslib but has a very specific use-case it was designed for so it might not suit your purposes. And of course, because the data is columnar, you can always parse out the genotype columns using standard linux command line tools, parse the fields, and do something. But I typically recommend some of the specialized tools and coding modules that have been written, as they are much easier typically.

Update:

Using PyVCF or cyVCF in python you could do something like this:

import vcf
with open('output.txt', 'w') as output:
    vcf_reader = vcf.Reader(open('input.vcf', 'r'))
    for record in vcf_reader:
        output.write("{}\t{}\t{}\t{}\t{}\t{}".format(record.CHROM, record.POS. record.ID,                                                        record.REF, record.ALT))
        for sample in record.samples:
             output.write("\t{}".format(sample['GT']))
         output.write("\n")

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by DG 7.3k

0

Entering edit mode

@Dan Gaston:

This answer works well for one of the problem I have.

But, if say there is a PG field in FORMAT and it corresponding values in SAMPLE field. And, this PG isn't represented in everyline of the vcf how do I capture format(sample['PG']

I am getting following error:

AttributeError: 'CallData' object has no attribute 'PI'

How, do I invoke an exception and force the PI value to be period '.' when PI tag is missing?

ADD REPLY • link 7.9 years ago by kirannbishwa01 ★ 1.6k

0

Entering edit mode

This is a standard thing to do in Python. You can wrap your statement in a Try

try:
   format(sample['PG'])
except AttributeError:
   #do something

ADD REPLY • link 7.9 years ago by DG 7.3k

0

Entering edit mode

I was able to use except AttributeError: but the problem is when there is another tag which gives IndexError for some lines too. That way I have to make a combinations of except almost 4 times with different condition. And, if there are more tags that are missing in some parts of the vcf it would be unmanageable.

I was hoping there was a way something like this but it didn't work:

format(sample['PG'] else '.', sample ['GT'] else '.', sample['PI'] else '.') This way whereever a required tag/field is missing in FORMAT and SAMPLE field it would add it as default period (.). I tried to implement this in every possible way but I would always get no attribute 'PG' in _Call something something...

Problem 2: Another problem is getting the sample name and gt_bases in a for-loop. For sample in record.samples: called_name = record.genotype('sample_name') allele_base = called_name.gt_bases

Here sample_name should be given exclusively. Using sample['sample'] wouldn't work, and I would receive an errror.

So, I used:

    For sample in record.samples:
            called_name = (str(sample).split(' ')[0]).split('=')[1].strip(',')

This method would call the name in the for loop and let me print/write the value of allele-bases.

            allele_base = called_name.gt_bases

Is there a method more efficient for the above purpose in pyVCF module. I read through the documentation but couldn't fine one.

ADD REPLY • link 7.9 years ago by kirannbishwa01 ★ 1.6k

1

Entering edit mode

You can catch multiple exception types in a single exception. You can also catch ANY exception with just a bare except clause, but that is not considered to be good programming practice as any error will be caught by that exception block. But as you said, you are then still left with the problem of using a bunch of logic to figure out what needs to be replaced. Have you tried the or operator? I often do things like:

var = ",".join(list_of_vars) or None

to deal with things like missing lists or objects with nothing in a particular field. I suspect it will work here in your format strong and with using periods. Worth trying.

For your second problem record.samples is a list. You can also use the genotype() method to get a genotype for a specific sample using a name.

ADD REPLY • link 7.9 years ago by DG 7.3k

0

Entering edit mode

I will try the proposed solution for 1st problem.

Regarding 2nd problem, I had already tried what you suggested, but there's is a problem with the method. See this link: How to call genotype bases for all the samples?

ADD REPLY • link 7.9 years ago by kirannbishwa01 ★ 1.6k

0

Entering edit mode

Adding an answer in that question

ADD REPLY • link 7.9 years ago by DG 7.3k

Ram · Answer 2 · 2016-02-08

1

Entering edit mode

8.8 years ago

MAPK ★ 2.1k

I think you just want the GT column along with the AD column. I would also get all the columns including REF and ALT and make a nicely formatted tab separated table structure using R scripts.

Take a look at this:

http://samtools.github.io/hts-specs/VCFv4.1.pdf

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by MAPK ★ 2.1k

score 1 · Answer 3 · 2016-02-09

1

Entering edit mode

8.8 years ago

Manuel Landesfeind ★ 1.4k

See Query EXAMPLES at https://samtools.github.io/bcftools/bcftools.html#query

ADD COMMENT • link 8.8 years ago by Manuel Landesfeind ★ 1.4k