how to get the genotypes from vcf file.
3
2
Entering edit mode
8.8 years ago
fufuyou ▴ 110

Hi,

I have a vcf files with multiple samples. I want get the genotype data from the vcf file. I do not know how to do it. Could you tell me how I can get it.

Thanks,
Fuyou

Thank you everybody.

It is my fault I did not say clearly what I want to do.

For example, I have get my vcf file is

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample1   sample 2      sample3
Chr01   1353    .       T       C       2385.77 PASS    FS=0;MQ=51.64;MQ0=0;QD=29.02;SOR=0.818;BaseQRankSum=1.468;ClippingRankSum=-0.886;MQRankSum=-2.172;ReadPosRankSum=-1.03;DP=666;AF=1;MLEAC=2;MLEAF=1;AN=70;AC=67  GT:AD:DP:GQ:PL  1/1:0,15:15:48:707,48,0 0/1:36,28:64:99:1147,0,11882   1/1:0,14:14:42:630,42,0 1/1:0,29:29:87:1299,87,0

I want to get a file like as:

#CHROM  POS     ID      REF     ALT    sample1   sample 2      sample3
Chr01        1353    .            T       C         CT              CC              CC

Thanks,
Fuyou

SNP VCF • 13k views
ADD COMMENT
1
Entering edit mode

You're not giving us much to work with. What genotypes? Why do you want the genotypes?

ADD REPLY
0
Entering edit mode

Thanks Zev,

It is my fault I did not say clearly what I want to do.

For example, I have get my vcf file is

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample1   sample 2      sample3
Chr01   1353    .       T       C       2385.77 PASS    FS=0;MQ=51.64;MQ0=0;QD=29.02;SOR=0.818;BaseQRankSum=1.468;ClippingRankSum=-0.886;MQRankSum=-2.172;ReadPosRankSum=-1.03;DP=666;AF=1;MLEAC=2;MLEAF=1;AN=70;AC=67  GT:AD:DP:GQ:PL  1/1:0,15:15:48:707,48,0 0/1:36,28:64:99:1147,0,11882   1/1:0,14:14:42:630,42,0 1/1:0,29:29:87:1299,87,0

I want to get a file like as:

#CHROM  POS     ID      REF     ALT    sample1   sample 2      sample3
Chr01        1353    .            T       C         CT              CC              CC

Thanks,
Fuyou

ADD REPLY
0
Entering edit mode

Thank you everybody.

It is my fault I did not say clearly what I want to do.

For example, I have get my vcf file is

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample1   sample 2      sample3
Chr01   1353    .       T       C       2385.77 PASS    FS=0;MQ=51.64;MQ0=0;QD=29.02;SOR=0.818;BaseQRankSum=1.468;ClippingRankSum=-0.886;MQRankSum=-2.172;ReadPosRankSum=-1.03;DP=666;AF=1;MLEAC=2;MLEAF=1;AN=70;AC=67  GT:AD:DP:GQ:PL  1/1:0,15:15:48:707,48,0 0/1:36,28:64:99:1147,0,11882   1/1:0,14:14:42:630,42,0 1/1:0,29:29:87:1299,87,0

I want to get a file like as:

#CHROM  POS     ID      REF     ALT    sample1   sample 2      sample3
Chr01        1353    .            T       C         CT              CC              CC

Thanks,
Fuyou

ADD REPLY
1
Entering edit mode

It is better if you include this as an Update to your original post and not add it as an answer (you can hit the edit button to edit your original post and add to the bottom, typically after putting in that it is an Update that follows). All you are doing is stripping out some fields. You can use linux tools to cut out columns from tabular data. Any of the programming options I mentioned in my post will also let you do this by iterating over the VCF records and only outputting the desired fields to a new output file.

ADD REPLY
0
Entering edit mode

Thank you very much!

ADD REPLY
0
Entering edit mode

If my answer solves your problem please upvote and accept as the answer. Much appreciated

ADD REPLY
2
Entering edit mode
8.8 years ago
DG 7.3k

There are probably about a 1001 ways to do this, so which approach works best for you depends on what you want to do with the genotypes. You can use bcftools as suggested by Manuel (or vcftools, but bcftools is much faster. However output format can differ so sometimes one is preferred over the other depending on the desired format). If you want to do complex analyses with the genotypes you may want to write a script in python for instance, in which case PyVCF or the much faster cython implementation cyvcf would be really useful. There is also cyvcf2, which is even faster and written around htslib but has a very specific use-case it was designed for so it might not suit your purposes. And of course, because the data is columnar, you can always parse out the genotype columns using standard linux command line tools, parse the fields, and do something. But I typically recommend some of the specialized tools and coding modules that have been written, as they are much easier typically.

Update:

Using PyVCF or cyVCF in python you could do something like this:

import vcf
with open('output.txt', 'w') as output:
    vcf_reader = vcf.Reader(open('input.vcf', 'r'))
    for record in vcf_reader:
        output.write("{}\t{}\t{}\t{}\t{}\t{}".format(record.CHROM, record.POS. record.ID,                                                        record.REF, record.ALT))
        for sample in record.samples:
             output.write("\t{}".format(sample['GT']))
         output.write("\n")
ADD COMMENT
0
Entering edit mode

@Dan Gaston:

This answer works well for one of the problem I have.

But, if say there is a PG field in FORMAT and it corresponding values in SAMPLE field. And, this PG isn't represented in everyline of the vcf how do I capture format(sample['PG']

I am getting following error:

AttributeError: 'CallData' object has no attribute 'PI'

How, do I invoke an exception and force the PI value to be period '.' when PI tag is missing?

ADD REPLY
0
Entering edit mode

This is a standard thing to do in Python. You can wrap your statement in a Try

try:
   format(sample['PG'])
except AttributeError:
   #do something
ADD REPLY
0
Entering edit mode

I was able to use except AttributeError: but the problem is when there is another tag which gives IndexError for some lines too. That way I have to make a combinations of except almost 4 times with different condition. And, if there are more tags that are missing in some parts of the vcf it would be unmanageable.

I was hoping there was a way something like this but it didn't work:

format(sample['PG'] else '.', sample ['GT'] else '.', sample['PI'] else '.') This way whereever a required tag/field is missing in FORMAT and SAMPLE field it would add it as default period (.). I tried to implement this in every possible way but I would always get no attribute 'PG' in _Call something something...

Problem 2: Another problem is getting the sample name and gt_bases in a for-loop. For sample in record.samples: called_name = record.genotype('sample_name') allele_base = called_name.gt_bases

Here sample_name should be given exclusively. Using sample['sample'] wouldn't work, and I would receive an errror.

So, I used:

    For sample in record.samples:
            called_name = (str(sample).split(' ')[0]).split('=')[1].strip(',')

This method would call the name in the for loop and let me print/write the value of allele-bases.

            allele_base = called_name.gt_bases

Is there a method more efficient for the above purpose in pyVCF module. I read through the documentation but couldn't fine one.

ADD REPLY
1
Entering edit mode

You can catch multiple exception types in a single exception. You can also catch ANY exception with just a bare except clause, but that is not considered to be good programming practice as any error will be caught by that exception block. But as you said, you are then still left with the problem of using a bunch of logic to figure out what needs to be replaced. Have you tried the or operator? I often do things like:

var = ",".join(list_of_vars) or None

to deal with things like missing lists or objects with nothing in a particular field. I suspect it will work here in your format strong and with using periods. Worth trying.

For your second problem record.samples is a list. You can also use the genotype() method to get a genotype for a specific sample using a name.

ADD REPLY
0
Entering edit mode

I will try the proposed solution for 1st problem.

Regarding 2nd problem, I had already tried what you suggested, but there's is a problem with the method. See this link: How to call genotype bases for all the samples?

ADD REPLY
0
Entering edit mode

Adding an answer in that question

ADD REPLY
1
Entering edit mode
8.8 years ago
MAPK ★ 2.1k

I think you just want the GT column along with the AD column. I would also get all the columns including REF and ALT and make a nicely formatted tab separated table structure using R scripts.

Take a look at this:

http://samtools.github.io/hts-specs/VCFv4.1.pdf

ADD COMMENT
1
Entering edit mode
ADD COMMENT

Login before adding your answer.

Traffic: 1771 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6