How to extract FORMAT information (GT, PG, PI, etc) from each SAMPLE in a vcf file using pyvcf in python?
2
0
Entering edit mode
8.0 years ago
kirannbishwa01 ★ 1.6k

I want to extract the GT, PG and PI information from all the samples of a vcf. I have python script and reading throught pyVCF module I can extract information from one sample but not from all. And, I am looking for the solution scrictly in python using pyvcf. I can probably read the file (vcf) as txt and parse through each line but I am hoping that pyVCF can be used (which I prefer).

Below is the structure of my vcf file:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  2ms01e  2ms02g  2ms03g  2ms04h
2   1738    .   A   G   4693.24 PASS    AC=2;AF=0.250;AN=8;set=Intersection GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW      0|1:389,92:481:99:.,.,.,.,.:1.0:0|1:1020:1748,0,12243:0|1       0/0:318,0:318:99:.:.:0/0:.:0,120,1800:0/0       0|1:270,53:323:99:.,.,.,.,.:1.0:0|1:1258:990,0,9096:0|1     0/0:473,0:473:99:.:.:0/0:.:0,120,1800:0/0
2   1764    .   A   C   51892.85    PASS    AC=5;AF=0.625;AN=8;set=Intersection GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW      1|0:102,415:517:99:.,.,.,.,.:1.0:1|0:1020:12332,0,2817:1|0      1/1:0,356:356:99:.:.:1/1:.:12587,1069,0:1/1     1|0:65,301:366:99:.,.,.,.,.:1.0:1|0:1258:9337,0,1279:1|0    0/1:281,353:634:99:.:.:0/1:.:10325,0,7548:0/1
2   1921    .   T   C   4465.03 PASS    AC=0;AF=0.00;AN=6;set=Intersection  GT:AD:DP:GQ:PG:PL:PW    0/0:1,0:1:3:0/0:0,3,35:0/0  ./.:0,0:0:.:./.:0,0,0:./.   0/0:1,0:1:3:0/0:0,3,39:0/0  0/0:2,0:2:6:0/0:0,6,80:0/0

Below is the python script I wrote:

vcf_file = vcf.Reader(open('RNAseq.phased_variants.Final_sub.vcf', 'r'))
    for record in vcf_file:
    contig = record.CHROM
    pos = record.POS
    ref_allele = record.REF
    alt_alleles = ",".join(map(str, (record.ALT[::])))
    genotype_2ms02g = record.genotype('2ms02g')['GT']
    phased_2ms02g = record.genotype('2ms02g')['PG']
    PI_2ms02g = record.genotype('2ms02g')['PI']
    call_2ms02g = record.genotype('2ms02g') # calling the record values of the sample 2ms02g
    gt_bases_2ms02g = call_2ms02g.gt_bases   #This reports the genotype as letters (nucleotide code)

Problem #1:

I can repeat the procedure for other samples as:

    genotype_2ms01e = record.genotype('2ms01e')['GT']
    phased_2ms01e = record.genotype('2ms01e')['PG']
    PI_2ms01e = record.genotype('2ms01e')['PI']
    call_2ms01e = record.genotype('2ms01e') 
    gt_bases_2ms01e = call_2ms01e.gt_bases

But, I want to call these values from all the samples and write it as a table. vcf file may have many samples and I want to call the values from all the samples without doing so for each sample separately.

Finally I want to write the variables as tables

    output = open("2ms02gVcf_to_table-Markov.txt", "a")
    output.write(contig + '\t' + str(pos) + '\t' + ref_allele + '\t' + str(alt_alleles) + '\t' + genotype_2ms01e + '\t' + phased_2ms01e +'\t' + str(PI_2ms01e) + '\t'+ gt_bases_2ms01e + '\t' + same for other samples........ +'\n')
    output.close()

Problem #2:

Not, all sites (or lines) have PG and PI values. I wanted to throw an exception error using (by adding below script) but its not working.

    try:
        phased_genotype = record.genotype('2ms02g')['PG']
        phase_block_index = record.genotype('2ms02g')['PI']
    except IndexError:
        phased_genotype = '.'
        phase_block_index = '.'

Thank you advance !

vcf python pyvcf FORMAT SAMPLE • 5.0k views
ADD COMMENT
1
Entering edit mode
8.0 years ago
Noushin N ▴ 600

Problem 1

It can be addressed by using python list comprehensions as follows:

If you are sure that all samples include GT field, you can use:

sample_genotypes = [record.genotype(sampleID)['GT'] for sampleID in sampleIDs]
print >>output, '\t'.join([contig, pos, ref_allele, ','.join(alt_alleles)] + sample_genotypes)

Problem 2

In the exception clause KeyError is more appropriate here.

ADD COMMENT
0
Entering edit mode

Upvoted, but don't use print >> output. Use output.write()
print >> is considered old, bad synthax and should be avoided.

Alternatively: print("hi there", file=output)

See also http://stackoverflow.com/questions/6159900/correct-way-to-write-line-to-file-in-python

ADD REPLY
0
Entering edit mode

Thanks for the tip! I wasn't aware that this syntax is discouraged.

ADD REPLY
0
Entering edit mode

@Noushin: I tried using KeyError but that didn't help. Can you look into this problem: PyVCF is giving 'AttributeError' when extracting values from FORMAT and SAMPLE column.

ADD REPLY
0
Entering edit mode

I can use the above code with PG and PI. So, if these fields are missing in the line how do I handle it. Also, I have unresolved reference (i.e NameError) for sampleIDs.

ADD REPLY
0
Entering edit mode

Have you defined your sampleIDs variable before referencing it?

Regarding PG and PI fields, here is a fix that can handle missing values:

sample_PG = [record.genotype(sampleID).get('PG') if 'PG' in record.genotype(sampleID) else '.'  for sampleID in sampleIDs]
sample_PI = [record.genotype(sampleID).get('PI') if 'PI' in record.genotype(sampleID) else '.'  for sampleID in sampleIDs]

ADD REPLY
0
Entering edit mode

But, the problem with NameError still exists. NO, I have not defined sampleIDs variable any where. I recently changed the structure of my code which handles the extraction of the fields in a better way but exception error still exists. Can you please look into this: PyVCF is giving 'AttributeError' when extracting values from FORMAT and SAMPLE column.

ADD REPLY
0
Entering edit mode

The way I refferenced the sampleIDs is this:

sampleIDs = record.samples[::]

Is something wrong with it?

Then I called: sample_genotypes = [record.genotype(sampleID)['GT'] for sampleID in sampleIDs]

ADD REPLY

Login before adding your answer.

Traffic: 2000 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6