I want to extract the GT
, PG
and PI
information from all the samples of a vcf. I have python script and reading throught pyVCF
module I can extract information from one sample but not from all.
And, I am looking for the solution scrictly in python using pyvcf. I can probably read the file (vcf) as txt and parse through each line but I am hoping that pyVCF
can be used (which I prefer).
Below is the structure of my vcf file:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 2ms01e 2ms02g 2ms03g 2ms04h
2 1738 . A G 4693.24 PASS AC=2;AF=0.250;AN=8;set=Intersection GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW 0|1:389,92:481:99:.,.,.,.,.:1.0:0|1:1020:1748,0,12243:0|1 0/0:318,0:318:99:.:.:0/0:.:0,120,1800:0/0 0|1:270,53:323:99:.,.,.,.,.:1.0:0|1:1258:990,0,9096:0|1 0/0:473,0:473:99:.:.:0/0:.:0,120,1800:0/0
2 1764 . A C 51892.85 PASS AC=5;AF=0.625;AN=8;set=Intersection GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW 1|0:102,415:517:99:.,.,.,.,.:1.0:1|0:1020:12332,0,2817:1|0 1/1:0,356:356:99:.:.:1/1:.:12587,1069,0:1/1 1|0:65,301:366:99:.,.,.,.,.:1.0:1|0:1258:9337,0,1279:1|0 0/1:281,353:634:99:.:.:0/1:.:10325,0,7548:0/1
2 1921 . T C 4465.03 PASS AC=0;AF=0.00;AN=6;set=Intersection GT:AD:DP:GQ:PG:PL:PW 0/0:1,0:1:3:0/0:0,3,35:0/0 ./.:0,0:0:.:./.:0,0,0:./. 0/0:1,0:1:3:0/0:0,3,39:0/0 0/0:2,0:2:6:0/0:0,6,80:0/0
Below is the python script I wrote:
vcf_file = vcf.Reader(open('RNAseq.phased_variants.Final_sub.vcf', 'r'))
for record in vcf_file:
contig = record.CHROM
pos = record.POS
ref_allele = record.REF
alt_alleles = ",".join(map(str, (record.ALT[::])))
genotype_2ms02g = record.genotype('2ms02g')['GT']
phased_2ms02g = record.genotype('2ms02g')['PG']
PI_2ms02g = record.genotype('2ms02g')['PI']
call_2ms02g = record.genotype('2ms02g') # calling the record values of the sample 2ms02g
gt_bases_2ms02g = call_2ms02g.gt_bases #This reports the genotype as letters (nucleotide code)
Problem #1:
I can repeat the procedure for other samples as:
genotype_2ms01e = record.genotype('2ms01e')['GT']
phased_2ms01e = record.genotype('2ms01e')['PG']
PI_2ms01e = record.genotype('2ms01e')['PI']
call_2ms01e = record.genotype('2ms01e')
gt_bases_2ms01e = call_2ms01e.gt_bases
But, I want to call these values from all the samples and write it as a table. vcf file may have many samples and I want to call the values from all the samples without doing so for each sample separately.
Finally I want to write the variables as tables
output = open("2ms02gVcf_to_table-Markov.txt", "a")
output.write(contig + '\t' + str(pos) + '\t' + ref_allele + '\t' + str(alt_alleles) + '\t' + genotype_2ms01e + '\t' + phased_2ms01e +'\t' + str(PI_2ms01e) + '\t'+ gt_bases_2ms01e + '\t' + same for other samples........ +'\n')
output.close()
Problem #2:
Not, all sites (or lines) have PG
and PI
values.
I wanted to throw an exception error
using (by adding below script) but its not working.
try:
phased_genotype = record.genotype('2ms02g')['PG']
phase_block_index = record.genotype('2ms02g')['PI']
except IndexError:
phased_genotype = '.'
phase_block_index = '.'
Thank you advance !
Upvoted, but don't use
print >> output
. Useoutput.write()
print >>
is considered old, bad synthax and should be avoided.Alternatively:
print("hi there", file=output)
See also http://stackoverflow.com/questions/6159900/correct-way-to-write-line-to-file-in-python
Thanks for the tip! I wasn't aware that this syntax is discouraged.
@Noushin: I tried using
KeyError
but that didn't help. Can you look into this problem: PyVCF is giving 'AttributeError' when extracting values from FORMAT and SAMPLE column.I can use the above code with
PG
andPI
. So, if these fields are missing in the line how do I handle it. Also, I haveunresolved reference
(i.eNameError
) forsampleIDs
.Have you defined your sampleIDs variable before referencing it?
Regarding
PG
andPI
fields, here is a fix that can handle missing values:But, the problem with
NameError
still exists. NO, I have not definedsampleIDs
variable any where. I recently changed the structure of my code which handles the extraction of thefields
in a better way but exception error still exists. Can you please look into this: PyVCF is giving 'AttributeError' when extracting values from FORMAT and SAMPLE column.The way I refferenced the
sampleIDs
is this:sampleIDs = record.samples[::]
Is something wrong with it?
Then I called:
sample_genotypes = [record.genotype(sampleID)['GT'] for sampleID in sampleIDs]