Question

Calling star alleles with PGx-POP on my VCF file

0

Entering edit mode

2.4 years ago

Vanish007 ▴ 50

Hi everyone,

I am currently trying to have star alleles called on my phased vcf file. So far I have a vcf file with a few thousand samples that has been phased with BEAGLE 5.2.

The next step is to run PGx-POP to obtain star allele calls, however when I run it, I mainly have only *1's called in the haplotype 1 and haplotype 2 any phenotypes with either "Not available" or "Normal function".

This is what I used to run the script:

python 3.6 PGxPOP.py --vcf MyPhasedVCF --phased -o PGxPOP_output.txt

I even looked up a table of SNPs associated with star alleles and ran a manual grep for a SNP (rs1057910) associated with a *2 allele with gene CYP2C9 to make sure it was in my dataset and the genotype definitely showed up in more than one sample.

Any idea what I could be doing wrong? Is my vcf phased incorrectly? Any troubleshooting ideas I could try?

Thank you kindly!

vcf alleles pgx-pop star • 2.9k views

ADD COMMENT • link updated 2.3 years ago by sbstevenlee ▴ 480 • written 2.4 years ago by Vanish007 ▴ 50

score 1 · Answer 1 · 2022-07-26

1

Entering edit mode

2.4 years ago

sbstevenlee ▴ 480

Hi, it's very nice to finally bump into a Biostars post about star alleles! I rarely see posts on pharmacogeneics here. That said, I haven't used PGxPOP so I can't comment on your question specifically.

However, I have developed a star allele-calling tool, PyPGx, so you may want to check it out (I'm also the developer of another tool called Stargazer). You can use PyPGx to directly call star alleles using phased VCF (see this NGS pipeline section). A huge benefit of using PyPGx over PGxPOP is that PyPGx supports detection of structural variation using a machine learning-based approach (e.g. gene deletion, duplication, hybrid in CYP2D6, etc.). You can read more on this from the Structural variation detection section. Of course, this requires more than just VCF (i.e. requires BAM files to be more specific).

In any case, I'd be more than happy to help if you have any questions regarding PyPGx or other general pharmacoegenetics-related matters.

ADD COMMENT • link 2.4 years ago by sbstevenlee ▴ 480

0

Entering edit mode

Hi Dr. Lee,

Thank you very much for reaching out! I'm glad I bumped into you on here since I actually looked into Stargazer as an alternative, but was not sure since the documentation mentioned that it uses GRCh37 and I also wasn't sure about VCF files that were phased. I currently do not have access to the original BAM files so I only have the two VCFs. I have one vcf file that is on a GRCh37 build and another that has GRCh38 so it wouldn't work for that. I am currently looking at PyPGx though and reading through the documentation. I will try and run it on my files first thing tomorrow. Greatly appreciate your help on this as well, thanks!

ADD REPLY • link 2.4 years ago by Vanish007 ▴ 50

0

Entering edit mode

Yes, PyPGx natively supports both GRCh37 and GRCh38, so I got you covered :) Please give it a try and let me know if you need any help either here or in the GitHub issue. I'm pretty quick at response.

ADD REPLY • link 2.4 years ago by sbstevenlee ▴ 480

0

Entering edit mode

Hi Dr. Lee, I wanted to follow up and let you know that PyPGx did work the way I needed - what a great tool and well documented, thank you! :)

In terms of the genes that have structural variation, if I don't have access to the BAM files there's no other way to obtain the CovFrame and SampleStatistics, correct? Does PyPGx just run through as normally? I just wanted to make a note in case I was asked. Thanks again!

ADD REPLY • link 2.4 years ago by Vanish007 ▴ 50

1

Entering edit mode

Glad to hear it worked!

In order to perform SV detection, then yes, you would need BAM files to create CovFrame[DepthOfCoverage] and SampleTable[Statistics]. If you haven't already, I highly recommend that you complete this tutorial.

Even if the target gene has SV you can still run PyPGx without CovFrame[DepthOfCoverage] and SampleTable[Statistics] and it will work fine. In this case, it will be assumed that all of the samples don't have SV. This is actually a valid assumption for certain genes such as CYP4F2. However, as you would expect, the assumption breaks down quickly for genes like CYP2D6, GSTM1, UGT2B17 that are known to have a high rate of SV events.

Alternatively, if VCF is the only thing you got, and your VCF contains the AD field (i.e. allelic depth), you could try to manually call simple copy number variation (CNV) from PyPGx's allele fraction profile (i.e. without copy number data). PyPGx will output allele fraction profile even in the absence of copy number data. You can find a copious amount of examples in the PyPGx website. For example, below is a gene duplication event from CYP2D6. You don't need copy number data (left) to figure out that there is gene duplication; the allele fraction profile (right) shows a clear allelic imbalance of 1:2.

Hope this helps. Let me know if you have additional questions.

ADD REPLY • link 2.4 years ago by sbstevenlee ▴ 480

0

Entering edit mode

Thanks Dr. Lee, that was quite helpful. I checked gene CYP3A5 and that worked beautifully, but one thing I noticed was that certain genes (ie CYP1B1, CYP2R1, CYP17A1, and a few others) have given phenotypes as "Indeterminate" across all samples. Does this mean something is wrong with the phasing or something else?

Thanks again for all your help!

ADD REPLY • link 2.4 years ago by Vanish007 ▴ 50

1

Entering edit mode

Great question! That's because those genes don't have defined phenotypes yet (e.g. from CPIC). Please check the summary table in the Gene section of the docs. There, you will find which genes have defined phenotypes.

ADD REPLY • link 2.4 years ago by sbstevenlee ▴ 480

0

Entering edit mode

Hi Dr. Lee,

Thanks again for some great advice! I've been making some figures with the data obtained from PyPGx and I noticed something not quite right. Apparently in some of the genes called, variant genotypes were called more often than the normal. For example in gene CYP2C19, the genotype for 1/35 (Intermediate Metabolizer) was called in more than 6,000 cases whereas the 1/1 genotype (Normal Metabolizer) was called for only a few hundred - I figured this should be the opposite. There were a couple of genes where this occurred where a different genotype is called in more cases than 1/1. Is this a correct call or does a more stringent parameter need to be run to removed or is something else wrong?

Thank you!

ADD REPLY • link 2.3 years ago by Vanish007 ▴ 50

1

Entering edit mode

Another great question! The short answer is, yes, for some genes it's certainly possible (or even typical I would say for genes like CYP2C19 and CYP3A5 in particular) to observe non-Normal Metabolizer at a higher rate than Normal Metabolizer. Of curse, the exact phenotype frequency will depend on which population you are looking at (e.g. East Asian). As an example, below I will show the phenotype distribution of CYP2C19 in the 1000 Genomes Project predicted by PyPGx:

CYP2C19

I hope this helps!

P.S. You will get my response a lot quicker if you directly send me an email (sbstevenlee@gmail.com) or create an issue/discussion at the PyPGx GitHub repository (I don't get any notifications from the Biostars website until I visit it).

ADD REPLY • link 2.3 years ago by sbstevenlee ▴ 480