Question

imputing haplotypes from variants in a VCF file - vcfIndexFile

0

Entering edit mode

21 months ago

matt.shenton ▴ 40

Hi there,

Trying to get my PHG imputation pipeline running.

In the configuration file example for "imputing haplotypes from variants in a VCF file", one parameter is indexFile=/phg/outputDir/vcfIndexFile

Where should I derive this vcfIndexFile?

Or, is it generated by the pipeline (it is located in the outputDir, so maybe?)

Also,

inputType=vcf
keyFile=/phg/readMapping_key_file.txt
readMethod=**UNASSIGNED**
vcfDir=/phg/inputDir/imputation/vcf/
countAlleleDepths=true

Is readMethod required? Here, I want to use variants in a vcf, so no reads are being used, I think? So do I need this parameter, or the readMapping_key_file.txt?

Sorry if I am misunderstanding obvious points.

Best wishes
Matt Shenton

vcf imputation PHG • 2.3k views

ADD COMMENT • link updated 21 months ago by matt.shenton • 0 • written 21 months ago by matt.shenton ▴ 40

score 1 · Answer 1 · 2023-02-09

HI Matt -

Could you be a little more specific on which PHG steps you have completed? Have you loaded both reference and assembly haplotypes (or haplotypes from WGS) to your database?

The imputation step creates a haplotype graph based on haplotypes stored in the graph. The graph code uses the haplotype methods provided to determine which haplotypes will be included in the graph. Once the graph is created the data is imputed using the specified impute type.

When the impute type is "vcf", you do not need a readMethod. But you do need the pangenomeHaplotypeMethod to be specified. This method(s) is used to create an index from the vcf file.

I'll have one of my colleagues who works more closely with the imputation code respond with any additional details. We will also work to update the documentation to make this clear.

Lynn

score 1 · Answer 2 · 2023-02-09

Hi Matt,

All of the parameters in the config file with UNASSIGNED need to have actual values filled in. You have to assign a valid name and path for the vcf index file, but that will be created by the pipeline if it does not exist. Yes, you need a read method. The vcf calls are used to create read mappings. Essentially, each vcf site is treated as a set of reads with allele depth acting as read count (if countAlleleDepths = true). vcfDir is the directory where the vcf files that are to be processed are stored. In the config file, vcfFile is the file used to index the vcf positions and allele calls against the PHG. Once that index is created, it can be used to "map reads" and impute paths from multiple VCF files with the same positions. That step requires a key file containing the file names of those vcf files (not the directory) while the vcfDir is the directory they are in. The key file has to have two columns labeled flowcell_lane and filename. The vcf file names go in the filename column. This may seem like overkill if you only have a single vcf file to process, but it was designed to be able to create a single index to process multiple vcf files. One thing that can be confusing is that vcfFile is the full path, but the keyFile contains only file names, which are expected to be in vcfDir.

Peter

score 1 · Answer 3 · 2023-02-16

1

Entering edit mode

21 months ago

lcj34 ▴ 420

Hi Matt -

It looks like the FilterGVCFSingleFilePlugin has replaced the deprecated FilterGVCFPlugin, but we have neglected to update the documentation. I will work on that.

In the meantime, the FilterGVCFSingleFilePlugin takes as parameters the input file path, an output file path and a configfile. It uses parameters from the config file to run the "bcftools filter" command. You can specify the filter parameters in the configfile by adding an "exclusionString=" property. When this property is not present, the exclusion string is created based on other properties in the config file. The properties that may be included in the config file are:

DP_poisson_min

DP_poisson_max

DP_min

DP_max

GQ_min

GQ_max

QUAL_min

QUAL_max

filterHets (true or false, t or f)

Lynn

ADD COMMENT • link 21 months ago by lcj34 ▴ 420

1

Entering edit mode

The PHG wiki has been updated for FilterGVCFSingleFilePlugin. You can find the instructions if you go to the wiki, click on either the 1.0 or pre 1.0 instructions, go to Step 2, under the Create Haplotypes section, select the option for "FIlter GVCF, add to database". In the text, click on the link for FilterGVCFSingleFilePlugin.

https://bitbucket.org/bucklerlab/practicalhaplotypegraph/wiki/UserInstructions/CreatePHG_step2_FilterGVCFSingleFilePlugin.md

ADD REPLY • link 21 months ago by lcj34 ▴ 420

0

Entering edit mode

Thanks a lot!

ADD REPLY • link 21 months ago by matt.shenton • 0

score 0 · Answer 4 · 2023-02-16

0

Entering edit mode

21 months ago

matt.shenton • 0

Hi there, I have another question about imputation.

I built my DB with a single reference and gvcf files from short read mapping.

I got a partial vcf containing variants when attempting imputation using fastq files, however the job eventually failed with:

[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathsToVCFPlugin - PathsToVCFPlugin: processData: number of ranges: 932 [pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathsToVCFPlugin - PathsToVCFPlugin: processData: number of taxa: 1 [pool-1-thread-1] DEBUG net.maizegenetics.plugindef.AbstractPlugin - Parent job is Cancelling kotlinx.coroutines.JobCancellationException: Parent job is Cancelling; job=UndispatchedCoroutine{Cancelling}@4dac3dc2 Caused by: java.lang.IllegalStateException: Allele in genotype ACCACTGCCGGATCTGGAGGCCGGGCCACCGCCGCCGCCCCTG not in the variant context [A*, G]

I assume that I need to apply more filtering to my GVCF files to avoid this kind of ambiguous variant problem. I would like to see the documentation for the -FilterGVCFSingleFilePlugin but the file seems to have been removed from the wiki page.

Many thanks for your help

Matt

ADD COMMENT • link 21 months ago by matt.shenton • 0

0

Entering edit mode

Hi again,

I used -FilterGVCFSingleFilePlugin with mapQ=48 DP_poisson_min=0.01 DP_poisson_max=0.99 GQ_min=50 filterHets=t

before building my test DB. After buidling consensus at mxdiv 0.001, I'm still getting an error message when outputting an imputed vcf.

java.lang.IllegalStateException: Allele in genotype ACCACTGCCGGATCTGGAGGCCGGGCCACCGCCGCCGCCCCTG not in the variant context [A*, G]

I'm using data with 10 gvcf files, there's only one position with this genotype:

(base) mshenton@2021-20-00029:~/analysis/PHGsingularity/testrc/inputDir/loadDB/gvcf$ bcftools view -r 01:18831899 variants_rc8403.filtered_num.g.vcf.gz | grep ^01 01 18831899 . A G,<NON_REF> 793.77 PASS DP=17;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1,0;RAW_MQandDP=61200,17 GT:AD:DP:GQ:PGT:PID:PL:SB 1/1:0,17,0:17:57:0|1:18831888_C_T:822,57,0,822,57,822:0,0,12,5

(base) mshenton@2021-20-00029:~/analysis/PHGsingularity/testrc/inputDir/loadDB/gvcf$ bcftools view -r 01:18831899 variants_rc8405.filtered_num.g.vcf.gz | grep ^01 01 18831899 . A G,<NON_REF> 620.77 PASS DP=17;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1,0;RAW_MQandDP=60001,17 GT:AD:DP:GQ:PL:SB 1/1:0,17,0:17:50:649,50,0,649,50,649:0,0,11,6

(base) mshenton@2021-20-00029:~/analysis/PHGsingularity/testrc/inputDir/loadDB/gvcf$ bcftools view --regions-file ../../reference/1000.bed variants_rc8321.filtered_num.g.vcf.gz | grep ACCACTGCCGGATCTGGAGGCCGGGCCACCGCCGCCGCCCCTG 01 18831899 . A ACCACTGCCGGATCTGGAGGCCGGGCCACCGCCGCCGCCCCTG,<NON_REF> 1267.73 PASS DP=18;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1,0;RAW_MQandDP=63136,18 GT:AD:DP:GQ:PGT:PID:PL:SB 1/1:0,18,0:18:87:0|1:18831895_C_CCCCGCGCCTCCCCT:1305,87,0,1305,87,1305:0,0,16,2

(base) mshenton@2021-20-00029:~/analysis/PHGsingularity/testrc/inputDir/loadDB/gvcf$ bcftools view --regions-file ../../reference/1000.bed variants_rc8322.filtered_num.g.vcf.gz | grep ACCACTGCCGGATCTGGAGGCCGGGCCACCGCCGCCGCCCCTG 01 18831899 . A ACCACTGCCGGATCTGGAGGCCGGGCCACCGCCGCCGCCCCTG,<NON_REF> 1077.73 PASS DP=22;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1,0;RAW_MQandDP=72542,22 GT:AD:DP:GQ:PGT:PID:PL:SB 1/1:0,22,0:22:75:0|1:18831895_C_CCCCGCGCCTCCCCT:1115,75,0,1115,75,1116:0,0,12,10

(base) mshenton@2021-20-00029:~/analysis/PHGsingularity/testrc/inputDir/loadDB/gvcf$ bcftools view --regions-file ../../reference/1000.bed variants_rc8323.filtered_num.g.vcf.gz | grep ACCACTGCCGGATCTGGAGGCCGGGCCACCGCCGCCGCCCCTG 01 18831899 . A ACCACTGCCGGATCTGGAGGCCGGGCCACCGCCGCCGCCCCTG,<NON_REF> 951.73 PASS DP=17;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1,0;RAW_MQandDP=60201,17 GT:AD:DP:GQ:PGT:PID:PL:SB 1/1:0,16,0:16:66:0|1:18831895_C_CCCCGCGCCTCCCCT:989,66,0,990,66,990:0,0,6,10

(base) mshenton@2021-20-00029:~/analysis/PHGsingularity/testrc/inputDir/loadDB/gvcf$ bcftools view --regions-file ../../reference/1000.bed variants_rc8326.filtered_num.g.vcf.gz | grep ACCACTGCCGGATCTGGAGGCCGGGCCACCGCCGCCGCCCCTG 01 18831899 . A *,ACCACTGCCGGATCTGGAGGCCGGGCCACCGCCGCCGCCCCTG,<NON_REF> 716.73 PASS DP=17;ExcessHet=3.0103;MLEAC=0,2,0;MLEAF=0,1,0;RAW_MQandDP=57438,17 GT:AD:DP:GQ:PL:SB 2/2:0,0,15,0:15:53:754,759,834,53,56,0,754,760,54,755:0,0,7,5

(base) mshenton@2021-20-00029:~/analysis/PHGsingularity/testrc/inputDir/loadDB/gvcf$ bcftools view --regions-file ../../reference/1000.bed variants_rc8328.filtered_num.g.vcf.gz | grep ACCACTGCCGGATCTGGAGGCCGGGCCACCGCCGCCGCCCCTG 01 18831899 . A G,ACCACTGCCGGATCTGGAGGCCGGGCCACCGCCGCCGCCCCTG,<NON_REF> 1131.73 PASS DP=19;ExcessHet=3.0103;MLEAC=0,2,0;MLEAF=0,1,0;RAW_MQandDP=67201,19 GT:AD:DP:GQ:PL:SB 2/2:0,0,19,0:19:81:1169,1047,1101,81,86,0,1118,1084,86,1136:0,0,15,4

I guess the problem is causd by the genotype 01 18831899 . A *,ACCACTGCCGGATCTGGAGGCCGGGCCACCGCCGCCGCCCCTG,<NON_REF> 2/2 in sample rc8328

Should the pipelines be able to handle this kind of variant? (according to the GVCF information, * means a spanning deletion, I think)?

GVCFs were created using GATK's haplotypeCaller

ADD REPLY • link 21 months ago by matt.shenton • 0

0

Entering edit mode

Here's some of the output from the imputation step. It seems to find the path OK, but can't make the VCF?

[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathFinderForSingleTaxonNodes - Finding path for chromosome 10 using ViterbiOneGametePerNode [pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathFinderForSingleTaxonNodes - Finished processing reads for a sample: 36 ranges discarded out of 53. [pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathFinderForSingleTaxonNodes - 36 ranges had too few reads; 0 ranges had too many reads; 0 ranges had all reads equal [pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathFinderForSingleTaxonNodes - Finding path for chromosome 11 using ViterbiOneGametePerNode [pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathFinderForSingleTaxonNodes - Finished processing reads for a sample: 57 ranges discarded out of 76. [pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathFinderForSingleTaxonNodes - 57 ranges had too few reads; 0 ranges had too many reads; 0 ranges had all reads equal [pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathFinderForSingleTaxonNodes - Finding path for chromosome 12 using ViterbiOneGametePerNode [pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathFinderForSingleTaxonNodes - Finished processing reads for a sample: 70 ranges discarded out of 91. [pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathFinderForSingleTaxonNodes - 70 ranges had too few reads; 0 ranges had too many reads; 0 ranges had all reads equal [pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.BestHaplotypePathPlugin - found path for 8401 in 62 ms

[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - first connection: dbName from config file = /phg/rc_small_db.db host: localHost u ser: sqlite type: sqlite [pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - Database URL: jdbc:sqlite:/phg/rc_small_db.db [pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - Connected to database:

[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.ImportDiploidPathPlugin - importPathsFromDB: query: SELECT line_name, paths_data FROM paths, genot ypes, methods WHERE paths.genoid=genotypes.genoid AND methods.method_id=paths.method_id AND methods.name IN ('rc8401x') [pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.ImportDiploidPathPlugin - importPathsFromDB: number of path list: 1 [pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - Finished net.maizegenetics.pangenome.hapCalling.ImportDiploidPathPlugin: time: Feb 18, 2023 4:49:51 [pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - Starting net.maizegenetics.pangenome.hapCalling.PathsToVCFPlugin: time: Feb 18, 2023 4:49:51 [pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - PathsToVCFPlugin Parameters outputFile: /phg/rc8401.vcf refRangeFileVCF: null referenceFasta: null makeDiploid: true positions: null

[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathsToVCFPlugin - PathsToVCFPlugin: processData: number of ranges: 932 [pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathsToVCFPlugin - PathsToVCFPlugin: processData: number of taxa: 1 [pool-1-thread-1] DEBUG net.maizegenetics.plugindef.AbstractPlugin - Allele in genotype ACCACTGCCGGATCTGGAGGCCGGGCCACCGCCGCCGCCCCTG not in the variant context [A, G] java.lang.IllegalStateException: Allele in genotype ACCACTGCCGGATCTGGAGGCCGGGCCACCGCCGCCGCCCCTG not in the variant context [A, G] at htsjdk.variant.variantcontext.VariantContext$Validation.validateGenotypes(VariantContext.java:382) at htsjdk.variant.variantcontext.VariantContext$Validation.access$200(VariantContext.java:323) at htsjdk.variant.variantcontext.VariantContext$Validation$2.validate(VariantContext.java:331) at htsjdk.variant.variantcontext.VariantContext.lambda$validate$0(VariantContext.java:1384) at java.lang.Iterable.forEach(Iterable.java:75) at htsjdk.variant.variantcontext.VariantContext.validate(VariantContext.java:1384) at java.lang.Iterable.forEach(Iterable.java:75) at htsjdk.variant.variantcontext.VariantContext.validate(VariantContext.java:1384) at htsjdk.variant.variantcontext.VariantContext.<init>(VariantContext.java:489) at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:647) at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:638) at net.maizegenetics.pangenome.hapCalling.PathsToVCFPlugin.createVariantContext$phg(PathsToVCFPlugin.kt:406) at net.maizegenetics.pangenome.hapCalling.PathsToVCFPlugin.variantContexts(PathsToVCFPlugin.kt:555) at net.maizegenetics.pangenome.hapCalling.PathsToVCFPlugin$infosByRange$2$1$1.invokeSuspend(PathsToVCFPlugin.kt:238) at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33) at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106) at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571) at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750) at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678) at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665) [pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin -

ADD REPLY • link 21 months ago by matt.shenton • 0