Is readMethod required? Here, I want to use variants in a vcf, so no reads are being used, I think? So do I need this parameter, or the readMapping_key_file.txt?
Could you be a little more specific on which PHG steps you have completed? Have you loaded both reference and assembly haplotypes (or haplotypes from WGS) to your database?
The imputation step creates a haplotype graph based on haplotypes stored in the graph. The graph code uses the haplotype methods provided to determine which haplotypes will be included in the graph. Once the graph is created the data is imputed using the specified impute type.
When the impute type is "vcf", you do not need a readMethod. But you do need the pangenomeHaplotypeMethod to be specified. This method(s) is used to create an index from the vcf file.
I'll have one of my colleagues who works more closely with the imputation code respond with any additional details. We will also work to update the documentation to make this clear.
All of the parameters in the config file with UNASSIGNED need to have actual values filled in. You have to assign a valid name and path for the vcf index file, but that will be created by the pipeline if it does not exist. Yes, you need a read method. The vcf calls are used to create read mappings. Essentially, each vcf site is treated as a set of reads with allele depth acting as read count (if countAlleleDepths = true). vcfDir is the directory where the vcf files that are to be processed are stored. In the config file, vcfFile is the file used to index the vcf positions and allele calls against the PHG. Once that index is created, it can be used to "map reads" and impute paths from multiple VCF files with the same positions. That step requires a key file containing the file names of those vcf files (not the directory) while the vcfDir is the directory they are in. The key file has to have two columns labeled flowcell_lane and filename. The vcf file names go in the filename column. This may seem like overkill if you only have a single vcf file to process, but it was designed to be able to create a single index to process multiple vcf files. One thing that can be confusing is that vcfFile is the full path, but the keyFile contains only file names, which are expected to be in vcfDir.
It looks like the FilterGVCFSingleFilePlugin has replaced the deprecated FilterGVCFPlugin, but we have neglected to update the documentation. I will work on that.
In the meantime, the FilterGVCFSingleFilePlugin takes as parameters the input file path, an output file path and a configfile. It uses parameters from the config file to run the "bcftools filter" command. You can specify the filter parameters in the configfile by adding an "exclusionString=" property. When this property is not present, the exclusion string is created based on other properties in the config file. The properties that may be included in the config file are:
The PHG wiki has been updated for FilterGVCFSingleFilePlugin. You can find the instructions if you go to the wiki, click on either the 1.0 or pre 1.0 instructions, go to Step 2, under the Create Haplotypes section, select the option for "FIlter GVCF, add to database". In the text, click on the link for FilterGVCFSingleFilePlugin.
Hi there, I have another question about imputation.
I built my DB with a single reference and gvcf files from short read mapping.
I got a partial vcf containing variants when attempting imputation using fastq files, however the job eventually failed with:
[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathsToVCFPlugin - PathsToVCFPlugin: processData: number of ranges: 932
[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathsToVCFPlugin - PathsToVCFPlugin: processData: number of taxa: 1
[pool-1-thread-1] DEBUG net.maizegenetics.plugindef.AbstractPlugin - Parent job is Cancelling
kotlinx.coroutines.JobCancellationException: Parent job is Cancelling; job=UndispatchedCoroutine{Cancelling}@4dac3dc2
Caused by: java.lang.IllegalStateException: Allele in genotype ACCACTGCCGGATCTGGAGGCCGGGCCACCGCCGCCGCCCCTG not in the variant context [A*, G]
I assume that I need to apply more filtering to my GVCF files to avoid this kind of ambiguous variant problem. I would like to see the documentation for the -FilterGVCFSingleFilePlugin but the file seems to have been removed from the wiki page.
Here's some of the output from the imputation step. It seems to find the path OK, but can't make the VCF?
[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathFinderForSingleTaxonNodes - Finding path for chromosome 10 using ViterbiOneGametePerNode
[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathFinderForSingleTaxonNodes - Finished processing reads for a sample: 36 ranges discarded out of
53.
[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathFinderForSingleTaxonNodes - 36 ranges had too few reads; 0 ranges had too many reads; 0 ranges
had all reads equal
[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathFinderForSingleTaxonNodes - Finding path for chromosome 11 using ViterbiOneGametePerNode
[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathFinderForSingleTaxonNodes - Finished processing reads for a sample: 57 ranges discarded out of
76.
[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathFinderForSingleTaxonNodes - 57 ranges had too few reads; 0 ranges had too many reads; 0 ranges
had all reads equal
[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathFinderForSingleTaxonNodes - Finding path for chromosome 12 using ViterbiOneGametePerNode
[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathFinderForSingleTaxonNodes - Finished processing reads for a sample: 70 ranges discarded out of
91.
[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathFinderForSingleTaxonNodes - 70 ranges had too few reads; 0 ranges had too many reads; 0 ranges
had all reads equal
[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.BestHaplotypePathPlugin - found path for 8401 in 62 ms
[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - first connection: dbName from config file = /phg/rc_small_db.db host: localHost u
ser: sqlite type: sqlite
[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - Database URL: jdbc:sqlite:/phg/rc_small_db.db
[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - Connected to database:
[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.ImportDiploidPathPlugin - importPathsFromDB: query: SELECT line_name, paths_data FROM paths, genot
ypes, methods WHERE paths.genoid=genotypes.genoid AND methods.method_id=paths.method_id AND methods.name IN ('rc8401x')
[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.ImportDiploidPathPlugin - importPathsFromDB: number of path list: 1
[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - Finished net.maizegenetics.pangenome.hapCalling.ImportDiploidPathPlugin: time: Feb 18, 2023
4:49:51
[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - Starting net.maizegenetics.pangenome.hapCalling.PathsToVCFPlugin: time: Feb 18, 2023 4:49:51
[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin -
PathsToVCFPlugin Parameters
outputFile: /phg/rc8401.vcf
refRangeFileVCF: null
referenceFasta: null
makeDiploid: true
positions: null
[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathsToVCFPlugin - PathsToVCFPlugin: processData: number of ranges: 932
[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.PathsToVCFPlugin - PathsToVCFPlugin: processData: number of taxa: 1
[pool-1-thread-1] DEBUG net.maizegenetics.plugindef.AbstractPlugin - Allele in genotype ACCACTGCCGGATCTGGAGGCCGGGCCACCGCCGCCGCCCCTG not in the variant context [A, G]
java.lang.IllegalStateException: Allele in genotype ACCACTGCCGGATCTGGAGGCCGGGCCACCGCCGCCGCCCCTG not in the variant context [A, G]
at htsjdk.variant.variantcontext.VariantContext$Validation.validateGenotypes(VariantContext.java:382)
at htsjdk.variant.variantcontext.VariantContext$Validation.access$200(VariantContext.java:323)
at htsjdk.variant.variantcontext.VariantContext$Validation$2.validate(VariantContext.java:331)
at htsjdk.variant.variantcontext.VariantContext.lambda$validate$0(VariantContext.java:1384)
at java.lang.Iterable.forEach(Iterable.java:75)
at htsjdk.variant.variantcontext.VariantContext.validate(VariantContext.java:1384)
at java.lang.Iterable.forEach(Iterable.java:75)
at htsjdk.variant.variantcontext.VariantContext.validate(VariantContext.java:1384)
at htsjdk.variant.variantcontext.VariantContext.<init>(VariantContext.java:489)
at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:647)
at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:638)
at net.maizegenetics.pangenome.hapCalling.PathsToVCFPlugin.createVariantContext$phg(PathsToVCFPlugin.kt:406)
at net.maizegenetics.pangenome.hapCalling.PathsToVCFPlugin.variantContexts(PathsToVCFPlugin.kt:555)
at net.maizegenetics.pangenome.hapCalling.PathsToVCFPlugin$infosByRange$2$1$1.invokeSuspend(PathsToVCFPlugin.kt:238)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)
[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin -
Dear Lynn,
Many thanks for responding. I will update further when I am at the machine next week.
Thanks again
Matt