Question

PHG ImputePipelinePlugin with VCF as input is incomplete?

3.8 years ago

dovi ▴ 60

Hi all, I have a problem running the PHG ImputePipelinePlugin with inputType=vcf and imputeTarget=diploidPathToVCF. I'm using PHG version 0.026.

To be exact, the pipeline runs through SNPToReadMappingPlugin (run within the ImputePipelinePlugin workflow), but at DiploidPathPlugin I get the error "keyFile_pathKeyFile.txt doesn't exist". I believe this file should be created automatically somewhere before DiploidPathPlugin, and I understand that I am not expected to create it manually, since I don't know the readMappingIds (it is created automatically when inputType=fastq). I also noticed that after SNPToReadMappingPlugin completes I get files named _readMappings.txt, but they are not imported into the database, and no error is reported at that step.

Here is the log I get from ImputePipelinePlugin (the output of the last plugins in the workflow):

[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - Starting net.maizegenetics.pangenome.hapCalling.SNPToReadMappingPlugin: time: Feb 8, 2021 12:03:25
[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - 
SNPToReadMappingPlugin Parameters
keyFile: /phg/keyFile.txt
indexFile: /phg/outputDir/VCF_index_file.txt
vcfDir: /phg/inputDir/imputation/vcf/
outputDir: /phg/outputDir/
methodName: DIPLOID_VCF_HAP_COUNT_METHOD_mxDiv.0
methodDescription: test_description
countAlleleDepths: true

[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.SNPToReadMappingPlugin - Processing record: A10,skimseqsFlowcell,A10.vcf,skimseqs
[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.SNPToReadMappingPlugin - Processing record: A11,skimseqsFlowcell,A11.vcf,skimseqs
[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.SNPToReadMappingPlugin - Processing record: A1,skimseqsFlowcell,A1.vcf,skimseqs
[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - Finished net.maizegenetics.pangenome.hapCalling.SNPToReadMappingPlugin: time: Feb 8, 2021 12:03:38
[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - Starting net.maizegenetics.pangenome.api.HaplotypeGraphBuilderPlugin: time: Feb 8, 2021 12:03:38
[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - 
HaplotypeGraphBuilderPlugin Parameters
configFile: /phg/config.txt
methods: CONSENSUS_mxDiv.0
includeSequences: false
includeVariantContexts: false
haplotypeIds: null
chromosomes: null

[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - first connection: dbName from config file = /phg/TestDBVCF.db host: localHost user: sqlite type: sqlite
[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - Database URL: jdbc:sqlite:/phg/TestDBVCF.db
[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - Connected to database:  

[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - referenceRangesAsMap: query statement: select reference_ranges.ref_range_id, chrom, range_start, range_end, methods.name from reference_ranges  INNER JOIN ref_range_ref_range_method on ref_range_ref_range_method.ref_range_id=reference_ranges.ref_range_id  INNER JOIN methods on ref_range_ref_range_method.method_id = methods.method_id  AND methods.method_type = 7 ORDER BY reference_ranges.ref_range_id
methods size: 1
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - referenceRangesAsMap: number of reference ranges: 2411
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - referenceRangesAsMap: time: 0.01092466 secs.
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - taxaListMap: query statement: SELECT gamete_haplotypes.gamete_grp_id, genotypes.line_name FROM gamete_haplotypes INNER JOIN gametes ON gamete_haplotypes.gameteid = gametes.gameteid INNER JOIN genotypes on gametes.genoid = genotypes.genoid ORDER BY gamete_haplotypes.gamete_grp_id;
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - taxaListMap: number of taxa lists: 3
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - taxaListMap: time: 2.19438E-4 secs.
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - createHaplotypeNodes: haplotype method: CONSENSUS_mxDiv.0 range group method: null
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - createHaplotypeNodes: query statement: SELECT haplotypes_id, gamete_grp_id, haplotypes.ref_range_id, asm_contig, asm_start_coordinate, asm_end_coordinate, genome_file_id, seq_hash, seq_len FROM haplotypes WHERE method_id = 5;
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - addNodes: number of nodes: 4800
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - addNodes: number of reference ranges: 2400
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - createHaplotypeNodes: time: 0.480757947 secs.
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.HaplotypeGraph - Created graph edges: created when requested  number of nodes: 4800  number of reference ranges: 2400
[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - Finished net.maizegenetics.pangenome.api.HaplotypeGraphBuilderPlugin: time: Feb 8, 2021 12:03:38
[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - Starting net.maizegenetics.pangenome.hapCalling.DiploidPathPlugin: time: Feb 8, 2021 12:03:38
[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - 
DiploidPathPlugin Parameters
keyFile: /phg/keyFile_pathKeyFile.txt
readMethod: DIPLOID_VCF_HAP_COUNT_METHOD_mxDiv.0
pathMethod: DIPLOID_VCF_PATH_METHOD_mxDiv.0
pathMethodDescription: null
minTaxa: 1
probCorrect: 0.99
minTransition: 0.001
maxHap: 4
minReads: 0
removeEqual: false
maxReadsKB: 100
splitNodes: false
splitProb: 0.99
numThreads: 8

[pool-1-thread-1] ERROR net.maizegenetics.plugindef.AbstractPlugin - -keyFile: /phg/keyFile_pathKeyFile.txt doesn't exist

[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - 
DiploidPathPlugin Description...
DiploidPathPlugin finds the best path through all ordered pair of nodes in a HaplotypeGraph given a key file with sample names and read mapping ids.

Usage:
DiploidPathPlugin <options>
-keyFile <Key File> : KeyFile file name.  Must be a tab separated file using the following headers:
SampleName  ReadMappingIds  LikelyParents
ReadMappingIds and LikelyParents need to be comma separated for multiple values (required)
-readMethod <Read Mapping Method> : The name of the read mapping method in the PHG DB (required)
-pathMethod <Path Method> : The name of the path method used to write the results to the PHG DB (required)
-pathMethodDescription <Path Method Description> : An additional description that will be stored with the path method name, if desired.
-minTaxa <Min Taxa> : minimum number of taxa per anchor reference range. Ranges with fewer taxa will not be included in the output node list. (Default: 20)
-probCorrect <Probability Correct> : The probability that a read mapped to the correct haplotypes (Default: 0.99)
-minTransition <Min Transition Probability> : The minimum transition probability between a pair of nodes in adjacent reference ranges. (Default: 0.001)
-maxHap <Maximum Number of Haplotypes> : Any range with more than maxHap haplotypes will not be included in the path. (Default: 11)
-minReads <Minimum Read Number> : Any range with fewer than minReads will not be included in the path. (Default: 1)
-removeEqual <true | false> : Any range for which all haplotypes have the same number of read counts will not be included in the path. (Default: false)
-maxReadsKB <Maximum Reads per KB> : Any range with more than maxReadsKB reads per kilobase of sequence will not be included in the path. (Default: 1000)
-splitNodes <true | false> : If splitTaxa is true, then each taxon will be assigned its own node in the graph prior to path finding. (Default: false)
-splitProb <Split Probability> : When splitTaxa is true, the transition probability for moving between nodes of the same taxon will be set to this number.  (Default: 0.99)
-numThreads <Num Threads> : Number of threads used to find paths.  The path finding will subtract 2 from this number to have the number of worker threads.  It leaves 1 thread for IO to the DB and 1 thread for the Operating System. (Default: 3)

The _readMappings.txt files look fine to me (here just a head):

#line_name=A10
#file_group_name=skimseqsFlowcell
#method_name=DIPLOID_VCF_HAP_COUNT_METHOD_mxDiv.0
#method_description=test_description
HapIds  count
11805,11806 9
11805   14
11806   8
10448   20
10447   62
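
For anyone who wants to inspect such a file programmatically, here is a minimal Python sketch based only on the head shown above. It assumes the header lines have the form #key=value and that the data columns are separated by whitespace (probably tabs in the real file); the function name parse_read_mappings is hypothetical and is not part of PHG.

```python
import io


def parse_read_mappings(handle):
    """Parse a _readMappings.txt file into (metadata, counts).

    metadata maps header keys (e.g. "line_name") to their values.
    counts maps a tuple of haplotype ids to the read count for that set.
    """
    metadata = {}
    counts = {}
    for raw in handle:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Header lines look like "#line_name=A10".
            key, _, value = line[1:].partition("=")
            metadata[key] = value
            continue
        if line.startswith("HapIds"):
            continue  # column header row
        # Assumption: hap ids are comma separated with no embedded spaces.
        hap_ids, count = line.split()
        counts[tuple(int(h) for h in hap_ids.split(","))] = int(count)
    return metadata, counts


# Sample content copied from the head of the file shown in the post.
sample = (
    "#line_name=A10\n"
    "#file_group_name=skimseqsFlowcell\n"
    "#method_name=DIPLOID_VCF_HAP_COUNT_METHOD_mxDiv.0\n"
    "#method_description=test_description\n"
    "HapIds\tcount\n"
    "11805,11806\t9\n"
    "11805\t14\n"
    "11806\t8\n"
)
metadata, counts = parse_read_mappings(io.StringIO(sample))
print(metadata["method_name"], len(counts))
```

For a real file, pass an open file handle instead of the io.StringIO stand-in.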

But if I look into the database, no new genotypes have been added (the log shows taxaListMap: number of taxa lists: 3; those are the reference plus the two samples that were used to create the DB), and there is no method named "DIPLOID_VCF_HAP_COUNT_METHOD_mxDiv.0" either. It's as if the readMappings files were created but nothing was imported into the database, and I don't see how to proceed from here.
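
A check of this kind can be sketched in Python with the standard sqlite3 module. The table and column names (methods.name, methods.method_id, genotypes.line_name) are taken from the SQL statements in the log above; the helper names and the in-memory stand-in database with its sample rows are hypothetical and only there to make the example runnable.

```python
import sqlite3


def method_is_loaded(conn, method_name):
    """Return True if the methods table has a row with this name."""
    row = conn.execute(
        "SELECT method_id FROM methods WHERE name = ?", (method_name,)
    ).fetchone()
    return row is not None


def line_names(conn):
    """Return all genotype line names stored in the database."""
    rows = conn.execute("SELECT line_name FROM genotypes ORDER BY line_name")
    return [r[0] for r in rows]


# Stand-in database. For the real one, connect to the file from the log,
# e.g. sqlite3.connect("/phg/TestDBVCF.db").
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE methods (method_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE genotypes (genoid INTEGER PRIMARY KEY, line_name TEXT)")
conn.execute("INSERT INTO methods (name) VALUES ('CONSENSUS_mxDiv.0')")
conn.executemany(
    "INSERT INTO genotypes (line_name) VALUES (?)",
    [("Ref",), ("LineA",), ("LineB",)],  # hypothetical line names
)

print(method_is_loaded(conn, "DIPLOID_VCF_HAP_COUNT_METHOD_mxDiv.0"))
print(line_names(conn))
```

If the read mappings had been loaded, the first call would be expected to return True for the read mapping method name.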

(Alternatively, I tried to follow the documented workflow at https://bitbucket.org/bucklerlab/practicalhaplotypegraph/wiki/UserInstructions/ImputeWithPHG_VCF, but after SNPToReadMappingPlugin it says to run FastqToMappingPlugin. I don't have the FASTQ files, which is the reason for using inputType=vcf in the first place, so I cannot continue with that workflow.)

Thanks!

phg tassel • 996 views

updated 3.8 years ago by pjb39 ▴ 220 • written 3.8 years ago by dovi ▴ 60

score 2 · Accepted Answer · 2021-02-08

The developer who wrote the Impute pipeline thought that SNPToReadMappingPlugin would write directly to the database, but it does not. Instead, it writes the read mappings to files, as you noticed. We discovered this only a short time ago. Code that allows SNPToReadMappingPlugin to write directly to the database is currently in testing and will be in a future release fairly soon, but I do not know exactly when. In the meantime, you can run ImportReadMappingToDBPlugin to load the read mapping files into your PHG database. These are the parameters used by that plugin:

Usage:

ImportReadMappingToDBPlugin <options>

-configFileForFinalDB <Final DB Config File> : File containing lines with data for host=, user=, password= and DB=, DBtype= used for db connection (required)

-loadFromDB <true | false> : Load from multiple DBs if this is set to true. If false, we will treat the files in the directory to be raw flat files (Default: true)

-inputMappingDir <Input Mapping Dir> : Directory holding the input files. This can either be DB config files or can be flat read mapping files (required)

-readMappingMethod <Read Mapping Method> : Read Mapping Method in the DB. Must be consistent across all DBs/files (required)

-outputKeyFile <Output Key File> : Output Key File that can be used in path finding. This is optional. If no file path is supplied, no file will be written.

Note that if you supply a value for outputKeyFile, a key file will be written with the read mapping ids, which is needed for path finding. You will not be using the loadFromDB mode: since its default is true, set loadFromDB to false so that the files in the input directory are treated as flat read mapping files. The loadFromDB mode exists so that a user can process reads on multiple servers, store the read mappings in multiple copies of the database, and then consolidate them in a single database.
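
As a rough illustration, the plugin call could be assembled as below. The flags come from the usage text above and the paths and method name from the log in the question. The run_pipeline.pl launcher, the -debug flag and the -endPlugin terminator are assumptions based on the usual TASSEL command-line convention, and the helper name is hypothetical. It may also be necessary to put the _readMappings.txt files in a directory of their own rather than pointing at /phg/outputDir/ directly.

```python
import shlex


def import_read_mapping_args(config_file, mapping_dir, method, key_file):
    """Build the argument list for ImportReadMappingToDBPlugin."""
    return [
        "-ImportReadMappingToDBPlugin",
        "-configFileForFinalDB", config_file,
        "-loadFromDB", "false",          # inputs are flat read mapping files
        "-inputMappingDir", mapping_dir,
        "-readMappingMethod", method,
        "-outputKeyFile", key_file,      # key file later used for path finding
        "-endPlugin",                    # assumed TASSEL plugin terminator
    ]


args = import_read_mapping_args(
    "/phg/config.txt",
    "/phg/outputDir/",
    "DIPLOID_VCF_HAP_COUNT_METHOD_mxDiv.0",
    "/phg/keyFile_pathKeyFile.txt",
)
# Assumed launcher; adjust to however TASSEL is started in your setup.
command = shlex.join(["run_pipeline.pl", "-debug"] + args)
print(command)
```

Writing the key file to /phg/keyFile_pathKeyFile.txt matches the path DiploidPathPlugin looked for in the log, so the path-finding step could then be run against it.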