Hello!
I am trying to do the last few steps of the Practical Haplotype Graph pipeline as described here . I am running PHG v1.2, have loaded the Haplotypes to the database, created the pangenome fasta and am now trying to impute the best paths for different samples using WGS reads in fastq format using the Plugin: -ImputePipelinePlugin -imputeTarget pathToVCF
. When I start this as a test for one sample I get a vcf file as an output (against the reference I am guessing?), but I do not get any information about the calculated path. The keyfiles created by the plugin (_pathKeyFile
and _withMappingIds
) are empty apart from the header. These files, to my understanding, should contain haplotype counts?
If I go into the database and look at the "paths" table, I find it has entries for all the assemblies I used to load the haplotypes plus one for the reference (at least judging by the method_id
). The tables haplotype_counts
and read_mapping_paths
are empty, but I suppose there should be information about the alignment, haplotype counts and path saved there, so there seems to be an issue.
The plugin seems to run till the end (and output a vcf), but I did get one ERROR during its run:
ERROR net.maizegenetics.pangenome.hapCalling.Minimap2Utils - input directory does not contain both of lineA_1.trim.gz and lineA_2.trim.gz. The directory name should not be part of the filename in the keyfile.
How could it do any alignment and output a vcf if it can not find these files? I checked the path listed in the config file:
fastqDir=/PHG/inputDir/imputation/fastq/
And the two files are located at that location.
I would greatly appreciate help or direction into what to check!
Many thanks!
If helpful this is my keyfile:
cultivar flowcell_lane filename filename2
lineA lineA lineA_1.trim.gz lineA_2.trim.gz
And this is my config file:
# Imputation Pipeline parameters for fastq or SAM files
# Required Parameters!!!!!!!
#--- Database ---
host=localHost
user=xxx
password=xxx
DB=/PHG/phg_run2.db
DBtype=sqlite
#--- Used by liquibase to check DB version ---
liquibaseOutdir=/PHG/outputDir/
#--- Used for writing a pangenome reference fasta(not needed when inputType=vcf) ---
pangenomeHaplotypeMethod=assembly_by_anchorwave
pangenomeDir=/PHG/outputDir/pangenome
indexKmerLength=21
indexWindowSize=11
indexNumberBases=90G
#--- Used for mapping reads
inputType=fastq
readMethod=lineA_20230306_run2
keyFile=/PHG/20230306_wgsconfigs/lineA_20230306readMapping_key_file.txt
fastqDir=/PHG/inputDir/imputation/fastq/
samDir=/PHG/inputDir/imputation/sam/
lowMemMode=true
maxRefRangeErr=0.25
outputSecondaryStats=false
maxSecondary=20
fParameter=f15000,16000
minimapLocation=minimap2
#--- Used for path finding
pathHaplotypeMethod=assembly_by_anchorwave
pathMethod=lineA_20230306_run2
maxNodes=1000
maxReads=10000
minReads=1
minTaxa=1
minTransitionProb=0.0005
numThreads=36
probCorrect=0.99
removeEqual=false
splitNodes=true
splitProb=0.99
usebf=true
minP=0.8
# used by diploid path finding only
maxHap=11
maxReadsKB=100
algorithmType=classic
#--- Used to output a vcf file for pathMethod
#~~~ Optional Parameters ~~~
outVcfFile=/PHG/outputDir/align/lineA_20230306_run2_variants.vcf
localGVCFFolder=/PHG/inputDir/loadDB/gvcf
#pangenomeIndexName=**OPTIONAL**
#readMethodDescription=**OPTIONAL**
#pathMethodDescription=**OPTIONAL**
debugDir=/PHG/debugDir/
#bfInfoFile=**OPTIONAL**
Hello,
Thank you so much for the quick response!!
I will try again renaming the fastq files. If I rerun the script with the same readMethod and pathMethod it should overwrite old entries in the database, correct?
I am not sure which taxa information there should be, but the vcf contains the header and seemingly SNPs across all chromosomes of the reference genome (at least judging by the chromosome names).
These are the first few lines of the vcf if you are interested:
My guess is that there is nothing in the DB for the readMethod or pathMethod, so it is ok to use the same. If the pipeline had found the sample correctly it would have added a column for lineA.
Yes, I think. Looking at the "methods" table in the DB there is no entry named the same or similar to the read/pathMethod.
I apprechiate your advice and will try to rerun this step. Thank you!
Sorry, perhaps also as a clarification for later regarding the VCF file.
The REF column will contain the allele of the reference genome, while the ALT contains all the alleles of the pangenome, i.e. those of my assemblies and not of my samples (fastq files based), correct?
1) Is there a way to tell the best calculated path from the VCF? My chromosomes are unfortunately called the same across assemblies, so I would guess the "ASM_Chr" field would not help? Or is the path only saved in the BLOB in the "paths" table inside the DB? If yes, how do I access the BLOB object to create something readable (I unfortunately do not have prior experience with sql)?
2) Is the alignment against the pangenome itself saved somewhere? Since I might want to continue work with it.
Sorry, if this is too far from the first topic, I will create a new post.