Java -Xmx option does not limit memory usage in the ImputePipelinePlugin?

2.4 years ago

twrl8 • 0

Hello!

I am currently trying to run the pathfinding step in the PHG pipeline (v1.2) using the -ImputePipelinePlugin -imputeTarget pathToVCF options, but I might be running into memory issues.

I used the command in this way:

singularity exec -B /netscratch:/netscratch phg_1_2.simg /tassel-5-standalone/run_pipeline.pl -Xmx150G -debug -configParameters /PHG/pathfinding_config.txt -ImputePipelinePlugin -imputeTarget pathToVCF -endPlugin

I might misunderstand this, but shouldn't the -Xmx option limit the amount of memory the job can use? After about a day of running, my job now seems to use 320GB of memory. I am worried this might increase even more and potentially reach the memory limit of the machine I'm using, or cause other people's jobs running on it to crash.

Is there a way to estimate roughly how much memory this job will use in the end, for example from the size of the pangenome FASTA (78GB)?
I am currently running imputation for one sample (paired-end reads, 2 gzipped fastq files of 30GB each, ~430 million reads). In the config file I chose numThreads=70 (but did not include any Xmx parameter there).

Does someone have prior experience with this?
Many thanks in advance!!

ADD COMMENT • link 2.4 years ago by twrl8 • 0

Based on the program name, run_pipeline.pl appears to be a Perl script. The option you are referring to is for Java. Is that Perl script calling some Java code? Otherwise that option does nothing for the Perl code, unless it is required for Singularity (I am not a user myself).

ADD REPLY • link 2.4 years ago by GenoMax 153k

Does the amount of memory keep increasing, or does it start high and remain stable? Do you have a log file you can post? That may give us information on what is allocated by/for Singularity vs what is allocated for the PHG java code.

ADD REPLY • link 2.4 years ago by lcj34 ▴ 420

When it started it relatively quickly went up to 200GB, then steadily increased to now 327GB. So yes, it still seems to be increasing.

Do you mean the console output? It has already started minimap2, so that might be what is using so much memory?

(Apologies, the output is very long, but it just continues increasing the number of Processed alignments. Up to 1946000000 so far.)

	INFO: Converting SIF file to temporary sandbox...
	WARNING: underlay of /etc/localtime required more than 50 (101) bind mounts
	/tassel-5-standalone/lib/kotlin-stdlib-jdk8-1.4.32.jar:/tassel-5-standalone/lib/scala-library-2.10.1.jar:/tassel-5-standalone/lib/biojava-core-6.0.4.jar:/tassel-5-standalone/lib/phg.jar:/tassel-5-standalone/lib/ejml-ddense-0.41.jar:/tassel-5-standalone/lib/sshj-0.32.0.jar:/tassel-5-standalone/lib/slf4j-simple-1.7.10.jar:/tassel-5-standalone/lib/jfreesvg-3.2.jar:/tassel-5-standalone/lib/itextpdf-5.1.0.jar:/tassel-5-standalone/lib/jfreechart-1.0.19.jar:/tassel-5-standalone/lib/sqlite-jdbc-3.39.2.1.jar:/tassel-5-standalone/lib/forester-1.039.jar:/tassel-5-standalone/lib/biojava-genome-6.0.4.jar:/tassel-5-standalone/lib/colt-1.2.0.jar:/tassel-5-standalone/lib/slf4j-api-1.7.10.jar:/tassel-5-standalone/lib/fastutil-8.2.2.jar:/tassel-5-standalone/lib/junit-4.10.jar:/tassel-5-standalone/lib/ahocorasick-0.2.4.jar:/tassel-5-standalone/lib/jhdf5-14.12.5.jar:/tassel-5-standalone/lib/biojava-phylo-4.2.12.jar:/tassel-5-standalone/lib/gs-core-1.3.jar:/tassel-5-standalone/lib/trove-3.0.3.jar:/tassel-5-standalone/lib/ejml-core-0.41.jar:/tassel-5-standalone/lib/kotlinx-coroutines-core-jvm-1.4.3.jar:/tassel-5-standalone/lib/commons-io-2.11.0.jar:/tassel-5-standalone/lib/gs-ui-1.3.jar:/tassel-5-standalone/lib/snappy-java-1.1.8.4.jar:/tassel-5-standalone/lib/javax.json-1.0.4.jar:/tassel-5-standalone/lib/json-simple-1.1.1.jar:/tassel-5-standalone/lib/jcommon-1.0.23.jar:/tassel-5-standalone/lib/biojava-alignment-6.0.4.jar:/tassel-5-standalone/lib/kotlin-stdlib-jdk7-1.4.32.jar:/tassel-5-standalone/lib/postgresql-9.4-1201.jdbc41.jar:/tassel-5-standalone/lib/mail-1.4.jar:/tassel-5-standalone/lib/ini4j-0.5.4.jar:/tassel-5-standalone/lib/log4j-1.2.13.jar:/tassel-5-standalone/lib/kotlin-stdlib-1.4.32.jar:/tassel-5-standalone/lib/commons-codec-1.10.jar:/tassel-5-standalone/lib/commons-math3-3.4.1.jar:/tassel-5-standalone/lib/htsjdk-2.24.1.jar:/tassel-5-standalone/lib/guava-22.0.jar:/tassel-5-standalone/sTASSEL.jar
	Memory Settings: -Xms512m -Xmx150G
	Tassel Pipeline Arguments: -debug -configParameters /PHG/20230310_wgsconfigs/lineA_20230310_config.txt -ImputePipelinePlugin -imputeTarget pathToVCF -endPlugin
	[main] INFO net.maizegenetics.plugindef.ParameterCache - load: loading parameter cache with: /PHG/20230310_wgsconfigs/lineA_20230310_config.txt
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: DBtype value: sqlite
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: maxSecondary value: 20
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: host value: localHost
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: minTransitionProb value: 0.0005
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: DB value: /PHG/phg_run2.db
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: inputType value: fastq
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: pangenomeDir value: /PHG/outputDir/pangenome
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: outputSecondaryStats value: false
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: maxNodes value: 1000
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: samDir value: /PHG/inputDir/imputation/sam/
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: fParameter value: f15000,16000
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: minimapLocation value: minimap2
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: numThreads value: 70
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: pathMethod value: lineA_20230320_run2
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: probCorrect value: 0.99
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: indexNumberBases value: 90G
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: pangenomeHaplotypeMethod value: assembly_by_anchorwave
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: minTaxa value: 1
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: indexKmerLength value: 21
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: usebf value: true
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: removeEqual value: false
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: pathHaplotypeMethod value: assembly_by_anchorwave
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: minReads value: 1
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: keyFile value: /PHG/20230310_wgsconfigs/lineA_20230310readMapping_key_file.txt
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: user value: BarleyTEs
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: lowMemMode value: true
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: debugDir value: /PHG/debugDir/
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: password value: MaizeTEs
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: splitNodes value: true
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: outVcfFile value: /PHG/outputDir/align/lineA_20230320_run2_variants.vcf
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: maxReads value: 10000
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: maxRefRangeErr value: 0.25
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: indexWindowSize value: 11
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: minP value: 0.8
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: splitProb value: 0.99
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: fastqDir value: /PHG/inputDir/imputation/fastq/
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: readMethod value: lineA_20230320_run2
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: configFile value: /PHG/20230310_wgsconfigs/lineA_20230310_config.txt
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: liquibaseOutdir value: /PHG/outputDir/
	[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: localGVCFFolder value: /PHG/inputDir/loadDB/gvcf
	[main] INFO net.maizegenetics.tassel.TasselLogging - Tassel Version: 5.2.86 Date: October 11, 2022
	[main] INFO net.maizegenetics.tassel.TasselLogging - Max Available Memory Reported by JVM: 136533 MB
	[main] INFO net.maizegenetics.tassel.TasselLogging - Java Version: 1.8.0_242
	[main] INFO net.maizegenetics.tassel.TasselLogging - OS: Linux
	[main] INFO net.maizegenetics.tassel.TasselLogging - Number of Processors: 128
	[main] INFO net.maizegenetics.pipeline.TasselPipeline - Tassel Pipeline Arguments: [-fork1, -ImputePipelinePlugin, -imputeTarget, pathToVCF, -endPlugin, -runfork1]
	net.maizegenetics.pangenome.pipeline.ImputePipelinePlugin
	[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - Starting net.maizegenetics.pangenome.pipeline.ImputePipelinePlugin: time: Mar 20, 2023 16:40:6
	[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin -
	ImputePipelinePlugin Parameters
	imputeTarget: pathToVCF
	inputType: fastq
	configFile: /PHG/20230310_wgsconfigs/lineA_20230310_config.txt
	pangenomeHaplotypeMethod: assembly_by_anchorwave
	pathHaplotypeMethod: assembly_by_anchorwave
	pangenomeDir: /PHG/outputDir/pangenome
	pangenomeIndexName: null
	indexKmerLength: 21
	indexWindowSize: 11
	indexNumberBases: 90G
	minimapLocation: minimap2
	readMethod: lineA_20230320_run2
	readMethodDescription: null
	outVcfFile: /PHG/outputDir/align/lineA_20230320_run2_variants.vcf
	forceDBUpdate: false
	liquibaseOutdir: /PHG/outputDir/
	skipLiquibaseCheck: false
	localGVCFFolder: /PHG/inputDir/loadDB/gvcf

	[pool-1-thread-1] INFO net.maizegenetics.pangenome.pipeline.ImputePipelinePlugin - Checking if Liquibase can be run.
	[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - Starting net.maizegenetics.pangenome.liquibase.CheckDBVersionPlugin: time: Mar 20, 2023 16:40:6
	[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin -
	CheckDBVersionPlugin Parameters
	outputDir: /PHG/outputDir/

	[pool-1-thread-1] INFO net.maizegenetics.pangenome.liquibase.CheckDBVersionPlugin - Deleting yesFile /PHG/outputDir//run_yes.txt if it exists
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.liquibase.CheckDBVersionPlugin - Deleting noFile /PHG/outputDir/run_no.txt if it exists
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - first connection: dbName from config file = /PHG/phg_run2.db host: localHost user: BarleyTEs type: sqlite
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - Database URL: jdbc:sqlite:/PHG/phg_run2.db
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - Connected to database:

	[pool-1-thread-1] INFO net.maizegenetics.pangenome.liquibase.CheckDBVersionPlugin - queueHaplotypeNodesByRange: query: select name FROM sqlite_master where type='table' and name='variants';
	[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - Finished net.maizegenetics.pangenome.liquibase.CheckDBVersionPlugin: time: Mar 20, 2023 16:40:7
	[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - net.maizegenetics.pangenome.liquibase.CheckDBVersionPlugin Citation: Bradbury PJ, Zhang Z, Kroon DE, Casstevens TM, Ramdoss Y, Buckler ES. (2007) TASSEL: Software for association mapping of complex traits in diverse samples. Bioinformatics 23:2633-2635.
	[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - Starting net.maizegenetics.pangenome.liquibase.LiquibaseUpdatePlugin: time: Mar 20, 2023 16:40:7
	[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin -
	LiquibaseUpdatePlugin Parameters
	outputDir: /PHG/outputDir/
	command: status

	[pool-1-thread-1] INFO net.maizegenetics.pangenome.liquibase.LiquibaseUpdatePlugin - Please wait, begin Command:liquibase --driver=org.sqlite.JDBC --url=jdbc:sqlite:/PHG/phg_run2.db --username=BarleyTEs --password=MaizeTEs --changeLogFile=changelogs/db.changelog-master.xml status --verbose
	[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - Finished net.maizegenetics.pangenome.liquibase.LiquibaseUpdatePlugin: time: Mar 20, 2023 16:40:10
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.pipeline.ImputePipelinePlugin - PHG DB is up to date. Proceeding with Populating the PHG DB.
	[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - Starting net.maizegenetics.pangenome.api.HaplotypeGraphBuilderPlugin: time: Mar 20, 2023 16:40:10
	[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin -
	HaplotypeGraphBuilderPlugin Parameters
	configFile: /PHG/20230310_wgsconfigs/lineA_20230310_config.txt
	methods: assembly_by_anchorwave
	includeSequences: true
	includeVariantContexts: false
	haplotypeIds: null
	chromosomes: null
	taxa: null
	localGVCFFolder: /PHG/inputDir/loadDB/gvcf

	[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - first connection: dbName from config file = /PHG/phg_run2.db host: localHost user: BarleyTEs type: sqlite
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - Database URL: jdbc:sqlite:/PHG/phg_run2.db
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - Connected to database:

	[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - referenceRangesAsMap: query statement: select reference_ranges.ref_range_id, chrom, range_start, range_end, methods.name from reference_ranges INNER JOIN ref_range_ref_range_method on ref_range_ref_range_method.ref_range_id=reference_ranges.ref_range_id INNER JOIN methods on ref_range_ref_range_method.method_id = methods.method_id AND methods.method_type = 7 ORDER BY reference_ranges.ref_range_id
	methods size: 1
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - referenceRangesAsMap: number of reference ranges: 422593
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - referenceRangesAsMap: time: 9.720996581 secs.
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - taxaListMap: query statement: SELECT gamete_haplotypes.gamete_grp_id, genotypes.line_name FROM gamete_haplotypes INNER JOIN gametes ON gamete_haplotypes.gameteid = gametes.gameteid INNER JOIN genotypes on gametes.genoid = genotypes.genoid ORDER BY gamete_haplotypes.gamete_grp_id;
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - taxaListMap: number of taxa lists: 20
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - taxaListMap: time: 0.026761193 secs.
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - createHaplotypeNodes: haplotype method: assembly_by_anchorwave range group method: null
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - createHaplotypeNodes: query statement: SELECT haplotypes_id, gamete_grp_id, haplotypes.ref_range_id, asm_contig, asm_start_coordinate, asm_end_coordinate, asm_strand, genome_file_id, sequence, seq_has...
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - CreateGraphUtils:addNodes - query=SELECT haplotypes_id, gamete_grp_id, haplotypes.ref_range_id, asm_contig, asm_start_coordinate, asm_end_coordinate, asm_strand, genome_file_id, sequence, seq_hash, seq_len FROM haplotypes WHERE method_id = 4;
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - addNodes: number of nodes: 7920571
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - addNodes: number of reference ranges: 421670
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - createHaplotypeNodes: time: 278.25256257 secs.
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.HaplotypeGraph - Created graph edges: created when requested number of nodes: 7920571 number of reference ranges: 421670
	[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - Finished net.maizegenetics.pangenome.api.HaplotypeGraphBuilderPlugin: time: Mar 20, 2023 16:45:19
	[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - Starting net.maizegenetics.pangenome.hapCalling.FastqToMappingPlugin: time: Mar 20, 2023 16:45:19
	[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin -
	FastqToMappingPlugin Parameters
	minimap2IndexFile: /PHG/outputDir/pangenome/pangenome_assembly_by_anchorwave_k21w11I90G.mmi
	keyFile: /PHG/20230310_wgsconfigs/lineA_20230310readMapping_key_file.txt
	fastqDir: /PHG/inputDir/imputation/fastq/
	maxRefRangeErr: 0.25
	lowMemMode: true
	maxSecondary: 20
	fParameter: f15000,16000
	minimapLocation: minimap2
	methodName: lineA_20230320_run2
	methodDescription: null
	debugDir: /PHG/debugDir/
	outputSecondaryStats: false
	isTestMethod: false

	[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - first connection: dbName from config file = /PHG/phg_run2.db host: localHost user: BarleyTEs type: sqlite
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - Database URL: jdbc:sqlite:/PHG/phg_run2.db
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - Connected to database:

	[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.PHGdbAccess - PHGdbAccess - db is setup, init prepared statements, load hash table
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.PHGdbAccess -
	beginning - isSqlite is true
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.PHGdbAccess - before loading hash, size of all geneotypes in genotype table=20
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.PHGdbAccess - refRangeRefRangeIDMap is null, creating new one with size : 422593
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.PHGdbAccess - loadAnchorHash: at end, size of refRangeRefRangeIDMap: 422593, number of rs.next processed: 422593
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.PHGdbAccess - before loading hash, size of all methods in method table=5
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.PHGdbAccess - before loading hash, size of all groups in taxa_groups table=0
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.PHGdbAccess - before loading hash, size of all groups in gamete_groups table=20
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.PHGdbAccess - before loading hash, size of all gametes in gametes table=20
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.PHGdbAccess - putHaplotypeListData - at end, haplotypeListId = 1
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.Minimap2Utils - runMinimapFromKeyFile: calling updateReadMappingHash()
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.PHGdbAccess - before loading readMappingHash, size of all read_mappings in read_mapping table=0
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.Minimap2Utils - runMinimapFromKeyFile, updateReadMappingHash took 3.56044E-4 seconds
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.Minimap2Utils - Setting up MinimapRun for: cultivar lineA, flowcell_lane lineA.
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.Minimap2Utils - Running Minimap2 Command:
	minimap2 -ax sr -t 126 --secondary=yes -N20 -f15000,16000 --eqx /PHG/outputDir/pangenome/pangenome_assembly_by_anchorwave_k21w11I90G.mmi /PHG/inputDir/imputation/fastq/lineA_1_trim.fastq.gz /PHG/inputDir/imputation/fastq/lineA_2_trim.fastq.gz
	[M::main::420.146*0.41] loaded/built the index for 7920571 target sequence(s)
	[M::mm_mapopt_update::420.146*0.41] mid_occ = 15000
	[M::mm_idx_stat] kmer size: 21; skip: 11; is_hpc: 0; #seq: 7920571
	[M::mm_idx_stat::432.263*0.42] distinct minimizers: 399128275 (18.61% are singletons); average occurrences: 33.020; average spacing: 6.329
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.Minimap2Utils - Time spent setting up run: Taxon:lineA : 580.650659487sec
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.Minimap2Utils - Running in Low memory mode. Simply counting the number of reads which hit a given set of haplotype ids
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.Minimap2Utils - Processed 1000000 alignments.
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.Minimap2Utils - Processed 2000000 alignments.
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.Minimap2Utils - Processed 3000000 alignments.
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.Minimap2Utils - Processed 4000000 alignments.
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.Minimap2Utils - Processed 5000000 alignments.
	[M::worker_pipeline::798.307*7.22] mapped 334974 sequences
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.Minimap2Utils - Processed 6000000 alignments.
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.Minimap2Utils - Processed 7000000 alignments.
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.Minimap2Utils - Processed 8000000 alignments.
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.Minimap2Utils - Processed 9000000 alignments.
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.Minimap2Utils - Processed 10000000 alignments.
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.Minimap2Utils - Processed 11000000 alignments.
	[M::worker_pipeline::901.595*8.42] mapped 335006 sequences
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.Minimap2Utils - Processed 12000000 alignments.
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.Minimap2Utils - Processed 13000000 alignments.
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.Minimap2Utils - Processed 14000000 alignments.
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.Minimap2Utils - Processed 15000000 alignments.
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.Minimap2Utils - Processed 16000000 alignments.
	[pool-1-thread-1] INFO net.maizegenetics.pangenome.hapCalling.Minimap2Utils - Processed 17000000 alignments.
	[M::worker_pipeline::1019.984*9.18] mapped 334934 sequences

ADD REPLY • link 2.4 years ago by twrl8 • 0

Can you trim some of the log output on GitHub gist? We get the idea of what is happening.

minimap is using -t 126 (126 threads) so that may also have something to do with memory usage.

ADD REPLY • link 2.4 years ago by GenoMax 153k

Done. Many apologies!
And thank you for checking.

Yes, there seems to be something up with the thread count.

ADD REPLY • link 2.4 years ago by twrl8 • 0

2.4 years ago

zrm22 ▴ 40

The -Xmx flag limits the JVM heap space for the Java process called within run_pipeline.pl. The issue here is that the ImputePipelinePlugin needs to execute minimap2, which runs as a separate system process from the JVM and is therefore not covered by the heap limit. To my understanding, minimap2 will use all available RAM if it needs to.
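To see which process is actually holding the memory, one option is to sum resident memory per command name on the host. This is a minimal sketch, assuming a Linux host with procps `ps` and `awk`. The function name `sum_rss_by_command` is made up for illustration, and RSS counts shared pages once per process, so the totals are approximate.

```shell
# Sum resident memory (RSS) per command name from ps-style input.
# Input lines:  "<rss_in_KiB> <command_name>", as printed by: ps -eo rss=,comm=
# Output lines: "<command_name> <total_GiB>", largest consumer first.
sum_rss_by_command() {
    awk '{ rss[$2] += $1 }
         END { for (c in rss) printf "%s %.1f\n", c, rss[c] / 1048576 }' |
        sort -k2,2nr
}

# Typical use while the job is running (hypothetical session):
#   ps -eo rss=,comm= | sum_rss_by_command | head
```

If the `minimap2` line accounts for most of the total while `java` stays near the -Xmx value, the growth is coming from the aligner rather than the JVM heap.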

So I think you have a few options:

Lower the number of threads. This should make minimap2 use less RAM, but you will still need to load the index file, which gets fairly large.
Limit the memory allocated to the Singularity container (see the Singularity documentation). This should force anything run within the container to stay within your request. However, once the job hits that cap, it will likely stop.
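The second option could look roughly like the sketch below, which only builds and prints the command so it can be reviewed before running. It assumes the installed Singularity/Apptainer version supports the `--memory` resource-limit flag (older releases may need a cgroups configuration file instead, and unprivileged use may need admin setup). The function name `build_limited_cmd` and the 250G cap are hypothetical; the remaining arguments are taken from the command in the question.

```shell
# Build (but do not run) a `singularity exec` command with a memory cap.
# ASSUMPTION: `--memory` is supported by the installed version;
# check `singularity exec --help` before relying on it.
build_limited_cmd() {
    local mem_limit="$1"   # cap for everything in the container (java + minimap2)
    local java_heap="$2"   # JVM heap cap, passed through as -Xmx
    local config="$3"      # path to the PHG config file
    printf '%s ' singularity exec --memory "$mem_limit" \
        -B /netscratch:/netscratch phg_1_2.simg \
        /tassel-5-standalone/run_pipeline.pl "-Xmx${java_heap}" -debug \
        -configParameters "$config" \
        -ImputePipelinePlugin -imputeTarget pathToVCF -endPlugin
    printf '\n'
}

# Print the command for review; 250G is an illustrative value, not a recommendation.
build_limited_cmd 250G 150G /PHG/pathfinding_config.txt
```

The container cap has to leave room for both the JVM heap and the minimap2 index, so it would normally be set well above the -Xmx value.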

Just a note: we are currently investigating and implementing a new version of the FASTQ -> read mapping step, which is likely what you are running into here. This new version bypasses minimap2 entirely and uses k-mers to determine the read mappings. Initial testing is very promising: RAM usage is far lower (10-20GB), speed is very good (1-2 minutes for a 2-3x WGS paired-end fastq pair), and the results are close enough to what minimap2 provides that the path finding is nearly identical. Hopefully this will be included in the next version of the PHG.

ADD COMMENT • link 2.4 years ago by zrm22 ▴ 40

Ahh thank you!

I thought the -Xmx parameter would be passed to the downstream commands called by the plugin. In that case I definitely need to limit Singularity, since I can't take up all the memory on this machine.

Regarding the thread number: as GenoMax pointed out, minimap2 uses 126 threads. This is two fewer than my machine has, so that would fit well with the documentation saying:

The number of threads that will be used to impute individual paths is numThreads - 2 because 2 threads are reserved for other operations.

However, in the config file I set numThreads=70, so could something be going wrong, or is there something I didn't set that prevents this parameter from being passed to minimap2?

That update does sound very enticing! I need to do this for a lot more samples, and the previous test I started ran for over a week before crashing due to memory (since someone else was also using the machine heavily), so increasing the speed while reducing memory requirements sounds fantastic.
I apologise, this is probably unfair to ask since these things simply take their time, but is there a rough idea of when that next update will be published?

ADD REPLY • link 2.4 years ago by twrl8 • 0

It looks like the ImputePipelinePlugin does not use the numThreads option for the minimap2 run, but instead uses numThreadsOnMachine - 2, as mentioned in the documentation. My intuition is that you can lower the number of CPUs that Singularity has access to, and that might do what you need. I will add a ticket for us to add a parameter for the minimap2 runs. It would definitely be nice to let the user change this easily.
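As a quick sanity check of that rule against the log above (128 processors reported, `minimap2 -t 126`), here is a small sketch. The function name `expected_minimap2_threads` and the floor of 1 are additions for illustration, not documented PHG behaviour. Whether restricting CPUs for the container changes the processor count the JVM sees is an assumption that would need testing.

```shell
# Thread count minimap2 would get if the plugin uses (visible CPUs - 2).
# The floor of 1 is an addition for this sketch, not documented PHG behaviour.
expected_minimap2_threads() {
    local cpus="$1"
    local threads=$(( cpus - 2 ))
    if [ "$threads" -lt 1 ]; then
        threads=1
    fi
    echo "$threads"
}

expected_minimap2_threads 128   # machine from the log; prints 126
expected_minimap2_threads 72    # hypothetical: only 72 CPUs visible; prints 70
```

On the host, `nproc` prints the number of CPUs available to the current process, which is a reasonable first value to feed into this check.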

As for the timeline for the next update, my goal is to get this module released in the coming months. If the algorithm has been fully tested and works, we should have it out by the end of summer at the latest. I think we can likely have it ready for other people to test in April, but more pressing things may come up that would delay it.

ADD REPLY • link 2.4 years ago by zrm22 ▴ 40

Thank you very much! I will try using the singularity options.

Thank you for that as well. I will keep an eye on Docker Hub for newer versions, since this could really help me. I think Docker Hub shows the code that was added, but is there anywhere to see which functionality a new version brings?

ADD REPLY • link 2.4 years ago by twrl8 • 0