Hi, I'm trying to lift over a VCF from the ALFA project in order to get the frequencies in HG19 instead of HG38, but LiftoverVcf (latest version) finishes with nothing written to either the output file or the reject file. There is no error message either, even with --VERBOSITY DEBUG (apart from 3 match errors, which come after a long block of correct mappings that are not reflected in the output VCF).
My source VCF is: https://ftp.ncbi.nih.gov/snp/population_frequency/latest_release/freq.vcf.gz
VCF preparation: I needed to tweak this VCF a little due to version requirements. The header states version 4.0, and this version has mandatory GT fields that are not present in the VCF. I solved it by changing the version to 4.1. This VCF also required renaming the chromosomes from RefSeq to UCSC names, which I did successfully with bcftools annotate --rename-chrs.
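For reference, the header change can be sketched roughly as follows. This is a minimal illustration on a made-up one-record file, not the exact commands I ran: the file names `in.vcf` and `out.vcf` and the record itself are placeholders, and it assumes the `##fileformat` line is the first line of an uncompressed VCF.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)

# Tiny stand-in for the ALFA VCF (hypothetical content, tab-separated).
printf '##fileformat=VCFv4.0\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nNC_000001.11\t10001\trs1\tT\tA\t.\t.\t.\n' > "$workdir/in.vcf"

# Rewrite only the first line, and only if it declares VCFv4.0.
sed '1s/^##fileformat=VCFv4\.0$/##fileformat=VCFv4.1/' "$workdir/in.vcf" > "$workdir/out.vcf"

# Show the resulting header line.
head -n 1 "$workdir/out.vcf"
```

The contig renaming is a separate step: bcftools annotate --rename-chrs takes a two-column mapping file (old name, new name), one contig per line.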
I also cropped unnecessary records from the VCF and was left with a text VCF of 3000 lines. The 3000-line VCF works like a charm and I get the liftover result as expected, but for larger files (20000 lines, for instance) I get no output. In the real process I have to lift over 16,000,000 records. I tried to reduce the number of records held in RAM with --MAX_RECORDS_IN_RAM 10000, with no success.
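A test subset of this kind can be produced along these lines. It is only a sketch under assumptions: the input here is a made-up five-record file, `full.vcf` and `subset.vcf` are placeholder names, and `n` counts data records rather than total lines.

```shell
workdir=$(mktemp -d)

# Hypothetical input: 2 header lines followed by 5 records.
{
  printf '##fileformat=VCFv4.1\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  for i in 1 2 3 4 5; do
    printf 'chr1\t%s\trs%s\tA\tG\t.\t.\t.\n' "$((1000 + i))" "$i"
  done
} > "$workdir/full.vcf"

n=3  # number of records to keep (placeholder value)

# Print every header line; print data lines only until n have been seen.
awk -v n="$n" '/^#/ { print; next } seen < n { print; seen++ }' \
  "$workdir/full.vcf" > "$workdir/subset.vcf"

# Report how many records ended up in the subset.
grep -vc '^#' "$workdir/subset.vcf"
```

Keeping the full header means the subset stays a valid VCF, so the same LiftoverVcf command can be run on files of increasing size to find where the output disappears.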
GATK version used: 4.3.0.0
Command:
#/var/test/maf/gatk4/gatk-4.3.0.0/gatk LiftoverVcf \
--java-options " -Xmx10g" \
--CHAIN /var/test/maf/chains/hg38ToHg19.over.chain.gz \
--INPUT /var/test/maf/alfa/3000.vcf \
--OUTPUT 3000Hg19.vcf \
--REFERENCE_SEQUENCE /var/test/maf/genomes/hg19.fa \
--REJECT reject.vcf \
--VERBOSITY DEBUG \
--MAX_RECORDS_IN_RAM 1000
The reference genome was downloaded from GoldenPath: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
The chain file comes from UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz
Program log:
Using GATK jar /var/test/maf/gatk4/gatk-4.3.0.0/gatk-package-4.3.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx10g -jar /var/test/maf/gatk4/gatk-4.3.0.0/gatk-package-4.3.0.0-local.jar LiftoverVcf --CHAIN /var/test/maf/chains/hg38ToHg19.over.chain.gz --INPUT /var/test/maf/alfa/3000.vcf --OUTPUT 3000Hg19.vcf --REFERENCE_SEQUENCE /var/test/maf/genomes/hg19.fa --REJECT reject.vcf --VERBOSITY DEBUG --MAX_RECORDS_IN_RAM 1000
13:36:11.515 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/var/test/maf/gatk4/gatk-4.3.0.0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Sat Jan 14 13:36:11 UTC 2023] LiftoverVcf --INPUT /var/test/maf/alfa/3000.vcf --OUTPUT 3000Hg19.vcf --CHAIN /var/test/maf/chains/hg38ToHg19.over.chain.gz --REJECT reject.vcf --VERBOSITY DEBUG --MAX_RECORDS_IN_RAM 1000 --REFERENCE_SEQUENCE /var/test/maf/genomes/hg19.fa --WARN_ON_MISSING_CONTIG false --LOG_FAILED_INTERVALS true --WRITE_ORIGINAL_POSITION false --WRITE_ORIGINAL_ALLELES false --LIFTOVER_MIN_MATCH 1.0 --ALLOW_MISSING_FIELDS_IN_HEADER false --RECOVER_SWAPPED_REF_ALT false --TAGS_TO_REVERSE AF --TAGS_TO_DROP MAX_AF --DISABLE_SORT false --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Sat Jan 14 13:36:11 UTC 2023] Executing as root@ip-172-31-93-206.ec2.internal on Linux 5.10.135-122.509.amzn2.x86_64 amd64; OpenJDK 64-Bit Server VM 17.0.5+8-LTS; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.3.0.0
INFO 2023-01-14 13:36:12 LiftoverVcf Loading up the target reference genome.
INFO 2023-01-14 13:36:24 LiftoverVcf Lifting variants over and sorting (not yet writing the output file.)
DEBUG 2023-01-14 13:36:25 SnappyLoader Snappy successfully loaded.
INFO 2023-01-14 13:36:25 LiftOver Interval chr1:1638937-1638948 failed to match chain 2 because intersection length 11 < minMatchSize 12.0 (0.9166667 < 1.0)
INFO 2023-01-14 13:36:25 LiftOver Interval chr1:1655886-1655894 failed to match chain 2 because intersection length 5 < minMatchSize 9.0 (0.5555556 < 1.0)
INFO 2023-01-14 13:36:25 LiftOver Interval chr1:1662738-1662739 failed to match chain 2 because intersection length 1 < minMatchSize 2.0 (0.5 < 1.0)
Any help would be greatly appreciated. Thanks in advance!
Juan Pablo
Thank you, Pierre. I moved the process to a system with more memory and it worked fine. I was puzzled that there were no error messages related to memory, but that was evidently the cause.
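Since the run failed silently, a quick record count after each run would have flagged the problem. Below is a minimal sketch, assuming plain-text VCFs and assuming that every input record should end up in either the output or the reject file; the file names and contents are made up for the demonstration, and `count_records` is a hypothetical helper.

```shell
# Count non-header lines in a plain-text VCF; a missing file counts as 0.
count_records() {
  if [ -f "$1" ]; then
    grep -vc '^#' "$1" || true   # grep exits 1 when the count is 0
  else
    echo 0
  fi
}

workdir=$(mktemp -d)

# Hypothetical stand-ins: 3 input records, but empty output and reject files,
# which is the situation described in the question.
printf '#CHROM\tPOS\nchr1\t100\nchr1\t200\nchr1\t300\n' > "$workdir/in.vcf"
: > "$workdir/lifted.vcf"
: > "$workdir/reject.vcf"

in_n=$(count_records "$workdir/in.vcf")
out_n=$(count_records "$workdir/lifted.vcf")
rej_n=$(count_records "$workdir/reject.vcf")

# Flag the run if lifted + rejected does not add up to the input.
if [ "$((out_n + rej_n))" -ne "$in_n" ]; then
  status="MISMATCH"
else
  status="OK"
fi
echo "$status: input=$in_n lifted=$out_n rejected=$rej_n"
```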
Please accept the answer so the question is marked solved on the website. To do that, click on the green check mark on the left side of the answer.