GATK LiftoverVcf finishes with no output VCF and no error.
22 months ago
jpablo • 0

Hi, I'm trying to lift over a VCF from the ALFA project in order to get the frequencies on hg19 instead of hg38, but LiftoverVcf (latest version) finishes the run with nothing written to either the output or the reject file. There is no error message either, even with VERBOSITY DEBUG (apart from 3 match failures, printed after a long block of correct mappings that are not reflected in the output VCF).

My source VCF is: https://ftp.ncbi.nih.gov/snp/population_frequency/latest_release/freq.vcf.gz

VCF preparation: I needed to tweak this VCF a little due to version requirements: the header states version 4.0, and that version makes the GT field mandatory, which is not present in this VCF. I solved it by changing the version to 4.1. The VCF also required renaming the chromosomes from RefSeq to UCSC names, which I did successfully with bcftools annotate --rename-chrs.
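
As a rough sketch of that preparation step (file names and the RefSeq-to-UCSC mapping file refseq2ucsc.txt are placeholders, not the exact commands I ran):

# change the header version from VCFv4.0 to VCFv4.1
bcftools view -h freq.vcf.gz | sed 's/^##fileformat=VCFv4.0/##fileformat=VCFv4.1/' > new_header.txt
bcftools reheader -h new_header.txt -o freq.v41.vcf.gz freq.vcf.gz

# rename RefSeq contigs (e.g. NC_000001.11) to UCSC names (e.g. chr1)
# refseq2ucsc.txt: two tab-separated columns, RefSeq name then UCSC name
bcftools annotate --rename-chrs refseq2ucsc.txt -Oz -o freq.ucsc.vcf.gz freq.v41.vcf.gz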

I also cropped unnecessary records from the VCF and kept a plain-text VCF with 3000 lines. The 3000-line VCF works like a charm and I get the liftover result as expected, but for larger files (20000 lines, for instance) I get no output. In the real process I have to lift over 16,000,000 records. I tried to reduce the number of records held in RAM with --MAX_RECORDS_IN_RAM 10000, with no success.
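
For reference, the truncated test file can be built with something like this (a sketch; the 3000 lines include the header lines, and freq.ucsc.vcf.gz is the renamed VCF from the step above):

zcat freq.ucsc.vcf.gz | head -n 3000 > 3000.vcf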

GATK version used: 4.3.0.0. Command:

/var/test/maf/gatk4/gatk-4.3.0.0/gatk LiftoverVcf \
--java-options " -Xmx10g" \
--CHAIN /var/test/maf/chains/hg38ToHg19.over.chain.gz \
--INPUT /var/test/maf/alfa/3000.vcf  \
--OUTPUT 3000Hg19.vcf \
--REFERENCE_SEQUENCE /var/test/maf/genomes/hg19.fa \
--REJECT reject.vcf \
--VERBOSITY DEBUG \
--MAX_RECORDS_IN_RAM 1000

The reference genome was downloaded from GoldenPath: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz

The chain file comes from UCSC:  http://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz  
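
Note that LiftoverVcf also expects a FASTA index (.fai) and a sequence dictionary (.dict) alongside the target reference. A minimal sketch of preparing them, assuming samtools and the gatk wrapper are on the PATH:

gunzip hg19.fa.gz
samtools faidx hg19.fa                      # writes hg19.fa.fai
gatk CreateSequenceDictionary -R hg19.fa    # writes hg19.dict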

Program log:

Using GATK jar /var/test/maf/gatk4/gatk-4.3.0.0/gatk-package-4.3.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx10g -jar /var/test/maf/gatk4/gatk-4.3.0.0/gatk-package-4.3.0.0-local.jar LiftoverVcf --CHAIN /var/test/maf/chains/hg38ToHg19.over.chain.gz --INPUT /var/test/maf/alfa/3000.vcf --OUTPUT 3000Hg19.vcf --REFERENCE_SEQUENCE /var/test/maf/genomes/hg19.fa --REJECT reject.vcf --VERBOSITY DEBUG --MAX_RECORDS_IN_RAM 1000
13:36:11.515 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/var/test/maf/gatk4/gatk-4.3.0.0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Sat Jan 14 13:36:11 UTC 2023] LiftoverVcf --INPUT /var/test/maf/alfa/3000.vcf --OUTPUT 3000Hg19.vcf --CHAIN /var/test/maf/chains/hg38ToHg19.over.chain.gz --REJECT reject.vcf --VERBOSITY DEBUG --MAX_RECORDS_IN_RAM 1000 --REFERENCE_SEQUENCE /var/test/maf/genomes/hg19.fa --WARN_ON_MISSING_CONTIG false --LOG_FAILED_INTERVALS true --WRITE_ORIGINAL_POSITION false --WRITE_ORIGINAL_ALLELES false --LIFTOVER_MIN_MATCH 1.0 --ALLOW_MISSING_FIELDS_IN_HEADER false --RECOVER_SWAPPED_REF_ALT false --TAGS_TO_REVERSE AF --TAGS_TO_DROP MAX_AF --DISABLE_SORT false --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Sat Jan 14 13:36:11 UTC 2023] Executing as root@ip-172-31-93-206.ec2.internal on Linux 5.10.135-122.509.amzn2.x86_64 amd64; OpenJDK 64-Bit Server VM 17.0.5+8-LTS; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.3.0.0
INFO    2023-01-14 13:36:12     LiftoverVcf     Loading up the target reference genome.
INFO    2023-01-14 13:36:24     LiftoverVcf     Lifting variants over and sorting (not yet writing the output file.)
DEBUG   2023-01-14 13:36:25     SnappyLoader    Snappy successfully loaded.
INFO    2023-01-14 13:36:25     LiftOver        Interval chr1:1638937-1638948 failed to match chain 2 because intersection length 11 < minMatchSize 12.0 (0.9166667 < 1.0)
INFO    2023-01-14 13:36:25     LiftOver        Interval chr1:1655886-1655894 failed to match chain 2 because intersection length 5 < minMatchSize 9.0 (0.5555556 < 1.0)
INFO    2023-01-14 13:36:25     LiftOver        Interval chr1:1662738-1662739 failed to match chain 2 because intersection length 1 < minMatchSize 2.0 (0.5 < 1.0)

Any help would be much appreciated. Thanks in advance!

Juan Pablo

22 months ago

Work with a smaller input VCF (a few variants...) to check everything is OK. Check that you have enough memory (10 GB), and check that the process wasn't killed by the server.
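
Two quick checks along those lines (a sketch; the grep pattern is an assumption, adjust it to your kernel's wording):

# was the process terminated by the kernel OOM killer?
dmesg | grep -i -E 'out of memory|killed process'

# how much memory is available on the machine?
free -h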


Thank you Pierre. I moved the process to a system with more memory and it worked fine. I was puzzled by the fact that there was no error message related to memory, but obviously that was the cause.


Please accept the answer so the question is marked solved on the website. To do that, click on the green check mark on the left side of the answer.

22 months ago

I have noticed that your VCF contains indels, and Picard LiftoverVcf will not swap reference and alternate alleles for these. The VCF also includes a FORMAT/AC field that will not be updated by Picard LiftoverVcf when a swap between the reference and alternate allele takes place. I have recently implemented a tool that automatically handles these shortcomings. The tool is available here and binaries are available here (for this particular VCF, make sure that you are running the development version or a version newer than 2022-12-21). There are no memory requirements other than those needed to sort a VCF (<1 GB). One minor shortcoming is that you will need the FASTA files for both references.

After installing the required binaries and downloading the liftover resources:

# source VCF and its index
wget http://ftp.ncbi.nih.gov/snp/population_frequency/latest_release/freq.vcf.gz
wget http://ftp.ncbi.nih.gov/snp/population_frequency/latest_release/freq.vcf.gz.tbi
# chain file and chromosome alias table
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromAlias.txt
# source and target references, indexed with samtools
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz
samtools faidx hg38.fa
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
gunzip hg19.fa.gz
samtools faidx hg19.fa

You can simply perform the liftover as follows (it should take about an hour to run):

awk -F"\t" 'NR>1 {print $5"\t"$1}' hg38.chromAlias.txt | \
bcftools annotate -Ou --rename-chrs - freq.vcf.gz | \
bcftools norm -Ou -f hg38.fa -c x -T <(awk '{print $1"\t1\t"$2}' hg38.fa.fai) | \
bcftools +liftover -Ou -- -s hg38.fa -f hg19.fa -c hg38ToHg19.over.chain.gz | \
bcftools sort -Oz --temp-dir ./bcftools. | \
tee freq.hg19.vcf.gz | \
bcftools index -f -t -o freq.hg19.vcf.gz.tbi

Notice that some variants from contigs not present in hg38 are automatically dropped (42,814 out of 904,167,063).

The output will include lines like this:

Writing to ./bcftools.XXXXXX
FORMAT/AC is handled by AC rule
Lines   total/split/realigned/skipped:  904123186/0/0/18
Lines   total/swapped/reference added/rejected: 904123168/115059/49619/16617302
Merging 386 temporary files

One line indicates that 18 variants were dropped by bcftools norm due to mismatches with the reference (mostly due to IUPAC bases in the VCF, which are not allowed by the VCF specification; a quick check for such records is sketched after the list below), and one line gives you a summary of the liftover, indicating:

  • 904,123,168 variants total
  • 115,059 variants for which a reference⇆alternate allele swap was required
  • 49,619 variants for which a new reference allele was added as neither of the provided alleles matched the new reference
  • 16,617,302 variants dropped due to their coordinates missing from the chain
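
A quick way to spot records with IUPAC ambiguity codes in the REF column (a sketch of my own, not part of the original answer):

zcat freq.vcf.gz | grep -v '^#' | awk -F'\t' '$4 ~ /[RYSWKMBDHVryswkmbdhv]/' | head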

Notice that for swapped and reference-added variants the FORMAT/AC field will be updated.

19 months ago
Afrooz • 0

Make sure the chromosome naming in your input is in line with the target reference. For instance, if your input uses "chr22" but your reference uses "22", nothing will be lifted over.
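
A quick way to compare the two namings (a sketch, reusing the file names from the question):

# contig names actually used in the VCF records
grep -v '^#' 3000.vcf | cut -f1 | sort -u | head

# contig names in the target reference (requires hg19.fa to be indexed with samtools faidx)
cut -f1 hg19.fa.fai | head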
