Question

error when running phg create-maf-vcf to build a built practical haplotype graph (v2)

0

Entering edit mode

8 months ago

b.contreras.moreira ▴ 310

Hi, after successfully running

phg create-ref-vcf \
--bed /data/phg/data/reference/ref_ranges.bed \
--reference-file ... \
--reference-name ... \
--db-path /data/phg

The next step failed, can't see any other log message that I can post here:

(phgv2-conda) $ phg create-maf-vcf     --db-path /data/phg     --bed /data/phg/data/reference/ref_ranges.bed     --reference-file ...fa     --maf-dir /data/phg/data/alignment_files     -o  /data/phg/data/vcf_files

[main] INFO net.maizegenetics.phgv2.utils.VariantLoadingUtils 2024-03-19 13:35:51,445: begin command:conda run -n phgv2-conda tiledbvcf stat --uri /data/phg/hvcf_dataset
[main] INFO net.maizegenetics.phgv2.utils.VariantLoadingUtils 2024-03-19 13:35:52,960: Using  TileDB datasets created in folder /data/phg.
[main] INFO net.maizegenetics.phgv2.cli.CreateMafVcf 2024-03-19 13:35:53,370: CreateASMHvcfs: calling buildRefGenomeSeq
[main] INFO net.maizegenetics.phgv2.cli.CreateMafVcf 2024-03-19 13:36:38,240: CreateASMHvcfs: processing /data/phg/data/alignment_files/HOR13942.maf
getVariantContextsfromMAF: Processing a single genome
Exception in thread "main" java.lang.NullPointerException
at biokotlin.genome.MAFToGVCF.buildRefBlockVariantInfo(MAFToGVCF.kt:990)
at biokotlin.genome.MAFToGVCF.buildTempVariants(MAFToGVCF.kt:542)
at biokotlin.genome.MAFToGVCF.buildVariantsForAllAlignments(MAFToGVCF.kt:246)
at biokotlin.genome.MAFToGVCF.getVariantContextsfromMAF(MAFToGVCF.kt:167)
at biokotlin.genome.MAFToGVCF.getVariantContextsfromMAF$default(MAFToGVCF.kt:136)
at net.maizegenetics.phgv2.cli.CreateMafVcf.getGVCFVariantsFromMafFile(CreateMafVcf.kt:142)
at net.maizegenetics.phgv2.cli.CreateMafVcf.getGVCFVariantsFromMafFile$default(CreateMafVcf.kt:141)
at net.maizegenetics.phgv2.cli.CreateMafVcf.createASMHvcfs(CreateMafVcf.kt:80)
at net.maizegenetics.phgv2.cli.CreateMafVcf.createASMHvcfs$default(CreateMafVcf.kt:67)
at net.maizegenetics.phgv2.cli.CreateMafVcf.run(CreateMafVcf.kt:611)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:306)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:319)
at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:40)
at com.github.ajalt.clikt.core.CliktCommand.parse(CliktCommand.kt:458)
at com.github.ajalt.clikt.core.CliktCommand.parse$default(CliktCommand.kt:455)
at com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:475)
at com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:482)
at net.maizegenetics.phgv2.cli.PhgKt.main(Phg.kt:45)

Any help appreciated, Bruno

pangenome phg_v2 plants PHG • 2.3k views

ADD COMMENT • link 7 months ago by b.contreras.moreira ▴ 310

0

Entering edit mode

WIll you check your reference fasta and the maf file to verify the chromosome names for the MAF reference matches the chromosome names used in the reference fasta?

ADD REPLY • link 8 months ago by lcj34 ▴ 420

0

Entering edit mode

Please see below lcj34 :

grep "^>" /data/phg/data/reference/reference.fa

>chr1H_LR890096.1
>chr2H_LR890097.1
>chr3H_LR890098.1
>chr4H_LR890099.1
>chr5H_LR890100.1
>chr6H_LR890101.1
>chr7H_LR890102.1
>CAJHDD010000001.1
>CAJHDD010000002.1
...
>CAJHDD010000283.1

And:

perl -lane 'print $F[1] if(/^s/)' data/alignment_files/assembly.maf | sort -u 
chr1H
chr1H_LR890096.1
chr2H
chr2H_LR890097.1
chr3H
chr3H_LR890098.1
chr4H
chr4H_LR890099.1
chr5H
chr5H_LR890100.1
chr6H
chr6H_LR890101.1
chr7H
chr7H_LR890102.1
chrUn

Only thing I can see is that unplaced contigs in the reference (CAJHDD...) do not appear in the MAF file

ADD REPLY • link 8 months ago by b.contreras.moreira ▴ 310

0

Entering edit mode

In the second listing - I see they are sorted, so I cannot tell if the ref is chr1H or chr1H_LR890096.1. If the former, this would be the problem. (inconsistent ref naming). If the latter, then something else is going on . If the ref contig names are the same, the send me privately your reference file and MAF file and I'll try to debug.

ADD REPLY • link 8 months ago by lcj34 ▴ 420

0

Entering edit mode

Hi the MAF file in question looks like this:

(phgv2-conda) $ perl -lne 'print substr($_,0,60)' data/alignment_files/assembly.maf       
##maf version=1
a   score=-474
s   chr4H_LR890099.1                                610201403      
s   chr4H                                           617685547      

a   score=-35615
s   chr1H_LR890096.1                                    76743      
s   chrUn                                              452495      
...

I will email you the tarred files for your debugging, thanks for the offer, Bruno

ADD REPLY • link 8 months ago by b.contreras.moreira ▴ 310

0

Entering edit mode

I am re-running leaving out unplaced contigs to see if that fixes it, will let you know

ADD REPLY • link 8 months ago by b.contreras.moreira ▴ 310

1

Entering edit mode

Bruno - that won't fix it. I found a problem with the BioKotlin library that phgv2 calls. I have fixed the library (also Buckler Lab code) and tested with your data. Once my fix is approved I will create a new jar file, then we'll rebuild phgv2 with this new code and make that available.

On a related note, the latest version of the phgv2 documents (as of today) are clearer on the requirements of running the annotate-fastas command. This is necessary to keep the AGC compressed assemblies (sequence) and tiledb datasets (variants) in sync with the same sample names. I don't know if you have run that step, but it shoudl be run prior to running agc-compress or align-assemblies. See the documents for further details.

ADD REPLY • link 8 months ago by lcj34 ▴ 420

0

Entering edit mode

BTW - we appreciate you testing phgv2 and your patience as we work out issues.

ADD REPLY • link 8 months ago by lcj34 ▴ 420

0

Entering edit mode

As agreed by email, I have now updated the phg code to version 2.2.65.114. Now I can go past that stage, only to fail with bcftools index:

[main] INFO net.maizegenetics.phgv2.utils.VariantLoadingUtils 2024-03-22 17:08:16,949: begin Command:conda run -n phgv2-conda tiledbvcf stat --uri /data/phg/hvcf_dataset
[main] INFO net.maizegenetics.phgv2.utils.VariantLoadingUtils 2024-03-22 17:08:18,546: Using  TileDB datasets created in folder /data/phg.
[main] INFO net.maizegenetics.phgv2.cli.CreateMafVcf 2024-03-22 17:08:19,051: CreateASMHvcfs: calling buildRefGenomeSeq
[main] INFO net.maizegenetics.phgv2.cli.CreateMafVcf 2024-03-22 17:09:04,488: CreateASMHvcfs: processing /data/phg/data/alignment_files/HOR13942.maf
getVariantContextsfromMAF: Processing a single genome
[main] INFO net.maizegenetics.phgv2.cli.CreateMafVcf 2024-03-22 17:24:14,229: createASMHvcfs: gvcfVariants.size == 1
[main] INFO net.maizegenetics.phgv2.cli.CreateMafVcf 2024-03-22 17:24:14,231: createASMHvcfs: processing sampleName = HOR13942
[main] INFO net.maizegenetics.phgv2.utils.VariantLoadingUtils 2024-03-22 17:30:06,941: bgzipping  file /data/phg/data/vcf_files/HOR13942.g.vcf
[main] WARN net.maizegenetics.phgv2.utils.VariantLoadingUtils 2024-03-22 17:35:43,310:
ERROR 255 creating tabix indexed  version of file: /data/phg/data/vcf_files/HOR13942.g.vcf.gz
Exception in thread "main" java.lang.IllegalStateException: bgzipAndIndexGVCFfile: error bgzipping and/or indexing file /data/phg/data/vcf_files/HOR13942.g.vcf: Message:bgzipAndIndexGVCFfile: error trying to run bcftools index -c on file /data/phg/data/vcf_files/HOR13942.g.vcf.gz: 255
...

When I run that last command manualy it says:

[E::hts_idx_push] Chromosome blocks not continuous
index: failed to create index for "/data/phg/data/vcf_files/HOR13942.g.vcf.gz"

ADD REPLY • link 8 months ago by b.contreras.moreira ▴ 310

0

Entering edit mode

You will need to sort the file before indexing.

ADD REPLY • link 8 months ago by GenoMax 147k

0

Entering edit mode

Thanks GenoMax , I would have thought phg took care of that, right lcj34 ?

Edit: I tested sort followed by index and it worked, so I guess whatever the code is trying to index needs to be properly sorted first?

ADD REPLY • link 8 months ago by b.contreras.moreira ▴ 310

0

Entering edit mode

Yes, the gvcf created by phg should have been sorted. I am able to reproduce the problem using your file. Will look into this and let you know.

ADD REPLY • link 8 months ago by lcj34 ▴ 420

0

Entering edit mode

Bruno - I have fixed this issue in our library. The fix won't be available in phgv2 for a couple days as the person who creates our BioKotlin releases (phgv2 uses this library) is out early this week. It should be available by the end of the week - I'll let you know.

ADD REPLY • link 8 months ago by lcj34 ▴ 420

0

Entering edit mode

Thanks lcj34

ADD REPLY • link 8 months ago by b.contreras.moreira ▴ 310

0

Entering edit mode

FYI - phgv2 version Release 2.2.68.117 has the updated BioKotlin release with this fix.

ADD REPLY • link 8 months ago by lcj34 ▴ 420

0

Entering edit mode

Hi lcj34 , I downloaded the latest version and got a different error now:

phg --version
phg version 2.2.68.117
...

[main] INFO net.maizegenetics.phgv2.utils.VariantLoadingUtils 2024-03-30 22:45:05,221: bgzipping  file /data/phg/data/vcf_files/HOR13942.g.vcf
[main] WARN net.maizegenetics.phgv2.utils.VariantLoadingUtils 2024-03-30 22:50:52,213:
ERROR 255 creating tabix indexed  version of file: /data/phg/data/vcf_files/HOR13942.g.vcf.gz
Exception in thread "main" java.lang.IllegalStateException: bgzipAndIndexGVCFfile: error bgzipping and/or indexing file /data/phg/data/vcf_files/HOR13942.g.vcf: Message:bgzipAndIndexGVCFfile: error trying to run bcftools index -c on file /data/phg/data/vcf_files/HOR13942.g.vcf.gz: 255
    at net.maizegenetics.phgv2.utils.VariantLoadingUtilsKt.bgzipAndIndexGVCFfile(VariantLoadingUtils.kt:140)
    at net.maizegenetics.phgv2.cli.CreateMafVcf.createASMHvcfs(CreateMafVcf.kt:88)
    at net.maizegenetics.phgv2.cli.CreateMafVcf.createASMHvcfs$default(CreateMafVcf.kt:67)
    at net.maizegenetics.phgv2.cli.CreateMafVcf.run(CreateMafVcf.kt:611)
    at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:279)
    at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:292)
    at com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:41)
    at com.github.ajalt.clikt.core.CliktCommand.parse(CliktCommand.kt:457)
    at com.github.ajalt.clikt.core.CliktCommand.parse$default(CliktCommand.kt:454)
    at com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:474)
    at com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:481)
    at net.maizegenetics.phgv2.cli.PhgKt.main(Phg.kt:46)

ADD REPLY • link 8 months ago by b.contreras.moreira ▴ 310

1

Entering edit mode

Thanks, Bruno. I found a place with a missing sort call (working when BioKotlin was called directly, but when called from phgv2 a sort call was missing). I've fixed that - will let you know when the code has been reviewed and merge.

ADD REPLY • link 7 months ago by lcj34 ▴ 420

0

Entering edit mode

This fix is now available in phgv2 version 2.2.72.121

ADD REPLY • link 7 months ago by lcj34 ▴ 420

0

Entering edit mode

I confirm this now works with my test data, many thanks lcj34 !

ADD REPLY • link 7 months ago by b.contreras.moreira ▴ 310