Hello Biostars community,
We would like to ask whether anyone has been able to create an index using hisat2-build with --ss (splice sites) and --exon on the RefSeq files listed below. If you have run into related issues in the past, it would also be useful to hear any advice or lessons learned.
Current pain point:
- hisat2-build stalls at generation 4 for 20 hours (log below), even though top indicates that the program is still running.
- The build for GCF_000001405.40_GRCh38.p14 completes successfully without --ss and --exon.
- In the past, we have successfully built an index with these settings (--exon and --ss) on the Gencode v42/41/39 comprehensive fasta and gtf files.
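To pin down when a build stops producing output, one option is to timestamp each log line as it arrives. The following is only a minimal sketch: `stamp_lines` is a hypothetical helper name, not part of HISAT2, and the timestamps show when a line reached the pipe, which can lag behind the program if its output is buffered.

```shell
# Prefix every line read from stdin with the current wall-clock time,
# so a long gap between two log lines can be dated afterwards.
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done
}

# Hypothetical usage (EXONS, SPLICE_SITES, GENOME.fna, INDEX_BASE and
# build.log are placeholders):
#   hisat2-build -p 4 --exon EXONS --ss SPLICE_SITES GENOME.fna INDEX_BASE 2>&1 \
#       | stamp_lines | tee build.log
```

Comparing the last timestamp in the log with the CPU time reported by top gives a rough idea of whether the process is still working or has been silent since a given step.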
Setup: hisat2 version 2.2.1 (latest)
RefSeq files:
- GCF_000001405.40_GRCh38.p14_genomic.fna.gz 2024-08-27 09:57 928M
- GCF_000001405.40_GRCh38.p14_genomic.gtf.gz 2024-08-27 09:57 54M
Machine: local VM, 20-core CPU, 256 GB RAM, 1.6 TB disk space
Extract exon: hisat2_extract_exons.py GCF_000001405.40_GRCh38.p14_genomic.sorted.gtf > Original_refseq1405.40_extractexon
Extract splice: hisat2_extract_splice_sites.py GCF_000001405.40_GRCh38.p14_genomic.sorted.gtf > Original_refseq1405.40_extractsplice
hisat2-build: hisat2-build -p 4 --exon Original_refseq1405.40_extractexon --ss Original_refseq1405.40_extractsplice GCF_000001405.40_GRCh38.p14_genomic.fna HISAT_RefSeq1405_40_Full_Index_SS_Exon
(also tested without any thread assignment (-p))
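Before starting a long build, it may be worth checking that the sequence names in the extracted exon and splice-site files match the FASTA headers. The sketch below is not a diagnosis of the stall described here. `missing_seqnames` is a hypothetical helper, and it assumes that the name is the header text up to the first whitespace and that column 1 of the extracted files holds the sequence name.

```shell
# List sequence names that appear in column 1 of a tab-separated
# annotation file but not among the FASTA header names.
# Empty output means every annotated name has a matching sequence.
missing_seqnames() {
    fasta=$1
    annot=$2
    tmpdir=$(mktemp -d) || return 1
    # Header name = text after '>' up to the first whitespace (assumption).
    grep '^>' "$fasta" | sed 's/^>//' | awk '{print $1}' | sort -u > "$tmpdir/fasta_names"
    cut -f1 "$annot" | sort -u > "$tmpdir/annot_names"
    # comm -13 keeps only the lines unique to the second (annotation) list.
    comm -13 "$tmpdir/fasta_names" "$tmpdir/annot_names"
    rm -rf "$tmpdir"
}
```

With the files from this post, the call would look like `missing_seqnames GCF_000001405.40_GRCh38.p14_genomic.fna Original_refseq1405.40_extractsplice`, and likewise for the exon file.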
Thanks in advance, Chris
Log output:
Settings:
Output files: "HISAT_RefSeq1405_40_Full_Index/HISAT_RefSeq1405_40_Full_Index_SS_Exon.*.ht2"
Line rate: 7 (line is 128 bytes)
Lines per side: 1 (side is 128 bytes)
Offset rate: 4 (one in 16)
FTable chars: 10
Strings: unpacked
Local offset rate: 3 (one in 8)
Local fTable chars: 6
Local sequence length: 57344
Local sequence overlap between two consecutive indexes: 1024
Endianness: little
Actual local endianness: little
Sanity checking: disabled
Assertions: disabled
Random seed: 0
Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
GCF_000001405.40_GRCh38.p14_genomic.fna
Reading reference sizes
Time reading reference sizes: 00:00:42
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
Time to join reference sequences: 00:00:13
Time to read SNPs and splice sites: 00:00:02
Generation 0 (3137151402 -> 3137151402 nodes, 0 ranks)
COUNTED NEW NODES: 8
COUNTED TEMP NODES: 0
RESIZED NODES: 20
RESIZED NODES: 0
MADE NEW NODES: 22
Generation 1 (3137468813 -> 3137468813 nodes, 0 ranks)
COUNTED NEW NODES: 6
COUNTED TEMP NODES: 0
RESIZED NODES: 19
RESIZED NODES: 0
MADE NEW NODES: 23
Generation 2 (3138104089 -> 3138104089 nodes, 0 ranks)
COUNTED NEW NODES: 6
COUNTED TEMP NODES: 0
RESIZED NODES: 20
RESIZED NODES: 0
MADE NEW NODES: 23
Generation 3 (3139375392 -> 3139375392 nodes, 0 ranks)
BUILT FROM_INDEX: 17
COUNTED NEW NODES: 6
COUNTED TEMP NODES: 0
RESIZED NODES: 20
RESIZED NODES: 0
MADE NEW NODES: 24
RESIZE NODES: 68
COUNT NUMBER IN EACH BIN: 14
FINISHED FIRST ROUND: 26
26 789568741
103 841602280
67 786479349
170 724271090
FINISHED RECURSIVE SORTS: 87
SORT NODES: 127
MERGE, UPDATE RANK: 69
Generation 4 (3141921460 -> 3141356782 nodes, 1139244126 ranks)
ALLOCATE FROM_TABLE: 32
COUNT NUMBER IN EACH BIN: 13
FINISHED FIRST ROUND: 36
94 789748680
94 789476690
93 781444256
93 780687156
FINISHED RECURSIVE SORTS: 71
BUILD TABLE: 120
BUILD INDEX: 18
Hello Biostars community,
For anyone else affected by this issue: it went away with a bare-metal install of Linux. We did not hear back from the HISAT2 developers.
Happy to share further details if anyone wants to reach out, Chris.