Hello, I am trying to create a new index from Ensembl reference files, and the index builder is taking far longer than I am used to. The build command has now been running for >48 hrs, and I am unsure why it is taking so long or whether it is working at all.
I am running: hisat2-build -p 6 --ss /path/to/CanFam3.1.97_intron.bed --exon /path/to/CanFam3.1.97_exonsFile.table -f /path/to/Canis_familiaris.CanFam3.1.dna.toplevel.fa CanFam3.1.97
And the output I have gotten from this run so far is:
Settings:
Output files: "CanFam3.1.97.*.ht2"
Line rate: 7 (line is 128 bytes)
Lines per side: 1 (side is 128 bytes)
Offset rate: 4 (one in 16)
FTable chars: 10
Strings: unpacked
Local offset rate: 3 (one in 8)
Local fTable chars: 6
Local sequence length: 57344
Local sequence overlap between two consecutive indexes: 1024
Endianness: little
Actual local endianness: little
Sanity checking: disabled
Assertions: disabled
Random seed: 0
Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
/scratch/clove/canids/Reference/Genome/Ensemble/Canis_familiaris.CanFam3.1.dna.toplevel.fa
Reading reference sizes
Time reading reference sizes: 00:00:17
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
Time to join reference sequences: 00:00:13
But it has been stuck on this last line, 'Time to join reference sequences', for >12 hrs.
The .fa file appears to be formatted correctly:
>1 dna:chromosome chromosome:CanFam3.1:1:1:122678785:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTATGTGAGAAGATAGCTGAA
CGCCTTGTCCACATCATCTTACTGCTGAGAGTTGAGCTCACCCTCAGTCCCTCACAGTTC
CACACTGCCTGCAGAGTGAGTTTCCCATGTCTTCACCAGAGACTTTTGCCAGAGGCTTCT
GAGACGCAAGTTAACAATGCAGACCTGGAGGGTATCTCCAGGTGCAGTAGAGTGGTAATC
TCGGAACCTCCTGACTCAGAATACTGCTACCTTCACACTGTCATAAGAATGCAGCGAGTT
GAGAGCTGGCTTCTAGGCATGCTTCCTTTTGAGAGCTGAGGACAGGACAGAACCCTCCCG
CATCCTGCCTGACTGTAGACGTACCTGCTAACCTCCTCATGTTAGTGGCTGGGATAGATT
GTGGGAAAAGCATGTGTAAGCATTGGGCCTGAACTCCCGTGTATCTGAGTTGAATACAGC
As does the GTF file from which the intron and exon files were created:
X ensembl gene 1575 5716 . + . gene_id "ENSCAFG00000010935"; gene_version "3"; gene_source "ensembl"; gene_biotype "protein_coding";
X ensembl transcript 1575 5716 . + . gene_id "ENSCAFG00000010935"; gene_version "3"; transcript_id "ENSCAFT00000017396"; transcript_version "3"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding";
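One check that can be run on inputs like these is whether the sequence names in the annotation-derived files match the FASTA headers (for example `X` in the GTF against `>1 dna:chromosome ...` style headers in the .fa file). The following is only a sketch: the function name `missing_seq_names` is made up for illustration, and it assumes the sequence name is the first whitespace-separated field in both files. It is not a diagnosis of the slow build.

```shell
# Sketch: print sequence names that appear in an annotation file (first column
# of a GTF or of the splice-site/exon tables) but have no header in the FASTA.
# The function name and argument order are hypothetical.
missing_seq_names() {
    fasta=$1
    annot=$2
    tmp_fa=$(mktemp)
    tmp_an=$(mktemp)
    # FASTA name = text after '>' up to the first whitespace,
    # e.g. ">1 dna:chromosome ..." gives "1"
    grep '^>' "$fasta" | sed 's/^>//' | awk '{print $1}' | LC_ALL=C sort -u > "$tmp_fa"
    # Annotation name = first field; skip '#' comment lines and blank lines
    awk '!/^#/ && NF {print $1}' "$annot" | LC_ALL=C sort -u > "$tmp_an"
    # comm -13 keeps only the lines unique to the second (annotation) list
    LC_ALL=C comm -13 "$tmp_fa" "$tmp_an"
    rm -f "$tmp_fa" "$tmp_an"
}
```

Called as `missing_seq_names /path/to/Canis_familiaris.CanFam3.1.dna.toplevel.fa /path/to/CanFam3.1.97_intron.bed`, empty output means every name in the annotation file was found among the FASTA headers.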
Can anyone help me determine why this index is taking far longer to build than the ones I have created in the past?
Thank you for your help!
Is it still running? You can check with the
top
command in a new terminal window.
Yes, it does appear to still be running.
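For anyone who wants a non-interactive version of the check suggested above, here is a minimal sketch. The function name `proc_status` is hypothetical, and it assumes you already know the PID of the build (for instance from the PID column in `top`). It only reports whether the process exists and how long it has been running; it cannot tell whether the build is making progress.

```shell
# Sketch: report whether a process with the given PID is alive and, if so,
# print its PID, elapsed time, CPU share, resident memory (KiB), and command.
# The function name is hypothetical.
proc_status() {
    pid=$1
    # kill -0 sends no signal; it only tests that the PID exists and that
    # you are allowed to signal it
    if kill -0 "$pid" 2>/dev/null; then
        echo "running"
        # '=' after each field name suppresses the header line
        ps -o pid=,etime=,pcpu=,rss=,comm= -p "$pid" 2>/dev/null || true
        return 0
    else
        echo "not running"
        return 1
    fi
}
```

Running `proc_status <PID>` a few minutes apart and comparing the CPU and memory columns gives a rough idea of whether the process is active or idle.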
Have you solved the problem yet? I have the same problem.