As I mentioned in my earlier post, I was unable to index a toplevel genome (both unmasked and soft-masked) with HISAT2, and I still have the problem. I'm using the following command:
hisat2-build -f Mus_musculus.GRCm38.dna.toplevel.fa.gz Cm3895_ht2/GRCm38
First, it prints these warnings many times:
Warning: Encountered empty reference sequence
Warning: Encountered reference sequence with only gaps
After some time, it fails with the following error:
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
Reference file does not seem to be a FASTA file
Time to join reference sequences: 00:00:00
Total time for call to driver() for forward index: 00:28:31
Error: Encountered internal HISAT2 exception (#1)
Command: hisat2-build --wrapper basic-0 -f Mus_musculus.GRCm38.dna.toplevel.fa.gz Cm3895_ht2/GRCm38
Deleting "Cm3895_ht2/GRCm38.1.ht2" file written during aborted indexing attempt.
Deleting "Cm3895_ht2/GRCm38.2.ht2" file written during aborted indexing attempt.
Deleting "Cm3895_ht2/GRCm38.3.ht2" file written during aborted indexing attempt.
Deleting "Cm3895_ht2/GRCm38.4.ht2" file written during aborted indexing attempt.
Deleting "Cm3895_ht2/GRCm38.5.ht2" file written during aborted indexing attempt.
Deleting "Cm3895_ht2/GRCm38.6.ht2" file written during aborted indexing attempt.
Deleting "Cm3895_ht2/GRCm38.7.ht2" file written during aborted indexing attempt.
Deleting "Cm3895_ht2/GRCm38.8.ht2" file written during aborted indexing attempt.
Previously, I had no problems when using separate chromosome files. Is there anything I'm missing when using the toplevel genome? Thanks.
I think the answer is in the error log: 'Reference file does not seem to be a FASTA file'. Try decompressing the reference file to plain FASTA and running the index build again.
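A minimal sketch of that suggestion, assuming the reference is gzip-compressed and that hisat2-build needs the uncompressed file, as the error message suggests. The helper name prepare_fasta is made up for illustration; the file names are the ones from the question.

```shell
# Hypothetical helper: decompress a reference if it ends in .gz,
# check that it looks like FASTA, and print the path to pass on.
prepare_fasta() {
    src="$1"
    case "$src" in
        *.gz)
            out="${src%.gz}"
            # -c writes to stdout, so the original .gz file is kept.
            gzip -dc "$src" > "$out" || return 1
            ;;
        *)
            out="$src"
            ;;
    esac
    # A FASTA file starts with a '>' header line.
    if [ "$(head -c 1 "$out")" != ">" ]; then
        echo "not a FASTA file: $out" >&2
        return 1
    fi
    printf '%s\n' "$out"
}

# Usage with the paths from the question (not run here):
# fa=$(prepare_fasta Mus_musculus.GRCm38.dna.toplevel.fa.gz) &&
#     hisat2-build -f "$fa" Cm3895_ht2/GRCm38
```

The first-character check is only a quick sanity test. It would catch a file that is still compressed, but not every malformed FASTA.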
Yes, it worked after decompressing. Gzipped files normally work with the main hisat2 command, so this cause didn't occur to me. Thank you.
Worth reading: https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use