I consistently get the following error message when I try to build a HISAT2 index for a mouse genome.
Error: Encountered internal HISAT2 exception (#1)
My call to hisat2 is as follows:
hisat2-build -f -p 8 --ss genome.ss --exon genome.exon $GENOME genome_tran
Where "$GENOME" contains a comma-separated list of FASTA files (one for each chromosome).
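For illustration, a list of that form can be assembled as in the sketch below. The helper name `join_fasta_list` is hypothetical and is not part of HISAT2 or of the script posted further down, which builds the same string with a bash array and `sed`.

```shell
# Hypothetical helper: turn chromosome names into "chr.<name>.fa" entries joined by commas.
join_fasta_list() {
    list=""
    for c in "$@"; do
        # ${list:+$list,} adds a comma separator only once list is non-empty.
        list="${list:+$list,}chr.$c.fa"
    done
    printf '%s\n' "$list"
}

GENOME=$(join_fasta_list 1 2 3 X Y MT)
echo "$GENOME"
```

Because the value contains no spaces, it reaches hisat2-build as a single argument even when `$GENOME` is left unquoted.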
I'm using the "build_index.sh" script with some minor modifications. Up until now, I've not had an issue with index building. I'm running this on a Unix server with Slurm job control; I've verified that my job is being assigned 8 CPUs and at least 500 GB of RAM.
My complete HISAT2 output and script are posted below. If anyone has any ideas about how to troubleshoot this, please let me know.
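One way to confirm the allocation from inside the job is sketched below, assuming a Linux compute node and the standard Slurm job environment variables. The function name `report_allocation` is hypothetical, and variables that Slurm did not set are printed as "unset".

```shell
# Print the resources this job actually sees; intended to run inside the batch script.
report_allocation() {
    # CPU count visible to this process, with a getconf fallback if nproc is missing.
    cpus=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
    echo "CPUs visible: ${cpus}"
    # Slurm exports these when the matching sbatch options are in effect.
    echo "SLURM_NTASKS=${SLURM_NTASKS:-unset}"
    echo "SLURM_CPUS_ON_NODE=${SLURM_CPUS_ON_NODE:-unset}"
    echo "SLURM_MEM_PER_NODE=${SLURM_MEM_PER_NODE:-unset}"
}

report_allocation
```

Calling this just before hisat2-build puts the values in the job log next to the HISAT2 output. `SLURM_MEM_PER_NODE` is reported in megabytes, matching `--mem`.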
Complete HISAT2 output:
/home/abf/bin/hisat2-build
/home/abf/bin/hisat2_extract_splice_sites.py
/home/abf/bin/hisat2_extract_exons.py
Settings:
Output files: "genome_tran.*.ht2"
Line rate: 7 (line is 128 bytes)
Lines per side: 1 (side is 128 bytes)
Offset rate: 4 (one in 16)
FTable chars: 10
Strings: unpacked
Local offset rate: 3 (one in 8)
Local fTable chars: 6
Local sequence length: 57344
Local sequence overlap between two consecutive indexes: 1024
Endianness: little
Actual local endianness: little
Sanity checking: disabled
Assertions: disabled
Random seed: 0
Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
chr.1.fa
chr.2.fa
chr.3.fa
chr.4.fa
chr.5.fa
chr.6.fa
chr.7.fa
chr.8.fa
chr.9.fa
chr.10.fa
chr.11.fa
chr.12.fa
chr.13.fa
chr.14.fa
chr.15.fa
chr.16.fa
chr.17.fa
chr.18.fa
chr.19.fa
chr.X.fa
chr.Y.fa
chr.MT.fa
Reading reference sizes
Reading reference sizes
Time reading reference sizes: 00:00:22
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
Time to join reference sequences: 00:00:18
Time to read SNPs and splice sites: 00:00:01
Total time for call to driver() for forward index: 00:20:28
Error: Encountered internal HISAT2 exception (#1)
Command: hisat2-build --wrapper basic-0 -f -p 8 --ss genome.ss --exon genome.exon chr.1.fa,chr.2.fa,chr.3.fa,chr.4.fa,chr.5.fa,chr.6.fa,chr.7.fa,chr.8.fa,chr.9.fa,chr.10.fa,chr.11.fa,chr.12.fa,chr.13.fa,chr.14.fa,chr.15.fa,chr.16.fa,chr.17.fa,chr.18.fa,chr.19.fa,chr.X.fa,chr.Y.fa,chr.MT.fa genome_tran
Deleting "genome_tran.1.ht2" file written during aborted indexing attempt.
Deleting "genome_tran.2.ht2" file written during aborted indexing attempt.
Deleting "genome_tran.3.ht2" file written during aborted indexing attempt.
Deleting "genome_tran.4.ht2" file written during aborted indexing attempt.
Deleting "genome_tran.5.ht2" file written during aborted indexing attempt.
Deleting "genome_tran.6.ht2" file written during aborted indexing attempt.
Deleting "genome_tran.7.ht2" file written during aborted indexing attempt.
Deleting "genome_tran.8.ht2" file written during aborted indexing attempt.
My Script:
#!/bin/bash
#SBATCH --job-name=BUILD_MOUSE_INDEX
#SBATCH --ntasks=8
#SBATCH --mem=512000
# Downloads sequence for the GRCm38 release 98 version of M. musculus (mouse) from
# Ensembl.
#
# By default, this script builds an index for just the base files,
# since alignments to those sequences are the most useful. To change
# which chromosomes are indexed by this script, edit the CHR
# variable below.
#
export PATH=$PATH:/home/abf/bin
declare -a CHR=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Y MT)
declare -a GENOME=()
ENSEMBL_RELEASE=98
ENSEMBL_GRCm38_BASE=ftp://ftp.ensembl.org/pub/release-${ENSEMBL_RELEASE}/fasta/mus_musculus/dna
ENSEMBL_GRCm38_GTF_BASE=ftp://ftp.ensembl.org/pub/release-${ENSEMBL_RELEASE}/gtf/mus_musculus
GTF_FILE=Mus_musculus.GRCm38.${ENSEMBL_RELEASE}.chr.gtf # Excludes unplaced contigs
# GTF_FILE=Mus_musculus.GRCm38.${ENSEMBL_RELEASE}.gtf
get() {
file=$1
if ! wget --version >/dev/null 2>/dev/null ; then
if ! curl --version >/dev/null 2>/dev/null ; then
echo "Please install wget or curl somewhere in your PATH"
exit 1
fi
curl -o `basename $1` $1
return $?
else
wget -nv $1
return $?
fi
}
HISAT2_BUILD_EXE=./hisat2-build
if [ ! -x "$HISAT2_BUILD_EXE" ] ; then
if ! which hisat2-build ; then
echo "Could not find hisat2-build in current directory or in PATH"
exit 1
else
HISAT2_BUILD_EXE=`which hisat2-build`
fi
fi
HISAT2_SS_SCRIPT=./hisat2_extract_splice_sites.py
if [ ! -x "$HISAT2_SS_SCRIPT" ] ; then
if ! which hisat2_extract_splice_sites.py ; then
echo "Could not find hisat2_extract_splice_sites.py in current directory or in PATH"
exit 1
else
HISAT2_SS_SCRIPT=`which hisat2_extract_splice_sites.py`
fi
fi
HISAT2_EXON_SCRIPT=./hisat2_extract_exons.py
if [ ! -x "$HISAT2_EXON_SCRIPT" ] ; then
if ! which hisat2_extract_exons.py ; then
echo "Could not find hisat2_extract_exons.py in current directory or in PATH"
exit 1
else
HISAT2_EXON_SCRIPT=`which hisat2_extract_exons.py`
fi
fi
#rm -f genome.fa
# Uncomment this block if retrieving individual chromosomes
for c in ${CHR[@]}; do
F="Mus_musculus.GRCm38.dna.chromosome.$c.fa"
G=$(echo $F | sed 's/Mus_musculus\.GRCm38\.dna\.chromosome\./chr./')
if [ ! -f $G ] ; then
# Braces rather than parentheses: an exit inside ( ) only leaves the subshell.
get ${ENSEMBL_GRCm38_BASE}/$F.gz || { echo "Error getting $F"; exit 1; }
gunzip $F.gz || { echo "Error unzipping $F"; exit 1; }
mv $F "chr.$c.fa"
fi
GENOME=("${GENOME[@]}" "chr.$c.fa")
done
GENOME=$(echo ${GENOME[@]} | sed 's/\s/,/g')
if [ ! -f $GTF_FILE ] ; then
get ${ENSEMBL_GRCm38_GTF_BASE}/${GTF_FILE}.gz || { echo "Error getting ${GTF_FILE}"; exit 1; }
gunzip ${GTF_FILE}.gz || { echo "Error unzipping ${GTF_FILE}"; exit 1; }
fi
if [ ! -f genome.ss ] ; then
${HISAT2_SS_SCRIPT} ${GTF_FILE} > genome.ss
${HISAT2_EXON_SCRIPT} ${GTF_FILE} > genome.exon
fi
hisat2-build -f -p 8 --ss genome.ss --exon genome.exon $GENOME genome_tran
You have the right idea. If the required amount of RAM is not available, use the pre-built indexes provided by HISAT2.
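As a rough sketch of that decision, the snippet below compares the memory reported by the node with a threshold before attempting the build. It assumes a Linux node that exposes `/proc/meminfo`. The function names and the default `REQUIRED_GB` value are hypothetical placeholders, not documented HISAT2 requirements for mouse.

```shell
# Placeholder threshold in GB; an assumption, adjust to what the build really needs.
REQUIRED_GB=${REQUIRED_GB:-200}

mem_available_gb() {
    # MemAvailable is reported in kB in /proc/meminfo on Linux.
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
}

enough_memory() {
    # $1 = available GB, $2 = required GB; succeeds when available >= required.
    [ "$1" -ge "$2" ]
}

avail=$(mem_available_gb)
if enough_memory "${avail:-0}" "$REQUIRED_GB"; then
    echo "OK: ${avail} GB available; building the index with --ss/--exon"
else
    echo "Only ${avail:-0} GB available (< ${REQUIRED_GB} GB); consider a pre-built index"
fi
```

Under Slurm, the job's memory limit can be lower than the node's free memory, so the value requested with `--mem` is the safer figure to compare against.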