Question

HISAT2 Error Building Index

5.1 years ago

adam.faranda ▴ 110

I consistently get the following error message when I try to build a HISAT2 index for a mouse genome.

Error: Encountered internal HISAT2 exception (#1)

My call to hisat2 is as follows:

hisat2-build -f -p 8 --ss genome.ss --exon genome.exon $GENOME genome_tran

Where "$GENOME" contains a comma-separated list of FASTA files (one for each chromosome).
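For readers who want to see how such a list can be assembled, here is a minimal, portable sketch. It is not the code from my script; the shortened chromosome list and the variable names CHRS and GENOME_LIST are made up for illustration.

```shell
#!/bin/sh
# Sketch: build a comma-separated list of per-chromosome FASTA files.
# CHRS is a shortened, illustrative list; GENOME_LIST is a hypothetical name.
CHRS="1 2 3 X Y MT"

GENOME_LIST=""
for c in $CHRS; do
    # Prefix a comma only when the list already holds an entry.
    GENOME_LIST="${GENOME_LIST:+$GENOME_LIST,}chr.$c.fa"
done

echo "$GENOME_LIST"
# prints: chr.1.fa,chr.2.fa,chr.3.fa,chr.X.fa,chr.Y.fa,chr.MT.fa
```

The resulting string can then be passed as the reference argument in the call shown above.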

I'm using the "build_index.sh" script with some minor modifications. Until now, I have not had any problems building indexes. I'm running this on a Unix server with Slurm job control; I've verified that my job is being assigned 8 CPUs and at least 500 GB of RAM.
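One way to double-check the allocation from inside the job is to print the environment variables that Slurm sets for a batch script. This is only a sketch: the function name show_allocation is made up, and some of these variables are set only when the matching option was requested (for example, SLURM_CPUS_PER_TASK appears only when --cpus-per-task was given).

```shell
#!/bin/sh
# Sketch: report what Slurm granted to the running job.
# Outside a Slurm job every value prints as "unset".
show_allocation() {
    echo "SLURM_JOB_ID=${SLURM_JOB_ID:-unset}"
    echo "SLURM_NTASKS=${SLURM_NTASKS:-unset}"
    echo "SLURM_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-unset}"
    # Memory per node in MB, set when --mem was given.
    echo "SLURM_MEM_PER_NODE=${SLURM_MEM_PER_NODE:-unset}"
}

show_allocation
```

Note that --ntasks=8 asks for eight tasks; for a single multi-threaded program, --cpus-per-task=8 is the more usual request, although on many clusters both end up giving the job eight cores.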

My complete HISAT2 output and script are posted below. If anyone has any ideas about how to troubleshoot this, please let me know.

Complete HISAT2 output:

home/abf/bin/hisat2-build
/home/abf/bin/hisat2_extract_splice_sites.py
/home/abf/bin/hisat2_extract_exons.py
Settings:
  Output files: "genome_tran.*.ht2"
  Line rate: 7 (line is 128 bytes)
  Lines per side: 1 (side is 128 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Local sequence length: 57344
  Local sequence overlap between two consecutive indexes: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  chr.1.fa
  chr.2.fa
  chr.3.fa
  chr.4.fa
  chr.5.fa
  chr.6.fa
  chr.7.fa
  chr.8.fa
  chr.9.fa
  chr.10.fa
  chr.11.fa
  chr.12.fa
  chr.13.fa
  chr.14.fa
  chr.15.fa
  chr.16.fa
  chr.17.fa
  chr.18.fa
  chr.19.fa
  chr.X.fa
  chr.Y.fa
  chr.MT.fa
Reading reference sizes
Reading reference sizes
  Time reading reference sizes: 00:00:22
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:18
  Time to read SNPs and splice sites: 00:00:01
Total time for call to driver() for forward index: 00:20:28
Error: Encountered internal HISAT2 exception (#1)
Command: hisat2-build --wrapper basic-0 -f -p 8 --ss genome.ss --exon genome.exon chr.1.fa,chr.2.fa,chr.3.fa,chr.4.fa,chr.5.fa,chr.6.fa,chr.7.fa,chr.8.fa,chr.9.fa,chr.10.fa,chr.11.fa,chr.12.fa,chr.13.fa,chr.14.fa,chr.15.fa,chr.16.fa,chr.17.fa,chr.18.fa,chr.19.fa,chr.X.fa,chr.Y.fa,chr.MT.fa genome_tran
Deleting "genome_tran.1.ht2" file written during aborted indexing attempt.
Deleting "genome_tran.2.ht2" file written during aborted indexing attempt.
Deleting "genome_tran.3.ht2" file written during aborted indexing attempt.
Deleting "genome_tran.4.ht2" file written during aborted indexing attempt.
Deleting "genome_tran.5.ht2" file written during aborted indexing attempt.
Deleting "genome_tran.6.ht2" file written during aborted indexing attempt.
Deleting "genome_tran.7.ht2" file written during aborted indexing attempt.
Deleting "genome_tran.8.ht2" file written during aborted indexing attempt.

My Script:

#!/bin/bash
#SBATCH --job-name=BUILD_MOUSE_INDEX
#SBATCH --ntasks=8
#SBATCH --mem=512000

# Downloads sequence for the GRCm38 version of M. musculus (mouse) from
# Ensembl (the release is set by ENSEMBL_RELEASE below).
#
# By default, this script builds an index for just the base files,
# since alignments to those sequences are the most useful.  To change
# which chromosomes are indexed by this script, edit the CHR
# array below.
#

export PATH=$PATH:/home/abf/bin
declare -a CHR=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Y MT)
declare -a GENOME=()

ENSEMBL_RELEASE=98
ENSEMBL_GRCm38_BASE=ftp://ftp.ensembl.org/pub/release-${ENSEMBL_RELEASE}/fasta/mus_musculus/dna
ENSEMBL_GRCm38_GTF_BASE=ftp://ftp.ensembl.org/pub/release-${ENSEMBL_RELEASE}/gtf/mus_musculus
GTF_FILE=Mus_musculus.GRCm38.${ENSEMBL_RELEASE}.chr.gtf # Excludes unplaced contigs
# GTF_FILE=Mus_musculus.GRCm38.${ENSEMBL_RELEASE}.gtf

get() {
        file=$1
        if ! wget --version >/dev/null 2>/dev/null ; then
                if ! curl --version >/dev/null 2>/dev/null ; then
                        echo "Please install wget or curl somewhere in your PATH"
                        exit 1
                fi
                curl -o "$(basename "$1")" "$1"
                return $?
        else
                wget -nv "$1"
                return $?
        fi
}

HISAT2_BUILD_EXE=./hisat2-build
if [ ! -x "$HISAT2_BUILD_EXE" ] ; then
        if ! which hisat2-build ; then
                echo "Could not find hisat2-build in current directory or in PATH"
                exit 1
        else
                HISAT2_BUILD_EXE=`which hisat2-build`
        fi
fi

HISAT2_SS_SCRIPT=./hisat2_extract_splice_sites.py
if [ ! -x "$HISAT2_SS_SCRIPT" ] ; then
        if ! which hisat2_extract_splice_sites.py ; then
                echo "Could not find hisat2_extract_splice_sites.py in current directory or in PATH"
                exit 1
        else
                HISAT2_SS_SCRIPT=`which hisat2_extract_splice_sites.py`
        fi
fi

HISAT2_EXON_SCRIPT=./hisat2_extract_exons.py
if [ ! -x "$HISAT2_EXON_SCRIPT" ] ; then
        if ! which hisat2_extract_exons.py ; then
                echo "Could not find hisat2_extract_exons.py in current directory or in PATH"
                exit 1
        else
                HISAT2_EXON_SCRIPT=`which hisat2_extract_exons.py`
        fi
fi

#rm -f genome.fa
# Uncomment this block if retrieving individual chromosomes
for c in "${CHR[@]}"; do

    F="Mus_musculus.GRCm38.dna.chromosome.$c.fa"
    G="chr.$c.fa"
    if [ ! -f "$G" ] ; then
        # Use { ...; } rather than ( ... ) so that exit stops the whole script
        get "${ENSEMBL_GRCm38_BASE}/$F.gz" || { echo "Error getting $F"; exit 1; }
        gunzip "$F.gz" || { echo "Error unzipping $F"; exit 1; }
        mv "$F" "$G"
    fi
    GENOME+=("$G")

done

# Join the array elements with commas; the result is stored in element 0, which $GENOME expands to
GENOME=$(IFS=,; echo "${GENOME[*]}")

if [ ! -f "$GTF_FILE" ] ; then
       get "${ENSEMBL_GRCm38_GTF_BASE}/${GTF_FILE}.gz" || { echo "Error getting ${GTF_FILE}"; exit 1; }
       gunzip "${GTF_FILE}.gz" || { echo "Error unzipping ${GTF_FILE}"; exit 1; }
fi

if [ ! -f genome.ss ] ; then
       ${HISAT2_SS_SCRIPT} ${GTF_FILE} > genome.ss
       ${HISAT2_EXON_SCRIPT} ${GTF_FILE} > genome.exon
fi

hisat2-build -f -p 8 --ss genome.ss --exon genome.exon $GENOME genome_tran
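A side note on error handling in scripts of this kind: writing `cmd || (echo "message" && exit 1)` prints the message but does not stop the script, because `exit` inside parentheses only leaves the subshell. The sketch below illustrates the difference with two made-up functions (try_subshell and try_group); `false` stands in for a failing download or gunzip step.

```shell
#!/bin/sh
# Sketch: "exit" inside ( ... ) ends only the subshell, not the caller.

try_subshell() {
    # The subshell exits with status 1, but execution continues on the next line.
    false || (echo "subshell: step failed" && exit 1)
    echo "subshell: still running"
}

try_group() {
    # A brace group runs in the current shell, so return (or exit at the top
    # level of a script) really leaves it.
    false || { echo "group: step failed"; return 1; }
    echo "group: still running"
}

try_subshell
try_group || echo "group: caller saw the failure"
```

With the brace form, a failed download stops the run before hisat2-build is started on incomplete input.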

score 1 · Answer 1 · 2019-11-12

5.1 years ago

adam.faranda ▴ 110

It turns out that the server I am working on has a hard drive quota of 231 GB per user, and I was very close to it (226 GB) in my home directory. I freed ~37 GB of space by deleting some old files, and the indexer now seems to be working properly. I think the solution here is:

Make sure you have enough room on the hard drive for HISAT2 to write temporary files while building an index, in addition to having sufficient RAM.
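A pre-flight check along these lines could catch the problem before the job starts. This is a rough sketch, not something HISAT2 provides: the function names, OUT_DIR, and the 40 GB threshold are all made up for illustration, and I do not know how much temporary space hisat2-build actually needs. Also note that `df` reports free space on the filesystem, not what is left of a per-user quota, so on a quota-limited server the `du` figure has to be compared with the quota by hand (or with whatever quota tool the site provides).

```shell
#!/bin/sh
# Sketch: check disk space before starting an index build.

# Free space in KB on the filesystem that holds directory $1.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Space in KB already used under directory $1 (compare this with your quota).
used_kb() {
    du -sk "$1" 2>/dev/null | awk '{ print $1 }'
}

# Succeeds if at least $2 GB are free in directory $1.
has_free_gb() {
    [ "$(free_kb "$1")" -ge $(( $2 * 1024 * 1024 )) ]
}

OUT_DIR=.     # hypothetical: where genome_tran.*.ht2 will be written
NEED_GB=40    # hypothetical threshold, not a documented HISAT2 requirement

if has_free_gb "$OUT_DIR" "$NEED_GB"; then
    echo "OK: at least ${NEED_GB} GB free in ${OUT_DIR}"
else
    echo "Warning: less than ${NEED_GB} GB free in ${OUT_DIR}" >&2
fi
echo "Currently used under ${OUT_DIR}: $(used_kb "$OUT_DIR") KB"
```

In a batch script, the warning branch could instead exit with a non-zero status so that the build never starts on a nearly full disk.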

That's right. If the required amount of RAM is unavailable, use the pre-built indexes provided by HISAT2.

Arindam Ghosh ▴ 530 • 5.1 years ago