Hi, I am new to bioinformatics, especially on the command line. I am trying to run STAR alignment on paired fastq.gz files from several samples generated as part of an RNA-seq experiment. My goal is to perform splice variant analysis on the output. I am submitting the following Slurm job:
#!/bin/bash
#SBATCH --job-name=STAR_alignment
#SBATCH --output=star_alignment_%j.out
#SBATCH --error=star_alignment_%j.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=125G
#SBATCH --time=7:59:59
GENOME_DIR=~/indices_directory
DATA_DIR=~/Arthur_RNAseq
cd $DATA_DIR
for folder in $(ls -d */); do
cd $folder
SAMPLE=$(basename $folder)
# Create a new output directory for STAR results
OUTPUT_DIR=$DATA_DIR/$SAMPLE/STAR_output
mkdir -p $OUTPUT_DIR
STAR --genomeDir $GENOME_DIR \
--readFilesIn ${SAMPLE}_merged_R1.fastq.gz ${SAMPLE}_merged_R2.fastq.gz \
--runThreadN 32 \
--outFileNamePrefix $OUTPUT_DIR/${SAMPLE}_ \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--outSAMattributes Standard
cd $DATA_DIR
done
However, after running for a few minutes, the process is killed and in the .err file I am getting the following error:
ReadAlignChunk_processChunks.cpp:202:processChunks EXITING because of FATAL ERROR in input reads: unknown file format: the read ID should start with @ or >
I have double-checked the files, and all of them start with @ and follow the correct format. My only concern is that the read length is 50 instead of the recommended 100 for splicing analysis.
Does anybody have any clues or ideas on how to get around this problem?
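To debug what is being fed to STAR, simply look at the first few lines of the input exactly as STAR receives it. A .fastq.gz file that is read without decompression is binary gzip data, so its first bytes are not @ even though the decompressed reads look fine.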
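Something along these lines should show the difference. This is only a sketch: the helper names is_gzipped and peek_fastq and the file name in the example are made up for illustration, so adapt them to your own files.

```shell
#!/bin/bash
# Sketch: compare the raw file with its decompressed content.

# Succeeds if the file starts with the gzip magic bytes (1f 8b),
# i.e. STAR would see binary data unless told to decompress it.
is_gzipped() {
    [ "$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')" = "1f8b" ]
}

# Prints the first FASTQ record (4 lines), decompressing if needed.
peek_fastq() {
    if is_gzipped "$1"; then
        gzip -dc "$1" | head -n 4
    else
        head -n 4 "$1"
    fi
}

# Example (hypothetical file name):
# is_gzipped sampleA_merged_R1.fastq.gz && echo "compressed: STAR needs a decompression command"
# peek_fastq sampleA_merged_R1.fastq.gz
```

If is_gzipped succeeds and the decompressed record starts with @, the files themselves are fine and the problem is how STAR is told to read them.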
Also, you may be missing the --readFilesCommand option, which tells STAR how to decompress gzipped input (e.g. with pigz).
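A minimal sketch of how that could look in your job script, assuming pigz -dc or gzip -dc as the decompressor. The variable name READ_CMD is my own; --readFilesCommand is the actual STAR option.

```shell
#!/bin/bash
# Sketch: choose a decompression command for STAR's --readFilesCommand.
# READ_CMD is a hypothetical variable name, not something STAR defines.
if command -v pigz >/dev/null 2>&1; then
    READ_CMD="pigz -dc"   # parallel gzip, if available
else
    READ_CMD="gzip -dc"   # plain gzip fallback
fi

echo "Add to the STAR call: --readFilesCommand $READ_CMD"

# In the loop, the call would then include the extra option
# (left commented out here; $READ_CMD is intentionally unquoted so it
# splits into the command and its flag):
# STAR --genomeDir "$GENOME_DIR" \
#      --readFilesIn "${SAMPLE}_merged_R1.fastq.gz" "${SAMPLE}_merged_R2.fastq.gz" \
#      --readFilesCommand $READ_CMD \
#      --runThreadN 32 \
#      --outFileNamePrefix "$OUTPUT_DIR/${SAMPLE}_" \
#      --outSAMtype BAM SortedByCoordinate \
#      --outSAMunmapped Within \
#      --outSAMattributes Standard
```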
If you do not have pigz installed, use gzip instead.