Hello, I am trying to align RNA sequencing data from the NCBI SRA database to the Apis mellifera genome with STAR. The alignment worked fine. However, the mapping step of the alignment seems to be a bit slow. Furthermore, increasing the number of available threads does not improve the speed. Below you can find the command I used and the content of the Log.final.out file. Is this a good speed for STAR? Are there any methods to improve the speed?
STAR --runThreadN 12 --genomeDir ~/scratch/genomeDir --readFilesIn ${word}_1.fastq ${word}_2.fastq --outFileNamePrefix $word --outSAMtype BAM SortedByCoordinate --outSAMattrRGline ID:$word SM:$sample PL:ILLUMINA
Started job on | Nov 20 12:44:12
Started mapping on | Nov 20 12:44:12
Finished on | Nov 20 12:57:25
Mapping speed, Million of reads per hour | 52.40
Number of input reads | 11542556
Average input read length | 150
UNIQUE READS:
Uniquely mapped reads number | 10873607
Uniquely mapped reads % | 94.20%
Average mapped length | 149.74
Number of splices: Total | 3605561
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 3574735
Number of splices: GC/AG | 23103
Number of splices: AT/AC | 1720
Number of splices: Non-canonical | 6003
Mismatch rate per base, % | 0.46%
Deletion rate per base | 0.03%
Deletion average length | 2.17
Insertion rate per base | 0.02%
Insertion average length | 1.90
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 299041
% of reads mapped to multiple loci | 2.59%
Number of reads mapped to too many loci | 2889
% of reads mapped to too many loci | 0.03%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 364700
% of reads unmapped: too short | 3.16%
Number of reads unmapped: other | 2319
% of reads unmapped: other | 0.02%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
This job finished in 13 min! With larger genomes like human it can take several hours to complete similar jobs.
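If you want to check where the 52.40 figure comes from, it can be reproduced from the timestamps and read count in the log above. This is only a small sketch: the function name `mapping_speed` is made up for illustration, and it assumes the start and finish times fall on the same day.

```shell
# Reproduce the "Mapping speed, Million of reads per hour" value from Log.final.out.
# mapping_speed is a hypothetical helper; it assumes both times are on the same day.
mapping_speed() {
    # $1 = mapping start (HH:MM:SS), $2 = finish (HH:MM:SS), $3 = number of input reads
    awk -v s="$1" -v e="$2" -v n="$3" 'BEGIN {
        split(s, a, ":"); split(e, b, ":")
        secs = (b[1]*3600 + b[2]*60 + b[3]) - (a[1]*3600 + a[2]*60 + a[3])
        # million reads divided by elapsed hours
        printf "%.2f\n", (n / 1e6) / (secs / 3600)
    }'
}

# Values taken from the log in the question: 793 seconds, 11542556 reads
mapping_speed 12:44:12 12:57:25 11542556   # prints 52.40
```

So the whole run, including BAM sorting, took 793 seconds for about 11.5 million read pairs.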
Not unexpected. The number of cores is only one part of the equation; input/output from your storage, for example, can also be a limit. The algorithms used in bioinformatics programs do not always scale linearly with the number of threads, and the software itself may not have been written in a way that enables this.
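To illustrate why more threads do not give a proportional speedup, here is a rough sketch using Amdahl's law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the run that can be parallelised and n is the thread count. The value p = 0.9 is purely an assumption for illustration, not something measured for STAR, and `amdahl_speedup` is a made-up helper name.

```shell
# Amdahl's law: theoretical speedup for a job whose parallel fraction is p.
# amdahl_speedup is a hypothetical helper; p = 0.9 below is an assumed value,
# not a measurement of STAR.
amdahl_speedup() {
    # $1 = parallel fraction (0..1), $2 = number of threads
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 1 / ((1 - p) + p / n) }'
}

for n in 1 4 12 24; do
    echo "$n threads: $(amdahl_speedup 0.9 "$n")x"
done
```

Under that assumption, doubling from 12 to 24 threads moves the theoretical speedup only from about 5.7x to about 7.3x, and any storage bottleneck would reduce it further.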
13 minutes is very fast. I'm currently working on an assembly polishing pipeline that carries a warning that it may take 0.5-10 days, so you should be sure of your data before starting it. Many bioinformatics tools need to run overnight or longer, so 13 minutes is a luxury.
First world problems.
What helped me was to do what they mentioned above: turn off BAM file sorting. Generating the BAM file without sorting and then sorting it with samtools worked best for me (samtools sort myfile.bam -o myfile_sorted.bam). Another option is to use an aligner that consumes fewer resources, for example HISAT2.
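Applied to the command from the question, that two-step approach could look roughly like the sketch below. The accession, sample name, and thread count are placeholders, and the `run` and `align_and_sort` helpers are made up for illustration. By default the script only prints the commands; set DRY_RUN=0 to actually execute them. Whether this ends up faster than SortedByCoordinate depends on your machine and storage, so it is worth timing both.

```shell
# Sketch only: values below are placeholders, not taken from the original post.
word="SRR0000000"   # hypothetical run accession used as file prefix
sample="sample1"    # hypothetical sample name
threads=12

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

align_and_sort() {
    # 1) Align without sorting; STAR writes <prefix>Aligned.out.bam.
    #    Braces in ${word}_1.fastq keep the shell from reading "word_1" as the variable name.
    run STAR --runThreadN "$threads" --genomeDir "$HOME/scratch/genomeDir" \
        --readFilesIn "${word}_1.fastq" "${word}_2.fastq" \
        --outFileNamePrefix "$word" \
        --outSAMtype BAM Unsorted \
        --outSAMattrRGline "ID:$word" "SM:$sample" "PL:ILLUMINA"

    # 2) Coordinate-sort the unsorted BAM with samtools (-@ sets extra sorting threads).
    run samtools sort -@ "$threads" -o "${word}_sorted.bam" "${word}Aligned.out.bam"
}

align_and_sort
```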
Fewer resources does not mean faster processing.