Hello all,
I am very new to STAR alignment and RNA-seq in general. I have 20 mouse bulk RNA samples, which I am trying to align to a reference genome after QC filtering and trimming. To align, I am using the following commands:
STAR \
--genomeDir ./mus_musculus_c57bl6nj/starSTAR_2.5.2b \
--readFilesIn sample1_R1_.fastp.fastq.gz sample1_R2_.fastp.fastq.gz \
--runThreadN 32 \
--genomeLoad NoSharedMemory \
--outSAMtype BAM SortedByCoordinate \
--readFilesCommand zcat \
--outFileNamePrefix sample1/star_mapping/sample1_init_ \
--outFilterMismatchNoverLmax 0.05 \
--seedSearchStartLmax 20 2> sample1/star_mapping/sample1.logs
STAR \
--runMode genomeGenerate \
--genomeDir sample1/star_mapping/sample1_star_2nd_pass \
--genomeFastaFiles ./mus_musculus_c57bl6nj/mus_musculus_c57bl6nj.fa \
--sjdbFileChrStartEnd sample1/star_mapping/sample1_init_SJ.out.tab \
--sjdbOverhang 100 \
--runThreadN 32 2>> sample1/star_mapping/sample1.logs
STAR \
--genomeDir sample1/star_mapping/sample1_star_2nd_pass \
--readFilesIn sample1_R1_.fastp.fastq.gz sample1_R2_.fastp.fastq.gz \
--runThreadN 32 \
--genomeLoad NoSharedMemory \
--sjdbFileChrStartEnd sample1/star_mapping/sample1_init_SJ.out.tab \
--outSAMtype BAM SortedByCoordinate \
--readFilesCommand zcat \
--outFileNamePrefix sample1/star_mapping/sample1_ \
--outFilterMismatchNoverLmax 0.05 \
--seedSearchStartLmax 20 2>> sample1/star_mapping/sample1.logs
For 19 of the samples, the whole pipeline takes about 3 hours to complete. But one sample takes over 24 hours: the first pass takes around 2 hours, and most of the remaining time is spent in the 2nd pass.
Here are the statistics for the 1st pass:
For the 2nd pass I only have the progress log, but the speed is around 0.3M reads/hour.
Could anyone help me speed up this process for this one sample? Thank you.
A couple of questions:
SJ.out.tab files from the first pass in each second pass run, as outlined in the manual (see section 8, especially 8.3, on how your approach is outdated). If you need to do per-sample, --twoPassMode Basic in a single run works better. If not, run alignReads the second time with all the SJ files.
Hello Ram,
First off, thank you so much for this explanation. I did not know I was using a legacy version of the STAR 2nd pass for per-sample alignment. The problem is that I am using a Python package called sequana to run these pipelines and, due to environment constraints, I can only use a sequana version that uses the original 2nd pass method (the one I described), unfortunately.
I will explore the 2nd point you gave about the sorted BAM. Just out of curiosity, though: from the statistics of the 1st pass, I can see that the number of uniquely mapped reads is really low for this sample, as is the number of splices. Could this be a possible cause of the slow 2nd pass?
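To put the slow sample side by side with a fast one, the first-pass figures mentioned above can be pulled straight from STAR's Log.final.out. This is only a rough sketch: star_log_summary is a hypothetical helper name, and the log path assumes the first pass was written with the prefix sample1/star_mapping/sample1_init_, so adjust it to whatever prefix was actually used.

```shell
# Hypothetical helper: print the Log.final.out lines that describe the
# mapping outcome, so the slow sample can be compared with a fast one.
star_log_summary() {
  grep -E 'Number of input reads|Uniquely mapped reads %|Number of splices: Total|% of reads mapped to (multiple|too many) loci|% of reads unmapped' "$1"
}

# Assumed path: STAR writes <outFileNamePrefix>Log.final.out, so change this
# to match the prefix used for the first pass.
log="sample1/star_mapping/sample1_init_Log.final.out"
if [ -f "$log" ]; then
  star_log_summary "$log"
else
  echo "no such log: $log" >&2
fi
```

Running the same function on the log of one of the 19 fast samples shows whether the unique, multi-mapping, and unmapped percentages really differ for the slow one.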
Hi,
Try to fix the way you run the pipeline so you can use newer, better ways to get results. I can't speak to the relationship between runtime and the two factors you listed, but a 24h run is not unusual with STAR. I've seen it take longer on smaller files and shorter on larger files, so I don't think the number of reads is the only factor either. Sometimes, all you can do is give it time.
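For reference, the two newer approaches mentioned earlier in the thread (per-sample --twoPassMode Basic, or a second pass that takes the SJ.out.tab files from all samples) might look roughly like the sketch below. The function names and the STAR_BIN variable are hypothetical and are not part of STAR or sequana. The paths and filter options are copied from the commands in the question. By default the sketch only prints the command rather than running it.

```shell
# Hypothetical wrappers (not part of STAR or sequana). STAR_BIN defaults to
# "echo STAR", an assumption of this sketch, so the commands are only printed;
# set STAR_BIN=STAR to execute them.
GENOME_DIR=./mus_musculus_c57bl6nj/starSTAR_2.5.2b

# Options shared by both variants, copied from the commands in the question.
# Any extra arguments are appended to the STAR command line.
star_align() {
  sample="$1"; shift
  ${STAR_BIN:-echo STAR} \
    --genomeDir "$GENOME_DIR" \
    --readFilesIn "${sample}_R1_.fastp.fastq.gz" "${sample}_R2_.fastp.fastq.gz" \
    --readFilesCommand zcat \
    --runThreadN 32 \
    --genomeLoad NoSharedMemory \
    --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix "${sample}/star_mapping/${sample}_" \
    --outFilterMismatchNoverLmax 0.05 \
    --seedSearchStartLmax 20 \
    "$@"
}

# Per-sample: both passes in one run, no genome re-generation in between.
star_two_pass_basic() {
  star_align "$1" --twoPassMode Basic
}

# Multi-sample: second pass with junctions from every first-pass run.
# Usage: star_second_pass_all_sj SAMPLE SJ_FILE...
star_second_pass_all_sj() {
  sample="$1"; shift
  star_align "$sample" --sjdbFileChrStartEnd "$@"
}

star_two_pass_basic sample1
```

For the multi-sample variant, something like star_second_pass_all_sj sample1 */star_mapping/*_init_SJ.out.tab would pass every first-pass junction file to the second pass, assuming the first-pass outputs follow that naming.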