RNA star taking more than 24h to complete 2nd pass

2.0 years ago

manuelmourato25 • 0

Hello all,

I am very new to STAR alignment and RNA-seq in general. I have 20 mouse bulk RNA samples which I am trying to align to a reference genome after QC filtering and trimming. To align, I am using the following commands:

STAR \
  --genomeDir ./mus_musculus_c57bl6nj/starSTAR_2.5.2b \
  --readFilesIn sample1_R1_.fastp.fastq.gz sample1_R2_.fastp.fastq.gz \
  --runThreadN 32 \
  --genomeLoad NoSharedMemory \
  --outSAMtype BAM SortedByCoordinate \
  --readFilesCommand zcat \
  --outFileNamePrefix sample1/star_mapping/sample1_init_ \
  --outFilterMismatchNoverLmax 0.05 \
  --seedSearchStartLmax 20  2> sample1/star_mapping/sample1.logs

STAR \
  --runMode genomeGenerate \
  --genomeDir sample1/star_mapping/sample1_star_2nd_pass \
  --genomeFastaFiles ./mus_musculus_c57bl6nj/mus_musculus_c57bl6nj.fa \
  --sjdbFileChrStartEnd sample1/star_mapping/sample1_init_SJ.out.tab \
  --sjdbOverhang 100 \
  --runThreadN 32  2>> sample1/star_mapping/sample1.logs

STAR \
  --genomeDir sample1/star_mapping/sample1_star_2nd_pass \
  --readFilesIn sample1_R1_.fastp.fastq.gz sample1_R2_.fastp.fastq.gz \
  --runThreadN 32 \
  --genomeLoad NoSharedMemory \
  --sjdbFileChrStartEnd sample1/star_mapping/sample1_init_SJ.out.tab \
  --outSAMtype BAM SortedByCoordinate \
  --readFilesCommand zcat \
  --outFileNamePrefix sample1/star_mapping/sample1_ \
  --outFilterMismatchNoverLmax 0.05 \
  --seedSearchStartLmax 20 2>> sample1/star_mapping/sample1.logs

For 19 of the samples, the whole pipeline takes about 3 hours to complete. But one sample takes over 24 hours: the first pass takes around 2 hours, and most of the remaining time is spent in the 2nd pass.

Here are the statistics for the 1st pass:

For the 2nd pass I only have the progress log, but the speed is around 0.3M reads per hour.

Could anyone help me speed up this process for this one sample? Thank you.

rna-seq star • 1.2k views

ADD COMMENT • link updated 2.0 years ago by Ram 45k • written 2.0 years ago by manuelmourato25 • 0

A couple of points:

1. The 2-pass, 2-run approach works when you use all SJ.out.tab files from the first pass in each second-pass run, as outlined in the manual (see section 8, especially 8.3, on how your approach is outdated). If you need to work per sample, --twoPassMode Basic in a single run works better. If not, run alignReads the second time with all the SJ files.
2. Why are you generating sorted BAMs? STAR is notoriously bad at this. Generate unsorted BAMs (--outSAMtype BAM Unsorted) and sort them downstream with samtools; you will save some memory and time in the process.
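
The two suggestions can be sketched as a small script. This is a minimal sketch, not a tested pipeline: the genome directory, file names and filter options are copied from the question, while the `run` helper, the `DRY_RUN` switch, the function names, the `samtools` thread count and the `*/star_mapping/*_init_SJ.out.tab` glob are assumptions made for illustration. By default it only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Sketch only: prints commands by default; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Per-sample 2-pass alignment in a single STAR run (--twoPassMode Basic),
# writing an unsorted BAM so that sorting is left to samtools.
align_two_pass() {
    sample="$1"
    run STAR \
        --genomeDir ./mus_musculus_c57bl6nj/starSTAR_2.5.2b \
        --readFilesIn "${sample}_R1_.fastp.fastq.gz" "${sample}_R2_.fastp.fastq.gz" \
        --readFilesCommand zcat \
        --runThreadN 32 \
        --twoPassMode Basic \
        --outSAMtype BAM Unsorted \
        --outFileNamePrefix "${sample}/star_mapping/${sample}_" \
        --outFilterMismatchNoverLmax 0.05 \
        --seedSearchStartLmax 20
}

# Alternative for multi-sample projects: a second pass against the original
# index, inserting the junctions from every first-pass SJ.out.tab on the fly.
# The glob assumes all samples follow the <sample>/star_mapping layout.
align_second_pass_all_sj() {
    sample="$1"
    run STAR \
        --genomeDir ./mus_musculus_c57bl6nj/starSTAR_2.5.2b \
        --readFilesIn "${sample}_R1_.fastp.fastq.gz" "${sample}_R2_.fastp.fastq.gz" \
        --readFilesCommand zcat \
        --runThreadN 32 \
        --sjdbFileChrStartEnd */star_mapping/*_init_SJ.out.tab \
        --outSAMtype BAM Unsorted \
        --outFileNamePrefix "${sample}/star_mapping/${sample}_" \
        --outFilterMismatchNoverLmax 0.05 \
        --seedSearchStartLmax 20
}

# Coordinate-sort and index the unsorted STAR output with samtools.
sort_bam() {
    sample="$1"
    prefix="${sample}/star_mapping/${sample}_"
    run samtools sort -@ 8 \
        -o "${prefix}Aligned.sortedByCoord.out.bam" \
        "${prefix}Aligned.out.bam"
    run samtools index "${prefix}Aligned.sortedByCoord.out.bam"
}

align_two_pass sample1 && sort_bam sample1
```

Only one of `align_two_pass` and `align_second_pass_all_sj` would be used for a given project; the second one expects the first pass to have finished for every sample.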

ADD REPLY • link 2.0 years ago by Ram 45k

Hello Ram,

First off, thank you so much for this explanation. I did not know I was using a legacy version of the STAR 2-pass method for per-sample alignment. The problem is that I am using a Python package called sequana to run these pipelines and, due to environment constraints, I can only use a sequana version that uses the original 2-pass method (the one I described), unfortunately.

I will explore your 2nd point about the sorted BAM. Just out of curiosity, though: from the statistics of the 1st pass, I can see that the number of uniquely mapped reads is really low for this sample, as is the number of splices. Could this be a possible cause of the slow 2nd pass?
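
To put this sample's numbers next to those of the other 19, the relevant lines can be pulled out of each first-pass `Log.final.out`. This is a minimal sketch: the function name and the `*/star_mapping/*_init_Log.final.out` layout are assumptions based on the prefixes used above, and the field labels are the ones STAR writes in its final log. It only collects the statistics; it does not by itself show whether they explain the slow 2nd pass.

```shell
#!/bin/sh
# Print the key mapping statistics from one STAR Log.final.out file.
summarize_star_log() {
    log="$1"
    grep -F \
        -e 'Uniquely mapped reads %' \
        -e 'Number of splices: Total' \
        -e '% of reads mapped to multiple loci' \
        -e '% of reads mapped to too many loci' \
        -e '% of reads unmapped: too short' \
        "$log"
}

# Assumed layout: <sample>/star_mapping/<sample>_init_Log.final.out
for log in */star_mapping/*_init_Log.final.out; do
    [ -e "$log" ] || continue   # skip when the glob matches nothing
    echo "== $log"
    summarize_star_log "$log"
done
```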

ADD REPLY • link 2.0 years ago by manuelmourato25 • 0

Hi,

Try and fix the way you run the pipeline so you can use newer, better ways to get results. I can't speak to the relationship between runtime and the two factors you listed, but a 24h run is not unusual with STAR. I've seen it take longer on smaller files and shorter on larger files, so I don't think number of reads is the only factor either. Sometimes, all you can do is give it time.

ADD REPLY • link 2.0 years ago by Ram 45k