Question

bash script for RNA seq alignment with STAR

4

Entering edit mode

7.3 years ago

nicole p ▴ 40

Hi all,

I need to come up with some kind of automated script that will align each sample to the my reference genome. I am using STAR to align reads. I have reads that look something like this (all paired end):

SRR112312_1.fastq.gz 
SRR112312_2.fastq.gz
SRR112313_1.fastq.gz
SRR112313_2.fastq.gz
SRR112372_1.fastq.gz
SRR112372_2.fastq.gz
...

I have tried with code like this:

for i in $(ls *.fastq.gz | rev | cut -c 11- | rev | uniq)

My typical STAR command looks like this:

STAR --genomeDir /indices/human --readFilesIn SRR112312_1.fastq.gz SRR112312_2.fastq.gz --runThreadN 8 --outFileNamePrefix /alignment/treg_NBP_8_ --quantMode GeneCounts --readFilesCommand zcat

I tried using the answer from this post: bash loop for alignment RNA-seq data but im not having any luck. I would like to align each PE read to the genome to get a BAM file. In this example, there would be 3 BAM files. Also, all of my reads end as in _1.fastq.gz and _2.fastq.gz

If anyone can offer any help that would be much appreciated! Just jumping into scripting and this has been giving me a hard time. Thanks!

script bash RNA-Seq STAR • 16k views

ADD COMMENT • link updated 7.3 years ago by mforde84 ★ 1.4k • written 7.3 years ago by nicole p ▴ 40

score 5 · Answer 1 · 2017-08-10

This is the script I am currently using, with my default settings, so it can be called just as: star-wrapper.sh reads_1.fastq.gz reads_2.fastq.gz

#!/bin/sh
set -eu

## this is a wrapper for mapping single end and paired end data

STAR=`which STAR` # adapt to your needs
### lsalmonis genome for 76bp reads
DBDIR=licebase/genomedata/lsal76ribo

BASE=`basename $1`
DNAME=`dirname $1`
#BASE=VOD/$BASE
#ZCAT=
ZCAT="--readFilesCommand zcat"
THREADS=140
OPTS_P1="--outReadsUnmapped Fastx --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 100000000000 --genomeLoad LoadAndKeep --seedSearchStartLmax 8 --outFilterMultimapNmax 100 --outFilterMismatchNoverLmax 0.5 --outFileNamePrefix $DNAME/${BASE}"

if [ "$#" -ge 3  ]; then
   echo "illegal number of parameters"
   exit 1
fi

if [ "$#" -eq 2 ]; then
CMD="$STAR --runThreadN $THREADS --outBAMsortingThreadN 10 $OPTS_P1 $ZCAT  --genomeDir $DBDIR --readFilesIn $1 $2"
echo $CMD
$CMD
fi

if [ "$#" -eq 1 ]; then
CMD="$STAR --runThreadN $THREADS --outBAMsortingThreadN 10 $OPTS_P1 $ZCAT  --genomeDir $DBDIR --readFilesIn $1"
echo $CMD
$CMD
fi

Then I have a little wrapper:

#!/bin/sh
set -eu

for F in $@ ; do
  R1=`echo $F | sed s/_2/_1/`
  R2=`echo $F | sed s/_1/_2/`
  echo "Processing $R1 $R2"
  ./star-wrapper.sh $R1 $R2
  echo "Done"
done

which can be called like nohup doit.sh *_1.fastq.gz for paired end reads on all reads in the directory.

score 2 · Answer 2 · 2017-08-24

You can to use bpipe, to automate and parallelize your pipeline.

For example, you can create a file STAR_mapping.groovy then write your script as follow (supposing that you already have done the adaptor trimming and quality control)

runSTAR = {
   def prefix="/alignment/treg_NBP_8_ "
   exec """
   STAR --genomeDir /indices/human 
   --readFilesIn $inputs
   --runThreadN 8 
   --outFileNamePrefix ${prefix}
   --quantMode GeneCounts 
   --readFilesCommand zcat
   """
 }

 run {
   "%_*.fastq.gz " * [runSTAR]
 }

Then you can run it using the following command:

bpipe run -r -n 2 STAR_mapping.groovy *.fastq.gz

Here: - r : generate an HTML report / documentation for the pipeline -n : how many parallel job you allow to run

Here -n 2 means, you'll be running two STAR mapping jobs in the same time. In your script, each job will need 8 threads. In total you'll be using 2*8=16 threads. Also keep in mind that STAR uses about 32Gb, so here you'll need 32 *2 =64Gb.

you can set -n 1 to allow just one job at a time if you don't have too much resources.

score 0 · Answer 3 · 2017-08-24

0

Entering edit mode

7.3 years ago

mforde84 ★ 1.4k

https://github.com/mforde84/RNAseq_BigMechanism_DARPA

ADD COMMENT • link 7.3 years ago by mforde84 ★ 1.4k