Rna Seq Analysis Script
2
0
Entering edit mode
2.5 years ago
Umer ▴ 130

Hello Everyone,

Recently I had to perform RNA-Seq analysis on multiple datasets taken from NCBI-SRA. So I thought of making a GENERAL bash script where I just have to change some initial variables and the code/script would work on any data I provide as input. Hypothetically I thought of doing that, so I ran each command individually first to see if it gives any error. Although what I did works as I wanted it to, I feel that all that process is computationally extensive, takes a lot of time, and uses too many computational resources. Currently, I am using a HP-Prolient Gen10 with 48 core (96 threads) and 440 Gb RAM.

I am not a UNIX expert in any way, so I did all that to as best of my knowledge. My main Query is if anyone could take a look at the work I did and suggest changes which will improve this Here is What I Did

#!/bin/bash
set -e #terminate if any command gives error
set -x #print every command to screen

# Hardcode Parameters
threads=90                  # Number of threads assignment
ref='Ref/oryza_sativa.fa'   # Path/name of Reference Fasta file
gtf='Ref/oryza_sativa.gtf'  # Path/Name of Refetence GTF file

# Index Building
hisat2-build -p $threads $ref $ref > Ref/Index.log      # Creating Reference File Indexes and Log file
samtools faidx $ref                                     # Creating .fai index

# samplelist contains a list of all accession numbers
mapfile -t Acc < samplelist                             # Reading Sample List file into an array
max_i=$(wc -l samplelist | awk '{print $1}')            # max_i contains total number of IDs in sample list file

for((i=0; i<max_i; i++))
do

### Converting SRR files to FastQ and zipping #
# Getting FastQ files from SRA files
sratoolkit/bin/fastq-dump -v --readids --split-files --outdir FastQ SRA/"${Acc[i]}.sra"   
# Creating Zip Files for both 1st and 2nd mate pair because -gzip option in fastq-dump is deprecated and not recommanded
gzip FastQ/"${Acc[i]}_1.fastq"  
gzip FastQ/"${Acc[i]}_2.fastq"

### Performing FastQC Data Trimming #
fastqc -t 32 -dir temp -o FastQC/Pre-FastQC FastQ/"${Acc[i]}_1.fastq.gz" FastQ/"${Acc[i]}_2.fastq.gz" # Raw FastQ-Reads FastQC Analysis
# Fastp Quality Trimming
fastp --thread $threads --verbose --detect_adapter_for_pe --overrepresentation_analysis --correction --average_qual 28 --report_title "${Acc[i]}" --html FastQC/Fastp_Report/"${Acc[i]}.fastp.html" --json FastQC/Fastp_Report/"${Acc[i]}.fastp.html" -i FastQ/"${Acc[i]}_1.fastq.gz" -I FastQ/"${Acc[i]}_2.fastq.gz" -o QC_Reads/"${Acc[i]}_1_QC.fastq.gz" -O QC_Reads/"${Acc[i]}_2_QC.fastq.gz" > QC_Reads/"${Acc[i]}.log"
# QC-Trimmed Reads FastQC Analysis
fastqc -t 32 -dir temp -o FastQC/Post-FastQC QC_Reads/"${Acc[i]}_1_QC.fastq.gz" QC_Reads/"${Acc[i]}_2_QC.fastq.gz" 

### Alignment with Reference Genome #
# Step 1 Alignment – Map to Reference
hisat2 -p $threads -x $ref -1 QC_Reads/"${Acc[i]}_1_QC.fastq.gz" -2 QC_Reads/"${Acc[i]}_2_QC.fastq.gz" -S Alignment/"${Acc[i]}.sam" --summary-file Alignment/"${Acc[i]}.alignment.log"
# Step 2 SAM to BAM Conversion and Sorting
samtools view -@ $threads -S -b Alignment/"${Acc[i]}.sam" | samtools sort -@ $threads -T temp/tempfile -o Alignment/"${Acc[i]}.sorted.bam"
# Step 3 Bam Indexing
samtools index Alignment/"${Acc[i]}.sorted.bam"
# Step 4 Delete SAM File
rm *.sam -y
# Step 5 Running Feature Count
featureCounts -T 64 -p -O -g gene_id --tmpDir temp/ -a $gtf -o Counts/"${Acc[i]}.tab" Alignment/"${Acc[i]}.sorted.bam"
done

Thank You to all in advance. Regards: Umer Farooq

Next-Gen-Sequencing Script Ubuntu bash RNA-Seq • 3.0k views
ADD COMMENT
0
Entering edit mode

There's another mistake here in:

hisat2-build -p $threads $ref $ref > Ref/Index.log 

you don't use the $gtf file

ADD REPLY
3
Entering edit mode
2.5 years ago
Michael 55k

some minor comments:

  • I would write set -eux in the beginning, that way also throwing an error if a variable is undefined
  • your script will build the index each time it is run, e.g. check if the index file exists with if [ ! -f $filename ] or make a makefile right away, or if it is a single species, just run it once.
  • I'd add a multiqc summary, at the end and possibly after fastQC already
  • I'd drop out of the work-flow after fastQC to inspect the results, what is the QC good for otherwise?
  • if you switch to STAR, as recommended before, you can create BAM files directly
  • if you use fasterq-dump as recommended, it has a --threads option, too, but you don't really set it to 90, maybe 4-8 threads should be considered
  • I don't see any specific bash features in your script, so it will run just fine as #!/bin/sh which will most likely default to bash
  • to find out the length of an array you can use max_i=${#Acc[@]} instead of counting lines
ADD COMMENT
0
Entering edit mode

Thank You for the kind suggestions. Some more things I'd like to ask

  • can you elaborate on the 2nd option about reference indexes, Initially, I had the idea of running the script on different species assuming I have only the reference file with no indexes, so it will create indexes then

I'd apply the remaining recommendations. Kudos.

ADD REPLY
1
Entering edit mode

The point is to not index a genome if it's already been indexed. The suggestion is to add a check to see if the index exists, and if not, then index.

ADD REPLY
1
Entering edit mode

One could think about using:

 if [ ! -f $ref.h2  -a ! -f $ref.h21 ]; then # hisat-build either builds small < 4 Gb .h2 or large index
     hisat2-build -p $threads $ref $ref ## build index without splicing annotation
 fi
 if [ ! -f $ref.fai ]; then
     samtools faidx $ref    
 fi

However, to improve mapping accuracy, splice sites and exons should be added to the index, using --exon and --ss parameters. hisat2 doesn't support gtf files directly. The exon and splice-site files need to be generated using scripts included with hisat. See the manual.

I would still use STAR over Hisat2, it is both faster, except for genome indexing, and more sensitive.

ADD REPLY
0
Entering edit mode

Thank You for your kind suggestions. I'd apply them in my script.

ADD REPLY
2
Entering edit mode
2.5 years ago
  • Instead of fastq-dump, the recommended workflow is prefetch followed by fasterq-dump [source].
  • You don't need to quality trim RNA-seq. It's actually not recommended in most cases.
  • If you only care about transcript and/or gene level quantification you can run Salmon mapping based mode alone instead of hisat2 + featureCounts.
  • If you do require alignment use STAR + Salmon alignment based mode.

These above suggestions should greatly speed your your pipeline and provide you with more accurate results.

Additionally, Salmon requires at most 20GB of memory and about 4-8 cores, so with your compute resources you could use a program like GNU Parallel to have 10+ Salmon runs going at the same time, otherwise most of your resources will be unused.

ADD COMMENT
1
Entering edit mode

You don't need to quality trim RNA-seq. It's actually not recommended in most cases.

Can you talk about this some more? This is the first I've seen this and most pipelines I know of include trimgalore or cutadapt.

ADD REPLY
1
Entering edit mode

Aligners are able to soft clip reads as long as you are aligning to a good reference. So in a strict sense trimming is not necessary.

ADD REPLY
0
Entering edit mode

Thank You for the kind suggestions. I'd apply the remaining recommendations. Kudos.

ADD REPLY

Login before adding your answer.

Traffic: 1535 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6