Hello Everyone,
Recently I had to perform RNA-Seq analysis on multiple datasets taken from NCBI SRA, so I thought of making a GENERAL bash script where I only have to change a few initial variables and the script would then work on any data I provide as input. I first ran each command individually to check that it gives no errors. Although what I did works as I wanted it to, I feel that the whole process is computationally expensive: it takes a lot of time and uses too many computational resources. Currently I am using an HP ProLiant Gen10 with 48 cores (96 threads) and 440 GB RAM.
I am not a UNIX expert in any way, so I did all of this to the best of my knowledge. My main query: could anyone take a look at the work I did and suggest changes that would improve it? Here is what I did:
#!/bin/bash
set -e #terminate if any command gives error
set -x #print every command to screen
# Hardcode Parameters
threads=90 # Number of threads assignment
ref='Ref/oryza_sativa.fa' # Path/name of Reference Fasta file
gtf='Ref/oryza_sativa.gtf' # Path/Name of Reference GTF file
# Index Building
hisat2-build -p $threads $ref $ref > Ref/Index.log # Creating Reference File Indexes and Log file
samtools faidx $ref # Creating .fai index
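# (Optional suggestion: the GTF is not used at the index-building step. HISAT2
#  can build a splice-aware index from it using the helper scripts shipped with
#  HISAT2. Hedged sketch, commented out; file names are placeholders:)
# hisat2_extract_splice_sites.py $gtf > Ref/splicesites.tsv
# hisat2_extract_exons.py $gtf > Ref/exons.tsv
# hisat2-build -p $threads --ss Ref/splicesites.tsv --exon Ref/exons.tsv $ref $ref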
# samplelist contains a list of all accession numbers
mapfile -t Acc < samplelist # Reading Sample List file into an array
max_i=${#Acc[@]} # Total number of IDs, taken directly from the array length
for((i=0; i<max_i; i++))
do
### Converting SRR files to FastQ and zipping #
# Getting FastQ files from SRA files
sratoolkit/bin/fastq-dump -v --readids --split-files --outdir FastQ SRA/"${Acc[i]}.sra"
# Creating gzip files for both mate pairs because the --gzip option of fastq-dump is deprecated and not recommended
gzip FastQ/"${Acc[i]}_1.fastq"
gzip FastQ/"${Acc[i]}_2.fastq"
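# (Optional suggestion: fastq-dump and gzip are both single-threaded. The SRA
#  toolkit also ships fasterq-dump, which is multi-threaded; it does not
#  compress its output, so pair it with pigz if that is installed. Hedged
#  alternative to the three commands above, commented out:)
# sratoolkit/bin/fasterq-dump --threads $threads --split-files --outdir FastQ SRA/"${Acc[i]}.sra"
# pigz -p $threads FastQ/"${Acc[i]}_1.fastq" FastQ/"${Acc[i]}_2.fastq"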
### Performing FastQC Data Trimming #
fastqc -t 32 --dir temp -o FastQC/Pre-FastQC FastQ/"${Acc[i]}_1.fastq.gz" FastQ/"${Acc[i]}_2.fastq.gz" # Raw FastQ-Reads FastQC Analysis
# Fastp Quality Trimming
fastp --thread $threads --verbose --detect_adapter_for_pe --overrepresentation_analysis --correction --average_qual 28 --report_title "${Acc[i]}" --html FastQC/Fastp_Report/"${Acc[i]}.fastp.html" --json FastQC/Fastp_Report/"${Acc[i]}.fastp.json" -i FastQ/"${Acc[i]}_1.fastq.gz" -I FastQ/"${Acc[i]}_2.fastq.gz" -o QC_Reads/"${Acc[i]}_1_QC.fastq.gz" -O QC_Reads/"${Acc[i]}_2_QC.fastq.gz" > QC_Reads/"${Acc[i]}.log"
# QC-Trimmed Reads FastQC Analysis
fastqc -t 32 --dir temp -o FastQC/Post-FastQC QC_Reads/"${Acc[i]}_1_QC.fastq.gz" QC_Reads/"${Acc[i]}_2_QC.fastq.gz"
### Alignment with Reference Genome #
# Step 1 Alignment – Map to Reference
hisat2 -p $threads -x $ref -1 QC_Reads/"${Acc[i]}_1_QC.fastq.gz" -2 QC_Reads/"${Acc[i]}_2_QC.fastq.gz" -S Alignment/"${Acc[i]}.sam" --summary-file Alignment/"${Acc[i]}.alignment.log"
# Step 2 SAM to BAM Conversion and Sorting
samtools view -@ $threads -b Alignment/"${Acc[i]}.sam" | samtools sort -@ $threads -T temp/tempfile -o Alignment/"${Acc[i]}.sorted.bam" # -S is no longer needed; samtools autodetects the input format
# Step 3 Bam Indexing
samtools index Alignment/"${Acc[i]}.sorted.bam"
# Step 4 Delete SAM File
rm -f Alignment/"${Acc[i]}.sam" # rm has no -y option; delete only this sample's SAM file
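# (Optional suggestion: steps 1-4 can be collapsed into one stream so the large
#  intermediate SAM file is never written; hisat2 writes SAM to stdout when -S
#  is omitted, and samtools sort reads SAM from stdin. Hedged sketch, commented
#  out:)
# hisat2 -p $threads -x $ref -1 QC_Reads/"${Acc[i]}_1_QC.fastq.gz" -2 QC_Reads/"${Acc[i]}_2_QC.fastq.gz" --summary-file Alignment/"${Acc[i]}.alignment.log" \
#     | samtools sort -@ $threads -T temp/tempfile -o Alignment/"${Acc[i]}.sorted.bam" -
# samtools index Alignment/"${Acc[i]}.sorted.bam"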
# Step 5 Running Feature Count
# Note: with Subread >= 2.0.2, add --countReadPairs alongside -p to count fragments rather than individual reads
featureCounts -T 64 -p -O -g gene_id --tmpDir temp/ -a $gtf -o Counts/"${Acc[i]}.tab" Alignment/"${Acc[i]}.sorted.bam"
done
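As a side note, the sample-list handling can be checked in isolation, without any of the bioinformatics tools installed. A minimal self-contained sketch (the file name /tmp/samplelist_demo and the SRR IDs are made up for the demo):

```shell
# Demo of the mapfile-driven sample loop used above, with the array length
# taken from bash itself instead of wc -l.
printf 'SRR000001\nSRR000002\nSRR000003\n' > /tmp/samplelist_demo
mapfile -t ids < /tmp/samplelist_demo   # read accession IDs into an array
echo "total samples: ${#ids[@]}"        # prints: total samples: 3
for acc in "${ids[@]}"; do
    echo "would process $acc"
done
```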
Thank you all in advance. Regards, Umer Farooq
There's another mistake here: you don't use the $gtf file.