Hi everyone, I need help with a script I wrote to make genome assembly as easy as possible (also to allow lab mates to use it). This is basically my first time coding as my scientific training has always revolved around wet-lab.
Since many SRA files of rice cultivars are available on NCBI, I thought of substituting time-consuming genotyping with genome assembly to look for allelic variations (we are also planning on sequencing our own lines in the future). I'm aware that some variations will be hard (or even impossible) to assemble since the pipeline is alignement-based (and if the entire locus is missing in the reference it cannot be aligned in my cultivars of interest), but I'm here to ask if you see any issues in the script that could lead me to false results.
#!/bin/bash
# Check if the correct number of arguments are provided
if [ "$#" -ne 4 ]; then
echo "Usage: $0 <cultivar_name> <fastq_file> <reference_genome>"
exit 1
fi
# Variables from user input
CULTIVAR=$1
fastq_file1=$2
fastq_file2=$3
REFERENCE_GENOME=$4
THREADS=$(nproc) # Number of available CPU cores
# Create directory for the cultivar
mkdir -p $CULTIVAR
cd
# Function to check if a program is installed, and install it if not
install_if_not_present() {
if ! command -v $1 &> /dev/null; then
echo "$1 could not be found, installing..."
sudo apt-get install -y $2
else
echo "$1 is already installed."
fi
}
# Update package list
sudo apt-get update
# Install necessary tools if not already installed
install_if_not_present fastqc fastqc
install_if_not_present bwa bwa
install_if_not_present samtools samtools
install_if_not_present bcftools bcftools
install_if_not_present java openjdk-11-jre
install_if_not_present wget wget
install_if_not_present unzip unzip
install_if_not_present gzip gzip
install_if_not_present pv pv
#Unzip files
if [[ $fastq_file1 == *.gz ]]; then
gunzip -c $fastq_file1 > $CULTIVAR/${CULTIVAR}_1.fastq
else
cp $fastq_file1 $CULTIVAR/${CULTIVAR}_1.fastq
fi
if [[ $fastq_file2 == *.gz ]]; then
gunzip -c $fastq_file2 > $CULTIVAR/${CULTIVAR}_2.fastq
else
cp $fastq_file2 $CULTIVAR/${CULTIVAR}_2.fastq
fi
# Create output directories
mkdir -p ${CULTIVAR}/qc_reports
# Quality check raw reads
fastqc -t $THREADS ${CULTIVAR}/${CULTIVAR}_1.fastq -o ${CULTIVAR}/qc_reports/
fastqc -t $THREADS ${CULTIVAR}/${CULTIVAR}_2.fastq -o ${CULTIVAR}/qc_reports/
# Download and unzip Trimmomatic
wget -P $CULTIVAR http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip
unzip ${CULTIVAR}/Trimmomatic-0.39.zip -d $CULTIVAR
# Trimmomatic for read trimming
java -jar ${CULTIVAR}/Trimmomatic-0.39/trimmomatic-0.39.jar PE ${CULTIVAR}/${CULTIVAR}_1.fastq ${CULTIVAR}/${CULTIVAR}_2.fastq ${CULTIVAR}/trimmed_1P.fastq ${CULTIVAR}/trimmed_1U.fastq ${CULTIVAR}/trimmed_2P.fastq ${CULTIVAR}/trimmed_2U.fastq ILLUMINACLIP:${CULTIVAR}/Trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
# Delete raw FASTQ files after trimming to save space
rm -f ${CULTIVAR}/${CULTIVAR}_1.fastq
rm -f ${CULTIVAR}/${CULTIVAR}_2.fastq
# BWA index
bwa index $REFERENCE_GENOME
# Align reads to reference genome
bwa mem -t $THREADS $REFERENCE_GENOME ${CULTIVAR}/trimmed_1P.fastq ${CULTIVAR}/trimmed_2P.fastq > ${CULTIVAR}/aligned_reads.sam
# Convert SAM to BAM, sort and index
samtools view -S -b ${CULTIVAR}/aligned_reads.sam | samtools sort -o ${CULTIVAR}/sorted_aligned_reads.bam
samtools index ${CULTIVAR}/sorted_aligned_reads.bam
# Delete trimmed FASTQ files after alignment
rm -f ${CULTIVAR}/trimmed_1P.fastq
rm -f ${CULTIVAR}/trimmed_2P.fastq
rm -f ${CULTIVAR}/trimmed_1U.fastq
rm -f ${CULTIVAR}/trimmed_2U.fastq
#Download Picard
wget -P $CULTIVAR https://github.com/broadinstitute/picard/releases/download/2.26.10/picard.jar
# Mark duplicates
java -jar ${CULTIVAR}/picard.jar MarkDuplicates I=${CULTIVAR}/sorted_aligned_reads.bam O=${CULTIVAR}/dedup_aligned_reads.bam M=${CULTIVAR}/metrics.txt
samtools index ${CULTIVAR}/dedup_aligned_reads.bam
# Call variants
bcftools mpileup -Ou -f $REFERENCE_GENOME ${CULTIVAR}/dedup_aligned_reads.bam | bcftools call -mv -Oz -o ${CULTIVAR}/variants.vcf.gz
bcftools index ${CULTIVAR}/variants.vcf.gz
# Normalize variants
bcftools norm -f $REFERENCE_GENOME ${CULTIVAR}/variants.vcf.gz -Oz -o ${CULTIVAR}/normalized_variants.vcf.gz
bcftools index ${CULTIVAR}/normalized_variants.vcf.gz
#Filter variants
bcftools filter -i 'QUAL>30 & DP>10 & GQ>20' -Oz -o ${CULTIVAR}/filtered_variants.vcf.gz ${CULTIVAR}/normalized_variants.vcf.gz
bcftools index ${CULTIVAR}/filtered_variants.vcf.gz
# Generate consensus sequence
bcftools consensus -f $REFERENCE_GENOME ${CULTIVAR}/normalized_variants.vcf.gz > ${CULTIVAR}/${CULTIVAR}_consensus_sequence.fasta
## Delete Picard jar file if not needed anymore
rm -f ${CULTIVAR}/picard.jar
#Prevent terminal from closing
echo "Press any key to exit..."
read -n 1 -s
And here's the script for the user interface to make it usable by colleagues
#!/bin/bash
# Install Zenity if not installed
if ! command -v zenity &> /dev/null
then
echo "Zenity could not be found, installing..."
sudo apt-get install -y zenity
fi
# Get user input through Zenity dialogs
CULTIVAR=$(zenity --entry --title="Cultivar Name" --text="Enter the name of the cultivar:")
if [ -z "$CULTIVAR" ]; then
zenity --error --text="Cultivar name is required."
exit 1
fi
fastq_file1=$(zenity --file-selection --title="Select forward reads file")
if [ -z "$fastq_file1" ]; then
zenity --error --text="Forward reads file is required."
exit 1
fi
fastq_file2=$(zenity --file-selection --title="Select reverse reads file")
if [ -z "$fastq_file2" ]; then
zenity --error --text="Reverse reads file is required."
exit 1
fi
REFERENCE_GENOME=$(zenity --file-selection --title="Select Reference Genome")
if [ -z "$REFERENCE_GENOME" ]; then
zenity --error --text="Reference genome file is required."
exit 1
fi
# Verify the main genome assembly script exists
if [ ! -f /path/to/Genome_assembly_fastq/genome_assembly_fastq.sh ]; then
zenity --error --text="Main genome assembly script not found at /path/to/Genome_assembly_fastq/genome_assembly_fastq.sh"
exit 1
fi
# Run the main genome assembly script with user inputs
/path/to/Genome_assembly_fastq/genome_assembly_fastq.sh "$CULTIVAR" "$fastq_file1" "$fastq_file2" "$REFERENCE_GENOME"
# Check if the script ran successfully
if [ $? -eq 0 ]; then
zenity --info --text="Genome assembly completed successfully."
else
zenity --error --text="Genome assembly failed."
fi
Thanks!
I would highly encourage you to use a pipeline management software like snakemake or nextflow, it will make things much easier, especially with regards to installing software. Also check out nf-core pipelines (written in nextflow) they probably wrote a very similar, if not identical, pipeline which you can use out of the box or with minor adjustments.
I'd second that. It will also solve problems of re-entrance, reproducibility, and software versions with less code. On the other hand, it comes with another steep learning curve and a bunch of strings (e.g. using conda) attached. Possibly that is a bit much to take in if you are still learning shell programming. But I do agree in principle.