Question

Can I trust this script I wrote for assembling plant genomes?

1

Entering edit mode

3 months ago

breri ▴ 10

Hi everyone, I need help with a script I wrote to make genome assembly as easy as possible (also to allow lab mates to use it). This is basically my first time coding as my scientific training has always revolved around wet-lab.

Since many SRA files of rice cultivars are available on NCBI, I thought of substituting time-consuming genotyping with genome assembly to look for allelic variations (we are also planning on sequencing our own lines in the future). I'm aware that some variations will be hard (or even impossible) to assemble since the pipeline is alignement-based (and if the entire locus is missing in the reference it cannot be aligned in my cultivars of interest), but I'm here to ask if you see any issues in the script that could lead me to false results.

#!/bin/bash

# Check if the correct number of arguments are provided
if [ "$#" -ne 4 ]; then
    echo "Usage: $0 <cultivar_name> <fastq_file> <reference_genome>"
    exit 1
fi

# Variables from user input
CULTIVAR=$1
fastq_file1=$2
fastq_file2=$3
REFERENCE_GENOME=$4
THREADS=$(nproc) # Number of available CPU cores

# Create directory for the cultivar
mkdir -p $CULTIVAR
cd
# Function to check if a program is installed, and install it if not
install_if_not_present() {
    if ! command -v $1 &> /dev/null; then
        echo "$1 could not be found, installing..."
        sudo apt-get install -y $2
    else
        echo "$1 is already installed."
    fi
}

# Update package list
sudo apt-get update

# Install necessary tools if not already installed
install_if_not_present fastqc fastqc
install_if_not_present bwa bwa
install_if_not_present samtools samtools
install_if_not_present bcftools bcftools
install_if_not_present java openjdk-11-jre
install_if_not_present wget wget
install_if_not_present unzip unzip
install_if_not_present gzip gzip
install_if_not_present pv pv


#Unzip files
if [[ $fastq_file1 == *.gz ]]; then
    gunzip -c $fastq_file1 > $CULTIVAR/${CULTIVAR}_1.fastq
else
    cp $fastq_file1 $CULTIVAR/${CULTIVAR}_1.fastq
fi

if [[ $fastq_file2 == *.gz ]]; then
    gunzip -c $fastq_file2 > $CULTIVAR/${CULTIVAR}_2.fastq
else
    cp $fastq_file2 $CULTIVAR/${CULTIVAR}_2.fastq
fi

# Create output directories
mkdir -p ${CULTIVAR}/qc_reports

# Quality check raw reads
fastqc -t $THREADS ${CULTIVAR}/${CULTIVAR}_1.fastq -o ${CULTIVAR}/qc_reports/
fastqc -t $THREADS ${CULTIVAR}/${CULTIVAR}_2.fastq -o ${CULTIVAR}/qc_reports/


# Download and unzip Trimmomatic
wget -P $CULTIVAR http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip
unzip ${CULTIVAR}/Trimmomatic-0.39.zip -d $CULTIVAR


# Trimmomatic for read trimming
java -jar ${CULTIVAR}/Trimmomatic-0.39/trimmomatic-0.39.jar PE ${CULTIVAR}/${CULTIVAR}_1.fastq ${CULTIVAR}/${CULTIVAR}_2.fastq ${CULTIVAR}/trimmed_1P.fastq ${CULTIVAR}/trimmed_1U.fastq ${CULTIVAR}/trimmed_2P.fastq ${CULTIVAR}/trimmed_2U.fastq ILLUMINACLIP:${CULTIVAR}/Trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

# Delete raw FASTQ files after trimming to save space
rm -f ${CULTIVAR}/${CULTIVAR}_1.fastq
rm -f ${CULTIVAR}/${CULTIVAR}_2.fastq

# BWA index
bwa index $REFERENCE_GENOME

# Align reads to reference genome
bwa mem -t $THREADS $REFERENCE_GENOME ${CULTIVAR}/trimmed_1P.fastq ${CULTIVAR}/trimmed_2P.fastq > ${CULTIVAR}/aligned_reads.sam

# Convert SAM to BAM, sort and index
samtools view -S -b ${CULTIVAR}/aligned_reads.sam | samtools sort -o ${CULTIVAR}/sorted_aligned_reads.bam
samtools index ${CULTIVAR}/sorted_aligned_reads.bam

# Delete trimmed FASTQ files after alignment
rm -f ${CULTIVAR}/trimmed_1P.fastq
rm -f ${CULTIVAR}/trimmed_2P.fastq
rm -f ${CULTIVAR}/trimmed_1U.fastq
rm -f ${CULTIVAR}/trimmed_2U.fastq

#Download Picard
wget -P $CULTIVAR https://github.com/broadinstitute/picard/releases/download/2.26.10/picard.jar

# Mark duplicates
java -jar ${CULTIVAR}/picard.jar MarkDuplicates I=${CULTIVAR}/sorted_aligned_reads.bam O=${CULTIVAR}/dedup_aligned_reads.bam M=${CULTIVAR}/metrics.txt
samtools index ${CULTIVAR}/dedup_aligned_reads.bam

# Call variants
bcftools mpileup -Ou -f $REFERENCE_GENOME ${CULTIVAR}/dedup_aligned_reads.bam | bcftools call -mv -Oz -o ${CULTIVAR}/variants.vcf.gz
bcftools index ${CULTIVAR}/variants.vcf.gz

# Normalize variants
bcftools norm -f $REFERENCE_GENOME ${CULTIVAR}/variants.vcf.gz -Oz -o ${CULTIVAR}/normalized_variants.vcf.gz
bcftools index ${CULTIVAR}/normalized_variants.vcf.gz


#Filter variants
bcftools filter -i 'QUAL>30 & DP>10 & GQ>20' -Oz -o ${CULTIVAR}/filtered_variants.vcf.gz ${CULTIVAR}/normalized_variants.vcf.gz
bcftools index ${CULTIVAR}/filtered_variants.vcf.gz

# Generate consensus sequence
bcftools consensus -f $REFERENCE_GENOME ${CULTIVAR}/normalized_variants.vcf.gz > ${CULTIVAR}/${CULTIVAR}_consensus_sequence.fasta

## Delete Picard jar file if not needed anymore
rm -f ${CULTIVAR}/picard.jar

#Prevent terminal from closing
echo "Press any key to exit..."
read -n 1 -s

And here's the script for the user interface to make it usable by colleagues

#!/bin/bash

# Install Zenity if not installed
if ! command -v zenity &> /dev/null
then
    echo "Zenity could not be found, installing..."
    sudo apt-get install -y zenity
fi

# Get user input through Zenity dialogs
CULTIVAR=$(zenity --entry --title="Cultivar Name" --text="Enter the name of the cultivar:")
if [ -z "$CULTIVAR" ]; then
    zenity --error --text="Cultivar name is required."
    exit 1
fi

fastq_file1=$(zenity --file-selection --title="Select forward reads file")
if [ -z "$fastq_file1" ]; then
    zenity --error --text="Forward reads file is required."
    exit 1
fi

fastq_file2=$(zenity --file-selection --title="Select reverse reads file")
if [ -z "$fastq_file2" ]; then
    zenity --error --text="Reverse reads file is required."
    exit 1
fi

REFERENCE_GENOME=$(zenity --file-selection --title="Select Reference Genome")
if [ -z "$REFERENCE_GENOME" ]; then
    zenity --error --text="Reference genome file is required."
    exit 1
fi

# Verify the main genome assembly script exists
if [ ! -f /path/to/Genome_assembly_fastq/genome_assembly_fastq.sh ]; then
    zenity --error --text="Main genome assembly script not found at /path/to/Genome_assembly_fastq/genome_assembly_fastq.sh"
    exit 1
fi

# Run the main genome assembly script with user inputs
/path/to/Genome_assembly_fastq/genome_assembly_fastq.sh "$CULTIVAR" "$fastq_file1" "$fastq_file2" "$REFERENCE_GENOME"

# Check if the script ran successfully
if [ $? -eq 0 ]; then
    zenity --info --text="Genome assembly completed successfully."
else
    zenity --error --text="Genome assembly failed."
fi

Thanks!

assembly genome. script plant genomics • 515 views

ADD COMMENT • link updated 3 months ago by Michael 55k • written 3 months ago by breri ▴ 10

0

Entering edit mode

I would highly encourage you to use a pipeline management software like snakemake or nextflow, it will make things much easier, especially with regards to installing software. Also check out nf-core pipelines (written in nextflow) they probably wrote a very similar, if not identical, pipeline which you can use out of the box or with minor adjustments.

ADD REPLY • link 3 months ago by Asaf 10k

0

Entering edit mode

I'd second that. It will also solve problems of re-entrance, reproducibility, and software versions with less code. On the other hand, it comes with another steep learning curve and a bunch of strings (e.g. using conda) attached. Possibly that is a bit much to take in if you are still learning shell programming. But I do agree in principle.

ADD REPLY • link 3 months ago by Michael 55k

score 2 · Answer 1 · 2024-12-09

Dear Breri,

Congratulations on your first work-flow script. Given the users have sudo privileges and are running on a debian based Linux system (that can be a problem in itself but I won't go into that), I am sure that it runs and produces a FASTA file as output. However, there are some serious caveats. The most important one is that it doesn't produce an assembly in the sense most people here would think of an assembly and therefore strictly speaking the answer to your question would be "No". As the code comments in the script correctly point out your script does variant calling with BCFtools and then calls a "consensus" method which generates an ALT genome. As I have pointed out time and time again this is problematic and counterintuitive, the so-called consensus doesn't generate a consensus: Generating consensus sequence from bam file

You do have some variant filtration, so that is better. Don't get me wrong, BCFtools is a great tookit, but there are more accurate options for variant calling. This option is relatively fast. The essential question is, what do you want to do with the output? Here are some other options:

If you want the SNV matrix for phylogenies, you can use the vcf file directly with some tools, without the detour via bcftools consensus
If you want the best variant calls, even though slower, use GATK HaplotypeCaller
If you want a real consensus, use samtools consensus without the detour via a variant calling step
If you want a de novo assembly use SPAdes or Velvet or some other deBruijn graph assemblers for short reads, even though the assembly won't be great based on short reads only