I am trying to set up a .bat file to run Trimmomatic on multiple fastq.gz files; I am only interested in using ILLUMINACLIP. I keep getting an error with the adapter file path (see the example log file with the error message below). Any ideas?
TrimmomaticPE: Started with arguments:
-phred33 C:\Users\Shared\CheeseStudy\Trimmomatic\RawCheesefastq\CSM020_v1_S1_L001_R1_001.fastq.gz C:\Users\Shared\CheeseStudy\Trimmomatic\RawCheesefastq\CSM020_v1_S1_L001.fastq_R2_001.fastq.gz C:\Users\Shared\CheeseStudy\Trimmomatic\TrimmoOutputs\CSM020_v1_S1_L001.fastq_R1_paired.fastq.gz C:\Users\Shared\CheeseStudy\Trimmomatic\TrimmoOutputs\CSM020_v1_S1_L001.fastq_R1_unpaired.fastq.gz C:\Users\Shared\CheeseStudy\Trimmomatic\TrimmoOutputs\CSM020_v1_S1_L001.fastq_R2_paired.fastq.gz C:\Users\Shared\CheeseStudy\Trimmomatic\TrimmoOutputs\CSM020_v1_S1_L001.fastq_R2_unpaired.fastq.gz ILLUMINACLIP:C:\Users\Shared\CheeseStudy\Trimmomatic\Trimmomatic-0.39\Trimmomatic-0.39\adapters\NexteraPE-PE.fa:2:30:10:5
Exception in thread "main" java.lang.NumberFormatException: For input string: "\Users\Shared\CheeseStudy\Trimmomatic\Trimmomatic-0.39\Trimmomatic-0.39\adapters\NexteraPE-PE.fa"
at java.lang.NumberFormatException.forInputString(Unknown Source)
at java.lang.Integer.parseInt(Unknown Source)
at java.lang.Integer.parseInt(Unknown Source)
at org.usadellab.trimmomatic.trim.IlluminaClippingTrimmer.makeIlluminaClippingTrimmer(IlluminaClippingTrimmer.java:54)
at org.usadellab.trimmomatic.trim.TrimmerFactory.makeTrimmer(TrimmerFactory.java:32)
at org.usadellab.trimmomatic.Trimmomatic.createTrimmers(Trimmomatic.java:59)
at org.usadellab.trimmomatic.TrimmomaticPE.run(TrimmomaticPE.java:552)
at org.usadellab.trimmomatic.Trimmomatic.main(Trimmomatic.java:80)
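One possible reading of this trace, offered as an assumption rather than something confirmed from the Trimmomatic source: the ILLUMINACLIP argument seems to be split on every colon, so the colon after the drive letter in C:\... shifts all fields by one and the rest of the path lands where the first number is expected. That would match the exception text, which shows the path with the leading C: missing. The Python sketch below only imitates that splitting; split_illuminaclip is a hypothetical helper, not Trimmomatic code, and the paths are shortened for illustration.

```python
def split_illuminaclip(step: str) -> list[str]:
    """Split an ILLUMINACLIP step on ':' (assumed to mirror the tool's parsing)."""
    name, _, params = step.partition(":")
    if name != "ILLUMINACLIP":
        raise ValueError(f"not an ILLUMINACLIP step: {step!r}")
    return params.split(":")


# Shortened, illustrative paths: one absolute (with drive letter), one relative.
absolute = r"ILLUMINACLIP:C:\adapters\NexteraPE-PE.fa:2:30:10:5"
relative = r"ILLUMINACLIP:adapters\NexteraPE-PE.fa:2:30:10:5"

abs_fields = split_illuminaclip(absolute)
rel_fields = split_illuminaclip(relative)

# With the drive letter, the second field is the path, not a number.
print(abs_fields)
# Without it, the second field is the first numeric parameter.
print(rel_fields)
```

If that reading is right, a relative adapter path (with the working directory set to match) avoids the extra colon, which is presumably why the scripts in this thread use a relative path for the adapters.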
I have a number of fastq files: 69 participant IDs (e.g., CSM020), two timepoints for each participant (v1 or v2), and then R1 and R2 files, so each participant ID has 4 fastq files. Here is an example of what the fastqs look like for one participant:
I am trying to set up a .bat file to run the above Trimmomatic settings on all of the fastq files in the directory "C:\Users\Shared\CheeseStudy\Trimmomatic\RawCheesefastq". I am struggling to set it up so that it identifies the R1 and R2 file pairs correctly while also handling the v1 and v2 timepoints. I have been trying with code like this (writing the .bat file in Notepad and running it in cmd), but it constructs the R2 file name incorrectly and of course fails. Any ideas?
@echo off
setlocal EnableDelayedExpansion
rem Define base path
set "BASE_PATH=C:\Users\Shared\CheeseStudy\Trimmomatic"
rem Define paths relative to the base path
set "TRIMMOMATIC_PATH=%BASE_PATH%\Trimmomatic-0.39\Trimmomatic-0.39\trimmomatic-0.39.jar"
set "ADAPTERS_PATH=Trimmomatic-0.39\Trimmomatic-0.39\adapters\NexteraPE-PEUpdated.fa"
set "INPUT_DIR=%BASE_PATH%\RawCheesefastq"
set "OUTPUT_DIR=%BASE_PATH%\TrimmoOutputs"
set "LOG_DIR=%BASE_PATH%\TrimmoLog"
rem Create output and log directories if they don't exist
if not exist "%OUTPUT_DIR%" mkdir "%OUTPUT_DIR%"
if not exist "%LOG_DIR%" mkdir "%LOG_DIR%"
rem Change to the base directory to use relative paths
pushd "%BASE_PATH%"
rem Loop through R1 files in the input directory
for %%f in ("%INPUT_DIR%\*R1_001.fastq.gz") do (
rem Extract base name: the ~n modifier drops only the final .gz, so strip _R1_001.fastq as well
set "FILENAME=%%~nf"
set "BASE=!FILENAME:_R1_001.fastq=!"
rem Construct the corresponding R2 file path
set "FILE_R2=%INPUT_DIR%\!BASE!_R2_001.fastq.gz"
rem Define output file names
set "OUTPUT_R1_PAIRED=%OUTPUT_DIR%\!BASE!_R1_paired.fastq.gz"
set "OUTPUT_R1_UNPAIRED=%OUTPUT_DIR%\!BASE!_R1_unpaired.fastq.gz"
set "OUTPUT_R2_PAIRED=%OUTPUT_DIR%\!BASE!_R2_paired.fastq.gz"
set "OUTPUT_R2_UNPAIRED=%OUTPUT_DIR%\!BASE!_R2_unpaired.fastq.gz"
set "LOG_FILE=%LOG_DIR%\!BASE!.log"
rem Debug: Echo the file paths
echo Processing R1 file: %%f
echo Looking for R2 file: !FILE_R2!
rem Check if the R2 file exists
if exist "!FILE_R2!" (
rem Run Trimmomatic
java -jar "%TRIMMOMATIC_PATH%" PE -phred33 "%%f" "!FILE_R2!" "!OUTPUT_R1_PAIRED!" "!OUTPUT_R1_UNPAIRED!" "!OUTPUT_R2_PAIRED!" "!OUTPUT_R2_UNPAIRED!" ILLUMINACLIP:"%ADAPTERS_PATH%":2:30:10:5 > "!LOG_FILE!" 2>&1
rem Log completion
echo Done with %%f
) else (
echo File not found: !FILE_R2!
echo Skipping this pair.
echo Missing R2 file: !FILE_R2! >> "%LOG_DIR%\missing_files.log"
)
)
rem Return to the original directory
popd
endlocal
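The pairing step is easier to reason about outside of batch. Here is a rough sketch of the same idea in Python (not batch, so the logic can be checked in isolation), assuming every R1 file ends in _R1_001.fastq.gz and its mate differs only in R1 versus R2; r2_name_for is a made-up helper name. It also reproduces the pitfall visible in the log above: dropping only the last extension leaves .fastq in the base name, which gives names like CSM020_v1_S1_L001.fastq_R2_001.fastq.gz.

```python
import os

R1_SUFFIX = "_R1_001.fastq.gz"
R2_SUFFIX = "_R2_001.fastq.gz"


def r2_name_for(r1_name: str) -> str:
    """Return the expected R2 file name for an R1 file name."""
    if not r1_name.endswith(R1_SUFFIX):
        raise ValueError(f"unexpected R1 name: {r1_name!r}")
    base = r1_name[: -len(R1_SUFFIX)]  # strip the whole suffix, not just .gz
    return base + R2_SUFFIX


r1 = "CSM020_v1_S1_L001_R1_001.fastq.gz"

# Pitfall: splitext removes only the final extension (.gz),
# much like the ~n modifier on a batch loop variable.
stem, _ = os.path.splitext(r1)
wrong = stem.replace("_R1_001", "") + R2_SUFFIX

print(wrong)            # base name still contains ".fastq"
print(r2_name_for(r1))  # mate name built from the fully stripped base
```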
Bioinformatics, and especially the preprocessing of NGS data, is done in Unix environments, not Windows. There is no point in debugging Windows errors, as tool developers never had Windows in mind. It's wasted energy. Use a Linux machine or install WSL2 for Windows and your problems are gone.
Thanks, I asked the robot and it seems to be working through the files and the outputs look correct so far - I used the code below in case anyone else is suffering with this in the future:
@echo off
setlocal enabledelayedexpansion
set TRIMMOMATIC_JAR="C:\Users\Shared\CheeseStudy\Trimmomatic\Trimmomatic-0.39\Trimmomatic-0.39\trimmomatic-0.39.jar"
set ADAPTERS="Trimmomatic-0.39\Trimmomatic-0.39\adapters\NexteraPE-PEUpdated.fa"
set LOG_DIR="C:\Users\Shared\CheeseStudy\Trimmomatic\TrimmoLog"
set INPUT_DIR="C:\Users\Shared\CheeseStudy\Trimmomatic\RawCheesefastq"
set OUTPUT_DIR="C:\Users\Shared\CheeseStudy\Trimmomatic\TrimmoOutputs"
:: Initialize an empty list to store participant IDs
set PARTICIPANTS=
:: Loop through files to extract unique participant IDs
for %%F in (%INPUT_DIR%\*_v*_S*_L001_R1_001.fastq.gz) do (
set "filename=%%~nF"
for /f "tokens=1 delims=_" %%A in ("!filename!") do (
if "!PARTICIPANTS!" == "" (
set "PARTICIPANTS=%%A"
) else (
set "found=0"
for %%B in (!PARTICIPANTS!) do (
if "%%A" == "%%B" set "found=1"
)
if !found! == 0 (
set "PARTICIPANTS=!PARTICIPANTS! %%A"
)
)
)
)
:: Debugging output to check participant IDs
echo Participants: %PARTICIPANTS%
:: Loop through each participant
for %%P in (%PARTICIPANTS%) do (
rem Loop through each timepoint
for %%T in (v1 v2) do (
rem Clear values left over from the previous sample
set "R1_FILE="
set "R2_FILE="
for %%F in (%INPUT_DIR%\%%P_%%T_S*_L001_R1_001.fastq.gz) do (
set R1_FILE=%%F
)
for %%F in (%INPUT_DIR%\%%P_%%T_S*_L001_R2_001.fastq.gz) do (
set R2_FILE=%%F
)
set BASEOUT="%OUTPUT_DIR%\%%P_%%T"
set LOGFILE="%LOG_DIR%\%%P_%%T_Log"
rem Debugging output to check file paths
echo R1_FILE: !R1_FILE!
echo R2_FILE: !R2_FILE!
echo BASEOUT: !BASEOUT!
echo LOGFILE: !LOGFILE!
if not defined R1_FILE (
echo Skipping %%P_%%T: no R1 file found
) else if not defined R2_FILE (
echo Skipping %%P_%%T: no R2 file found
) else (
rem Execute Trimmomatic command
java -jar %TRIMMOMATIC_JAR% PE -phred33 -trimlog !LOGFILE! !R1_FILE! !R2_FILE! !BASEOUT!_1P.fq.gz !BASEOUT!_1U.fq.gz !BASEOUT!_2P.fq.gz !BASEOUT!_2U.fq.gz ILLUMINACLIP:%ADAPTERS%:2:30:10:5
)
)
)
pause
Thank you.
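For anyone who wants to check the per-sample logic before running anything, the following is a minimal Python sketch (not batch) that only builds the Trimmomatic argument lists and prints them. It assumes the file naming shown in this thread; build_commands, the C:\in and C:\out directories, and the third file name are hypothetical and exist only for illustration. The -trimlog option is left out to keep it short.

```python
from pathlib import PureWindowsPath

R1_SUFFIX = "_R1_001.fastq.gz"


def build_commands(names, in_dir, out_dir, jar, adapters):
    """Return one Trimmomatic PE argument list per complete R1/R2 pair."""
    present = set(names)
    commands = []
    for r1 in sorted(n for n in names if n.endswith(R1_SUFFIX)):
        r2 = r1[: -len(R1_SUFFIX)] + "_R2_001.fastq.gz"
        if r2 not in present:
            continue  # skip samples without a mate instead of reusing old values
        participant, timepoint = r1.split("_")[:2]  # e.g. CSM020, v1
        base = str(PureWindowsPath(out_dir) / f"{participant}_{timepoint}")
        commands.append([
            "java", "-jar", jar, "PE", "-phred33",
            str(PureWindowsPath(in_dir) / r1),
            str(PureWindowsPath(in_dir) / r2),
            base + "_1P.fq.gz", base + "_1U.fq.gz",
            base + "_2P.fq.gz", base + "_2U.fq.gz",
            f"ILLUMINACLIP:{adapters}:2:30:10:5",
        ])
    return commands


files = [
    "CSM020_v1_S1_L001_R1_001.fastq.gz",
    "CSM020_v1_S1_L001_R2_001.fastq.gz",
    "CSM020_v2_S2_L001_R1_001.fastq.gz",  # hypothetical name: mate missing
]
cmds = build_commands(files, r"C:\in", r"C:\out", "trimmomatic-0.39.jar",
                      r"adapters\NexteraPE-PEUpdated.fa")
for cmd in cmds:
    print(" ".join(cmd))
```

Each list could then be handed to subprocess.run once the printed commands look right. Skipping incomplete pairs explicitly avoids pairing an R1 file with whatever R2 file happened to be found for the previous sample.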
I am able to run Trimmomatic for one file using the following in the Windows cmd (taking one sample, CSM020, as an example):
You are brave, writing .bat files for processing of NGS data on Windows. :-) I suggest using ChatGPT to see if it can catch the problem you are having with recreating the R2 file name.