I'm using comm to work out which files have already been processed and which are still to do. The input and output filenames are a little different, so I've used basename and sed to strip away the filepath and suffix information, so they can be compared.
DIR=/Users/michaelflower/Desktop/testing_todo2
TODO=$(comm -3 <(basename -a "$DIR"/*_R1_001.fastq.gz | sed 's/_R1.*//') <(basename -a "$DIR"/results/repeats/*output.txt | sed 's/_repeats_output.*//'))
To make a list of files to be processed by my program I then put the file path and suffix information back in:
for i in $TODO; do echo "$DIR"/${i}"_R1_001.fastq.gz"; done
This works well, except when the output directory is empty. When there is at least one file in the output directory I get exactly the output I expect. But when the output directory is empty (or doesn't yet exist), the first entry in the list is built from the literal string *output.txt, as shown below. This breaks the script that consumes these file names. Any idea how to remove that first entry?
/Users/michaelflower/Desktop/testing_todo2/*output.txt_R1_001.fastq.gz
/Users/michaelflower/Desktop/testing_todo2/MF1-JL125CAG-NPC-20210703_S5_L001_R1_001.fastq.gz
/Users/michaelflower/Desktop/testing_todo2/MF2-JL125CAG-NPC-20210510_S3_L001_R1_001.fastq.gz
/Users/michaelflower/Desktop/testing_todo2/MF3-130CAGiPSC-BL-20210521_S2_L001_R1_001.fastq.gz
/Users/michaelflower/Desktop/testing_todo2/MF4-JL180CAG-NPC1-20211211_S4_L001_R1_001.fastq.gz
/Users/michaelflower/Desktop/testing_todo2/MF5-JL180CAG-NPCp22-20211211_S1_L001_R1_001.fastq.gz
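The stray first entry is consistent with how the shell treats a glob that matches nothing: by default the pattern is passed through unchanged, so basename receives the literal string and the sed expression has nothing to strip. A minimal sketch of that behaviour, assuming default shell settings (nullglob off in bash) and using a throwaway empty directory as a stand-in for results/repeats:

```bash
# EMPTY is a hypothetical stand-in for an empty results/repeats directory.
EMPTY=$(mktemp -d)

# No file matches, so the unexpanded pattern itself is handed to basename.
literal=$(basename "$EMPTY"/*output.txt)

# The sed expression from the question does not match, so nothing is removed.
stripped=$(printf '%s\n' "$literal" | sed 's/_repeats_output.*//')

echo "$stripped"   # prints: *output.txt
```

comm then reports that string as a name found only in the second list, and the loop glues the path and suffix onto it.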
Use a workflow manager like make, snakemake, nextflow, ...
This is a wrong usage of comm: both inputs MUST be sorted.
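One way to act on the sorting comment and avoid the literal glob at the same time is sketched below. It is an illustration under assumptions rather than a tested pipeline: the sample names and the list_ids helper are hypothetical, the demo runs in temporary directories instead of the real DIR, and comm -23 is swapped in for comm -3 so that only names unique to the input list are printed (comm -3 also prints names unique to the second list, indented by a tab).

```bash
# Demo data in throwaway directories (hypothetical sample names).
DIR=$(mktemp -d)
TMP=$(mktemp -d)
mkdir -p "$DIR/results/repeats"
touch "$DIR/sampleA_R1_001.fastq.gz" "$DIR/sampleB_R1_001.fastq.gz"
touch "$DIR/results/repeats/sampleA_repeats_output.txt"

# list_ids SED_EXPR PATH...
# Print one sample ID per existing file, sorted as comm requires.
list_ids() {
    strip=$1; shift
    for f in "$@"; do
        [ -e "$f" ] || continue   # skips a glob that matched nothing
        basename "$f"
    done | sed "$strip" | sort
}

list_ids 's/_R1.*//' "$DIR"/*_R1_001.fastq.gz > "$TMP/all.ids"
list_ids 's/_repeats_output.*//' "$DIR"/results/repeats/*output.txt > "$TMP/done.ids"

# -23 keeps only lines unique to the first file: inputs with no output yet.
TODO=$(comm -23 "$TMP/all.ids" "$TMP/done.ids")

for i in $TODO; do
    echo "$DIR/${i}_R1_001.fastq.gz"
done
```

With the demo files above, the loop should print only the sampleB path. When results/repeats is empty or missing, done.ids is simply empty and every input file is listed. In bash, shopt -s nullglob is another way to make an unmatched glob expand to nothing, but basename -a then complains about a missing operand, which is why this sketch uses the existence check instead.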