3.0 years ago
michael.flower.14
I'm using diff to work out which files have already been processed and which are still to do. The input and output filenames are a little different, so I've used basename and sed to strip away the filepath and suffix information so that they can be compared.
TODO=$(diff -s <(basename -a ./data/*_R1_001.fastq.gz | sed 's/_R1.*//') <(basename -a ./results/repeats/*output.txt | sed 's/_repeats_output.*//'))
echo $TODO
This outputs
1d0 < MF3-130CAGiPSC-BL-20210521_S2_L001
Which is exactly what I want, except for the 1d0 < bit. I've been looking at the diff manual and can't see how to get it to output just the filename without its default syntax (1d0 <). Any help please!
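One possible approach, sketched here rather than taken from the thread: comm compares two sorted lists and, with -23, prints only the lines unique to the first list, without any diff-style markers. The temporary directory and the sampleA/sampleB names are made up for the demo; the sed patterns are the ones from the question.

```shell
#!/usr/bin/env bash
# Throwaway demo layout; directory and sample names are hypothetical.
workdir=$(mktemp -d)
mkdir -p "$workdir/data" "$workdir/results/repeats"
touch "$workdir/data/sampleA_R1_001.fastq.gz" "$workdir/data/sampleB_R1_001.fastq.gz"
touch "$workdir/results/repeats/sampleA_repeats_output.txt"

# comm expects sorted input. -23 suppresses column 2 (lines only in the
# second list) and column 3 (lines in both), leaving inputs with no output yet.
TODO=$(comm -23 \
  <(basename -a "$workdir"/data/*_R1_001.fastq.gz | sed 's/_R1.*//' | sort) \
  <(basename -a "$workdir"/results/repeats/*output.txt | sed 's/_repeats_output.*//' | sort))

echo "$TODO"

rm -rf "$workdir"
```

Here sampleA already has an output file, so only sampleB is left in TODO.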
This is great, and I accepted it, but I've got one other question. I'm now trying to add my filepath and suffix back onto the output and am having difficulty.
produces:
But what I want is the filepath and suffix added to each filename! I want to use this as the input to a function, to list the files it needs to operate on.
Not sure where $DIR is coming from, but
echo ${DIR}/${TODO}_R1_001.fastq.gz
should be all you need. You can capture the directory name using the dirname command.
This might make it clearer. You'll see I'm trying to add "/Users/michaelflower/Desktop/JL_MSH3" before each filename in $TODO and add "_R1_001.fastq.gz" after each.
However, what I get is "/Users/michaelflower/Desktop/JL_MSH3", then all three filenames as a string with spaces in between, then just one "_R1_001.fastq.gz" at the end.
How do I get the comm output to be file names, as separate entities, so that we can add the prefix and suffix to each?
I see, so that first part can be fixed by:
echo "$DIR"/${TODO}"_R1_001.fastq.gz"
comm -12 file1 file2
should only print lines that are common to the two files.
I'm afraid I'm still getting the same problem. It seems to interpret the comm output as a single string, rather than separate filenames ...
See the following:
My hero, that for loop works perfectly!!
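For readers following along, a loop of this kind might look like the sketch below. It is not necessarily the loop that was posted; DIR and the sample names are placeholders, and it assumes the names contain no spaces or glob characters.

```shell
#!/usr/bin/env bash
# DIR and the sample names are placeholders for illustration.
DIR="/path/to/project/data"
TODO=$'sampleA\nsampleB\nsampleC'

FILES=()
# Unquoted $TODO is split on whitespace, so each name is handled on its own
# and gets its own prefix and suffix.
for name in $TODO; do
  FILES+=("${DIR}/${name}_R1_001.fastq.gz")
done

# One full path per line.
printf '%s\n' "${FILES[@]}"
```

Collecting the paths in an array keeps them as separate entries, so "${FILES[@]}" can be passed straight to a function or command.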
This is working great, except when the output directory is empty. When there's at least one file in the output directory I get the perfect output. But when the output directory is empty (or doesn't yet exist), the first entry in the list is *output.txt. This is messing up the script I'm using these file names in. Any idea how to remove that first entry?
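A likely cause is that bash leaves a glob pattern as literal text when nothing matches it, so basename receives the string *output.txt. One way around this, as a sketch only: enable the nullglob shell option while expanding the pattern, and skip basename when the list is empty. The outdir, outputs, and DONE names are assumptions for the demo.

```shell
#!/usr/bin/env bash
outdir=$(mktemp -d)   # empty stand-in for results/repeats

# With nullglob set, a pattern that matches nothing expands to nothing
# instead of being left as the literal string "*output.txt".
shopt -s nullglob
outputs=("$outdir"/*output.txt)
shopt -u nullglob

DONE=""
# basename complains when given no operands, so only call it if files exist.
if [ "${#outputs[@]}" -gt 0 ]; then
  DONE=$(basename -a "${outputs[@]}" | sed 's/_repeats_output.*//' | sort)
fi

echo "Already processed: ${DONE:-none}"

rm -rf "$outdir"
```

The same expansion yields an empty array when the directory does not exist yet, so both cases from the question are covered.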