Listen to ATpoint's advice: don't work directly on the original files before making sure everything is safe.
I'd recommend using brename, again =]
At first glance, I thought it was a simple task. But ... check the report below:
$ brename -p '.+_(PS\d+).+(_[12]).+' -r '$1$2.fastq.gz' -d
[INFO] main options:
[INFO] ignore case: false
[INFO] search pattern: .+_(PS\d+).+(_[12]).+
[INFO] include filters: .
[INFO] search paths: ./
[INFO]
[INFO] checking: [ ok ] 'XG-31313_PS33_lib631817_10106_3_2.fastq.gz' -> 'PS33_2.fastq.gz'
[INFO] checking: [ ok ] 'XG-31313_PS34_lib631818_10106_3_1.fastq.gz' -> 'PS34_1.fastq.gz'
[INFO] checking: [ ok ] 'XG-31313_PS34_lib631818_10106_3_2.fastq.gz' -> 'PS34_2.fastq.gz'
[ERRO] checking: [ overwriting newly renamed path ] 'XG-31313_PS34_lib631818_10107_2_1.fastq.gz' -> 'PS34_1.fastq.gz'
[ERRO] checking: [ overwriting newly renamed path ] 'XG-31313_PS34_lib631818_10107_2_2.fastq.gz' -> 'PS34_2.fastq.gz'
[INFO] checking: [ ok ] 'XG-31313_PS35_lib631819_10106_3_1.fastq.gz' -> 'PS35_1.fastq.gz'
[INFO] checking: [ ok ] 'XG-31313_PS35_lib631819_10106_3_2.fastq.gz' -> 'PS35_2.fastq.gz'
[INFO] checking: [ ok ] 'XG-31313_PS36_lib631820_10106_3_1.fastq.gz' -> 'PS36_1.fastq.gz'
[INFO] checking: [ ok ] 'XG-31313_PS36_lib631820_10106_3_2.fastq.gz' -> 'PS36_2.fastq.gz'
[ERRO] checking: [ overwriting newly renamed path ] 'XG-31313_PS36_lib631820_10107_2_1.fastq.gz' -> 'PS36_1.fastq.gz'
[ERRO] checking: [ overwriting newly renamed path ] 'XG-31313_PS36_lib631820_10107_2_2.fastq.gz' -> 'PS36_2.fastq.gz'
[ERRO] 4 potential error(s) detected, please check
See the files again:
file1 XG-31313_PS34_lib631818_10106_3_1.fastq.gz -> PS34_1.fastq.gz It's OK
file2 XG-31313_PS34_lib631818_10106_3_2.fastq.gz
file3 XG-31313_PS34_lib631818_10107_2_1.fastq.gz -> PS34_1.fastq.gz Danger!!!! It overwrites the new PS34_1.fastq.gz (original file1)
file4 XG-31313_PS34_lib631818_10107_2_2.fastq.gz
The consequence is that you'll lose file1 and file2.
So you need to concatenate the files that map to the same new name first (file1 with file3, and file2 with file4)!
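A quick way to see which new names collide, using only standard tools. This is a minimal sketch, not part of the original answer: the scratch directory and the empty files are stand-ins, and the sed pattern is a stricter variant of the brename pattern that assumes every name ends in _1.fastq.gz or _2.fastq.gz.

```shell
# Scratch directory with empty stand-in files (hypothetical setup).
demo=$(mktemp -d)
touch "$demo/XG-31313_PS34_lib631818_10106_3_1.fastq.gz" \
      "$demo/XG-31313_PS34_lib631818_10107_2_1.fastq.gz" \
      "$demo/XG-31313_PS35_lib631819_10106_3_1.fastq.gz"

# Map every file name to its new name, then keep only the new names
# that appear more than once (uniq -d prints duplicated lines).
collisions=$(ls "$demo" \
    | sed -E 's/^.*_(PS[0-9]+)_.*_([12])\.fastq\.gz$/\1_\2.fastq.gz/' \
    | sort | uniq -d)
echo "$collisions"
```

Any name printed here would be written twice by a plain rename, so those inputs have to be merged instead.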
A safe answer is (csvtk and rush are needed):
# check files
ls *.fastq.gz \
| csvtk mutate -Ht \
| csvtk replace -Ht -p '.+_(PS\d+).+(_[12]).+' -r '$1$2.fastq.gz' \
| csvtk fold -Ht -f 1 -v 2 -s ' '
PS33_2.fastq.gz XG-31313_PS33_lib631817_10106_3_2.fastq.gz
PS34_1.fastq.gz XG-31313_PS34_lib631818_10106_3_1.fastq.gz XG-31313_PS34_lib631818_10107_2_1.fastq.gz
PS34_2.fastq.gz XG-31313_PS34_lib631818_10106_3_2.fastq.gz XG-31313_PS34_lib631818_10107_2_2.fastq.gz
PS35_1.fastq.gz XG-31313_PS35_lib631819_10106_3_1.fastq.gz
PS35_2.fastq.gz XG-31313_PS35_lib631819_10106_3_2.fastq.gz
PS36_1.fastq.gz XG-31313_PS36_lib631820_10106_3_1.fastq.gz XG-31313_PS36_lib631820_10107_2_1.fastq.gz
PS36_2.fastq.gz XG-31313_PS36_lib631820_10106_3_2.fastq.gz XG-31313_PS36_lib631820_10107_2_2.fastq.gz
# ready to go
ls *.fastq.gz \
| csvtk mutate -Ht \
| csvtk replace -Ht -p '.+_(PS\d+).+(_[12]).+' -r '$1$2.fastq.gz' \
| csvtk fold -Ht -f 1 -v 2 -s ' ' \
| rush -j 1 -d "\t" 'cat {2} > {1}' --dry-run
cat XG-31313_PS33_lib631817_10106_3_2.fastq.gz > PS33_2.fastq.gz
cat XG-31313_PS34_lib631818_10106_3_1.fastq.gz XG-31313_PS34_lib631818_10107_2_1.fastq.gz > PS34_1.fastq.gz
cat XG-31313_PS34_lib631818_10106_3_2.fastq.gz XG-31313_PS34_lib631818_10107_2_2.fastq.gz > PS34_2.fastq.gz
cat XG-31313_PS35_lib631819_10106_3_1.fastq.gz > PS35_1.fastq.gz
cat XG-31313_PS35_lib631819_10106_3_2.fastq.gz > PS35_2.fastq.gz
cat XG-31313_PS36_lib631820_10106_3_1.fastq.gz XG-31313_PS36_lib631820_10107_2_1.fastq.gz > PS36_1.fastq.gz
cat XG-31313_PS36_lib631820_10106_3_2.fastq.gz XG-31313_PS36_lib631820_10107_2_2.fastq.gz > PS36_2.fastq.gz
# remove --dry-run to apply the renaming.
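Plain cat works on the compressed files because a gzip file may consist of several concatenated members, and gzip decompresses them as one stream. A small sketch to convince yourself, using made-up two-lane data in a scratch directory (all file names here are hypothetical):

```shell
# Two tiny gzipped FASTQ files standing in for two lanes of the same sample.
demo=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' | gzip > "$demo/lane1_1.fastq.gz"
printf '@r2\nTTTT\n+\nIIII\n' | gzip > "$demo/lane2_1.fastq.gz"

# Concatenate the compressed files directly, without decompressing.
cat "$demo/lane1_1.fastq.gz" "$demo/lane2_1.fastq.gz" > "$demo/merged_1.fastq.gz"

# gzip -t checks integrity; the line count should be the sum of both inputs.
gzip -t "$demo/merged_1.fastq.gz" && echo "merged file is valid gzip"
merged_lines=$(gzip -dc "$demo/merged_1.fastq.gz" | wc -l)
echo "lines in merged file: $merged_lines"
```

Comparing the line count of the merged file with the summed counts of its inputs is a cheap sanity check before deleting anything.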
Be sure not to touch the original files. Make a new directory and symlink the files into it, then test your commands on these links until they work properly. You do not want to test on original data.
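The symlink setup could look roughly like this. It is only a sketch: the two mktemp directories stand in for the real data directory and the new test directory, and the file is an empty placeholder.

```shell
# Stand-in for the directory that holds the real data (hypothetical).
orig=$(mktemp -d)
touch "$orig/XG-31313_PS33_lib631817_10106_3_2.fastq.gz"

# New directory for experiments, filled with symlinks to the originals.
sandbox=$(mktemp -d)
for f in "$orig"/*.fastq.gz; do
    ln -s "$f" "$sandbox/"
done

# Renaming a link only touches the link; the original keeps its name.
mv "$sandbox/XG-31313_PS33_lib631817_10106_3_2.fastq.gz" "$sandbox/PS33_2.fastq.gz"
ls -l "$sandbox"
```

Renaming or deleting the links is harmless, but redirecting output into a link (for example with >) writes through to the original file, so keep output names distinct from the link names.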
Assuming the list of files is in test (remove echo before mv when ready to execute):
NOTE 1: If you want to be super careful, mv can be replaced by a cp so the original files will remain intact.
NOTE 2: It appears that there are identical files with the same names if we simply act on the parts OP had asked to remove. So I am moving my example to a comment. It may still help someone else when the file names are not going to overlap.
Sorry GenoMax, but the commands are dangerous here.
That is the main reason I have an echo in my example. OP needs to understand what is happening before they execute the commands.
In any case, because of the duplicate file name issue that you pointed out below, I am moving my post to a comment.
bionix - Concatenating paired-end data files end to end may make them unusable in most programs. This is not how programs expect paired-end data to be laid out in files. There is a particular format called "interleaved" fastq, and that would be the proper way of handling paired-end data.
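For illustration, an interleaved file alternates the two reads of each pair (R1 record, then its R2 mate). A rough sketch with standard tools, assuming 4-line fastq records, the same read order in both files, and no tab characters in the headers; the file names and reads are made up:

```shell
# Tiny made-up paired-end files in a scratch directory.
demo=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nGGGG\n+\nIIII\n' | gzip > "$demo/S_1.fastq.gz"
printf '@r1/2\nTTTT\n+\nIIII\n@r2/2\nCCCC\n+\nIIII\n' | gzip > "$demo/S_2.fastq.gz"

# Put each 4-line record on one tab-separated line.
gzip -dc "$demo/S_1.fastq.gz" | paste - - - - > "$demo/r1.tsv"
gzip -dc "$demo/S_2.fastq.gz" | paste - - - - > "$demo/r2.tsv"

# Join mate records side by side, then turn the tabs back into newlines.
paste "$demo/r1.tsv" "$demo/r2.tsv" | tr '\t' '\n' > "$demo/S_interleaved.fastq"
cat "$demo/S_interleaved.fastq"
```

Dedicated tools exist for this as well; the point is only that mates sit next to each other rather than one whole file after the other.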