Changing the names of files
6
6.5 years ago
Sam ▴ 150

Dear All

I have about 200 libraries with the naming format ALT1_1_clean.fq.gz, but I have to change the names so that the pipeline recognizes them. Could you guide me on how to do this?

Thanks

    "ALT1_1_clean.fq.gz" change to "ALT_1.R1.fq.gz"
    "ALT1_2_clean.fq.gz" change to "ALT_1.R2.fq.gz"
    "ALT2_1_clean.fq.gz" change to "ALT_2.R1.fq.gz"
    "ALT2_2_clean.fq.gz" change to "ALT_2.R2.fq.gz"
    .
    .
    .
bash awk • 3.8k views
5
6.5 years ago
Eric Lim ★ 2.2k

There are countless ways to do this in bash, but I prefer to write simple rules in Snakemake.

# mvfq.py
# Target rule: lists every renamed file we want (extend samples to cover all libraries).
rule all:
    input: expand('{samples}.{reads}.fq.gz', samples=['ALT_1', 'ALT_2'], reads=['R1', 'R2'])

rule move_fqs:
    output: mvto = '{sample}.{read}.fq.gz'
    run:
        # Rebuild the original name, e.g. sample ALT_1 and read R1 -> ALT1_1_clean.fq.gz
        mvfrom = '_'.join([wildcards.sample.replace('_',''), wildcards.read.replace('R',''), 'clean.fq.gz'])
        shell('mv {mvfrom} {output.mvto}')

I can do a dry run

snakemake -s mvfq.py --dryrun

or run a specific target to make sure everything is working

snakemake -s mvfq.py ALT_1.R1.fq.gz

or run it all on my laptop

snakemake -s mvfq.py

or run it using 4 cores

snakemake -s mvfq.py -j4

or on a cluster via qsub, with up to 100 jobs at a time

snakemake -s mvfq.py -j100 --cluster "qsub"

or using remote files on S3 (or Dropbox, Google Drive, etc.) on a cluster

snakemake -s mvfq.py -j100 --cluster "qsub" --default-remote-provider S3 --default-remote-prefix s3/location/

or I can resume from the last point of failure, and much more.

All without changing the underlying code.

4
6.5 years ago
for F in *_clean.fq.gz; do mv -- "$F" "$(echo "$F" | sed 's/_\([12]\)_clean\.fq\.gz$/.R\1.fq.gz/;s/^ALT/ALT_/')"; done
4
6.5 years ago
igor 13k

The easiest and most readable option (in my opinion):

rename ALT ALT_ *.fq.gz
rename _1_clean .R1 *.fq.gz
rename _2_clean .R2 *.fq.gz

Unfortunately, the rename utility may not be available on all systems.

3
6.5 years ago
h.mon 35k

Honestly, change the source code of the pipeline. If this is not possible, here is a one-liner using rename (which, as igor noted, may not be available or installed on some systems):

rename 's/(\d)_(\d)_clean\.fq\.gz/_$1.R$2.fq.gz/' *.gz

Note the single quotes ('): if you use double quotes ("), the shell expands $1 and $2 before rename sees them, and the captures will not work. As batch renaming can have catastrophic consequences, I suggest you first perform a dry run with -n, check that everything is good to go, and then do the actual renaming without -n.

1

And to make things even more complicated, the rename tool linked by igor in another answer is not the same as the rename tool in this answer, which is available at https://metacpan.org/release/File-Rename, and in the rename package on Debian and related systems.

0

Indeed, good point, which I overlooked. There are several different renames around: this one is a Perl script, while the other one is a binary executable, which Debian and its relatives call rename.ul.

That is a lot of answers for a "how to rename files" question...

0

I guess the code can be shortened further, and extended to handle multi-digit numbers, with:

$ rename -n 's/(\d+)_(\d+)_clean/_$1.R$2/' *.gz
0

To further complicate things, I don't think every rename has the -n flag. Mine (from util-linux-ng) does not.
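
Where rename is missing or has no -n, a dry run can be emulated with a plain shell loop. This is only a minimal sketch, assuming every file follows the <letters><sample>_<read>_clean.fq.gz pattern from the question; the function name new_name and the DRY_RUN variable are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the new name for one file, e.g.
# ALT1_1_clean.fq.gz -> ALT_1.R1.fq.gz. Prints nothing if the
# name does not follow the expected pattern.
new_name() {
    printf '%s\n' "$1" |
        sed -n 's/^\([A-Za-z]\{1,\}\)\([0-9]\{1,\}\)_\([0-9]\{1,\}\)_clean\.fq\.gz$/\1_\2.R\3.fq.gz/p'
}

# DRY_RUN=1 (the default in this sketch) only prints the plan.
DRY_RUN=${DRY_RUN:-1}

for f in *_clean.fq.gz; do
    [ -e "$f" ] || continue                  # the glob matched nothing
    target=$(new_name "$f")
    if [ -z "$target" ]; then
        echo "skipping unexpected name: $f" >&2
        continue
    fi
    if [ "$DRY_RUN" = 1 ]; then
        echo "would rename: $f -> $target"
    else
        mv -n -- "$f" "$target"              # -n: never overwrite an existing file
    fi
done
```

Saved as, say, rename_fq.sh (a hypothetical file name), running `sh rename_fq.sh` previews the changes and `DRY_RUN=0 sh rename_fq.sh` performs them.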

2
6.5 years ago

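Assuming that all the files follow the same pattern (especially the digit_digit part). Note that this copies the files instead of renaming them; replace cp with mv once the copies look right.

$  parallel cp {} '{= s:([0-9]+)_([0-9]+)_clean:_$1\.R$2: =}' ::: *.gz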
1
6.5 years ago

---- corrected answer----

Try brename, a practical cross-platform command-line tool for safely batch renaming files/directories via regular expression.

$ brename -p "(\d+)_(\d+)_clean" -r "_\$1.R\$2"
[INFO] checking: [ ok ] 'ALT1_1_clean.fq.gz' -> 'ALT_1.R1.fq.gz'
[INFO] checking: [ ok ] 'ALT1_2_clean.fq.gz' -> 'ALT_1.R2.fq.gz'
[INFO] checking: [ ok ] 'ALT2_1_clean.fq.gz' -> 'ALT_2.R1.fq.gz'
[INFO] checking: [ ok ] 'ALT2_2_clean.fq.gz' -> 'ALT_2.R2.fq.gz'
[INFO] 4 path(s) to be renamed
[INFO] renamed: 'ALT1_1_clean.fq.gz' -> 'ALT_1.R1.fq.gz'
[INFO] renamed: 'ALT1_2_clean.fq.gz' -> 'ALT_1.R2.fq.gz'
[INFO] renamed: 'ALT2_1_clean.fq.gz' -> 'ALT_2.R1.fq.gz'
[INFO] renamed: 'ALT2_2_clean.fq.gz' -> 'ALT_2.R2.fq.gz'
[INFO] 4 path(s) renamed
0

That is not quite what OP wanted.

0

Sorry for my carelessness, it's fixed.

0

No worries. Your software is always comprehensive. Nice that you have a sanity check built in before the changes are made. I assume the software stops if a check fails?

0

Right, it detects potential conflicts (overwriting existing paths and overwriting newly renamed paths) and errors (blank targets).
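
For anyone doing the renaming with a plain shell loop instead, the same three checks can be approximated before anything is moved. The sketch below is not how brename implements them; check_plan is a hypothetical function that reads "old<TAB>new" pairs on standard input and assumes file names without tabs or newlines.

```shell
#!/bin/sh
# Hypothetical pre-flight check: reads "old<TAB>new" pairs on stdin,
# prints every problem found, and returns 1 if the plan is unsafe.
check_plan() {
    tab=$(printf '\t')
    nl='
'
    seen=$nl        # newline-delimited list of targets seen so far
    status=0
    while IFS=$tab read -r old new; do
        if [ -z "$new" ]; then
            echo "error: blank target for '$old'"
            status=1
            continue
        fi
        if [ -e "$new" ]; then
            echo "conflict: '$new' already exists"
            status=1
        fi
        case $seen in
            *"$nl$new$nl"*)
                echo "conflict: '$new' is the target of more than one file"
                status=1
                ;;
        esac
        seen=$seen$new$nl
    done
    return "$status"
}
```

A plan can be piped in, for example `printf 'ALT1_1_clean.fq.gz\tALT_1.R1.fq.gz\n' | check_plan && echo "safe to rename"`, and the mv commands run only when the check succeeds.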
