Question

concatenate all the files with matching prefix in a directory of several files

0

4.5 years ago

prashantwaiker • 0

Hi, I have several files in a folder with different prefixes. For example:

HCOMB0001_ATTACTC-TATAGCC_L001_R1.fastq

HCOMB0001_ATTACTC-TATAGCC_L002_R1.fastq

HCOMB0001_ATTACTC-TATAGCC_L002_R2.fastq

HCOMB0002_ATTACTC-ATAGAGG_L001_R1.fastq

HCOMB0002_ATTACTC-ATAGAGG_L002_R1.fastq

HCOMB0002_ATTACTC-ATAGAGG_L002_R2.fastq

HCOMB0003_ATTACTC-CCTATCC_L001_R1.fastq

HCOMB0003_ATTACTC-CCTATCC_L002_R1.fastq

HCOMB0003_ATTACTC-CCTATCC_L002_R2.fastq

. . .

These are sequencer output files that I have already processed, so L001/L002 and R1/R2 no longer mean anything (I am not good at renaming files after processing). All I want is for all files whose names start with 'HCOMB0001' to be merged into one single file, and likewise for the other prefixes. I looked up many solutions, but most of them rely on read names, lane names, etc. and did not work for me, since those parameters do not matter in my case. Any Ubuntu bash script to perform the operation would be really appreciated. Thanks for any help!

concatenate merge • 3.6k views

ADD COMMENT • link updated 4.5 years ago by ole.tange ★ 4.5k • written 4.5 years ago by prashantwaiker • 0

0

Try this. The echo in the first loop must stay, since it feeds each prefix into the pipeline:

for i in *R1.fastq; do echo "${i%%_*}"; done | sort -u | while read -r line; do cat "${line}"_*.fastq > "combined/${line}.fastq"; done

First create a directory (e.g. combined) in the folder that holds the FASTQ files; the merged files are written to that directory.

ADD REPLY • link 4.5 years ago by cpad0112 21k

3

4.5 years ago

Malcolm.Cook ★ 1.5k

If you have it installed already, GNU parallel is so useful for writing one-liners like this (untested)

parallel -j 1 cat {} '>>' '{= s/(HCO\w+).*\.fastq/$1.fq/ =}' ::: *.fastq

Note it is important to provide -j 1 to ensure you are in fact not running jobs in parallel but just using parallel for its implicit looping and syntax of job specification. Were you to do otherwise you would risk the lines of multiple fastqs becoming interwoven in the output, arguably not the desired result.

Also note the quoting of the output redirection operator, '>>', is needed to ensure the redirection occurs within each iteration of parallel's implicit loop.

Finally, note that as written it uses .fq as the output file extension, so that your outputs and inputs don't get confused in the same directory should you run this more than once. If you do re-run it, first rm *.fq; otherwise the new results are appended to the old ones.
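For anyone without GNU parallel installed, the same append-per-file idea can be sketched as a plain bash loop. This is only a sketch: the function name merge_to_fq is made up for illustration, and it assumes the sample prefix is everything before the first underscore in the file name (so the outputs are named HCOMB0001.fq and so on, which differs slightly from the regex above).

```bash
#!/usr/bin/env bash
# Sketch only: merge_to_fq is a hypothetical helper name, not from the thread.
# Assumes the sample prefix is the text before the first "_" in each file name.
merge_to_fq() {
  local dir=$1 f name
  # Remove outputs of an earlier run, otherwise >> would append to stale data.
  rm -f "$dir"/*.fq
  for f in "$dir"/*.fastq; do
    [ -e "$f" ] || continue              # the glob matched nothing
    name=${f##*/}                        # strip the directory part
    # Append this input to the per-prefix output, one file at a time.
    cat "$f" >> "$dir/${name%%_*}.fq"
  done
}
```

Calling merge_to_fq . in the FASTQ folder should be safe to repeat, because the cleanup step is part of the function.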

(edit: incorporate @ole.tange's remarks)

ADD COMMENT • link 4.5 years ago by Malcolm.Cook ★ 1.5k

1

By omitting -j GNU Parallel defaults to running one job per cpu thread. So similar to -j $(nproc). Also you should probably use >> and not >.

ADD REPLY • link 4.5 years ago by ole.tange ★ 4.5k

0

agreed. thanks. edited accordingly.

ADD REPLY • link 4.5 years ago by Malcolm.Cook ★ 1.5k

0

Great idea and solution, thanks!

ADD REPLY • link 4.5 years ago by prashantwaiker • 0

1

4.5 years ago

ole.tange ★ 4.5k

You are looking for https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Aggregating-content-of-files

parallel 'eval cat {= s/L001_R1/*/ =} > {= s/_L001_R1//=}' ::: *L001_R1.fastq

This will run:

cat HCOMB0001_ATTACTC-TATAGCC_*.fastq > HCOMB0001_ATTACTC-TATAGCC.fastq

Check that it does what you want before running it for real with:

parallel --dry-run ...

It will run one job per CPU thread. This is fine if your disk is fast. If not, then lower the jobs in parallel:

parallel -j1 ...
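Besides a dry run, one way to confirm the result afterwards is to compare line counts: the merged file should have exactly as many lines as its inputs combined. A minimal sketch, where check_merge is a hypothetical helper name and the file names in the usage line are just the ones from the question:

```bash
#!/usr/bin/env bash
# Sketch only: check_merge is a hypothetical helper, not part of GNU parallel.
# Usage: check_merge MERGED INPUT [INPUT...]
# Succeeds when MERGED has as many lines as all INPUT files together.
check_merge() {
  local merged=$1 expected actual
  shift
  [ "$#" -ge 1 ] || return 2            # need at least one input file
  expected=$(cat "$@" | wc -l)          # total lines across the inputs
  actual=$(wc -l < "$merged")           # lines in the merged file
  (( expected == actual ))
}
```

For example: check_merge HCOMB0001_ATTACTC-TATAGCC.fastq HCOMB0001_ATTACTC-TATAGCC_*.fastq && echo OK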

ADD COMMENT • link 4.5 years ago by ole.tange ★ 4.5k

score 2 · Accepted Answer · 2021-02-02

2

4.5 years ago

rpolicastro 13k

A little bash for loop. Someone will probably post something more clever than this later.

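# Match only inputs with "_" in the name, so merged outputs are ignored on a re-run.
for id in $(find . -name "*_*.fastq" | cut -f1 -d_ | sort -u); do
  cat $(find . -wholename "${id}_*.fastq" | sort) > "${id}.fastq"
done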
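If file names might contain spaces, or you would rather not parse find output, the same loop can be sketched with shell globs only. This is a sketch under assumptions: merge_by_prefix and its two arguments are names made up here, the prefix is taken to be everything before the first underscore, and it needs bash 4 or newer for the associative array.

```bash
#!/usr/bin/env bash
# Sketch only: merge_by_prefix is a hypothetical helper name.
# Usage: merge_by_prefix INPUT_DIR OUTPUT_DIR
merge_by_prefix() {
  local indir=$1 outdir=$2 f name prefix
  local -A seen=()                       # prefixes already merged (bash 4+)
  mkdir -p "$outdir"
  for f in "$indir"/*_*.fastq; do
    [ -e "$f" ] || continue              # the glob matched nothing
    name=${f##*/}                        # strip the directory part
    prefix=${name%%_*}                   # HCOMB0001_..._R1.fastq -> HCOMB0001
    [ -n "$prefix" ] || continue         # skip names that start with "_"
    if [ -z "${seen[$prefix]:-}" ]; then
      seen[$prefix]=1
      # One cat per prefix; ">" overwrites, so re-runs do not duplicate reads.
      cat "$indir/$prefix"_*.fastq > "$outdir/$prefix.fastq"
    fi
  done
}
```

Something like merge_by_prefix . combined would then write combined/HCOMB0001.fastq, combined/HCOMB0002.fastq and so on, with the inputs of each sample concatenated in glob (sorted) order.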