concatenate all the files with matching prefix in a directory of several files
3.8 years ago

Hi, I have several files in a folder with a different prefix. For example:

HCOMB0001_ATTACTC-TATAGCC_L001_R1.fastq

HCOMB0001_ATTACTC-TATAGCC_L002_R1.fastq

HCOMB0001_ATTACTC-TATAGCC_L002_R2.fastq

HCOMB0002_ATTACTC-ATAGAGG_L001_R1.fastq

HCOMB0002_ATTACTC-ATAGAGG_L002_R1.fastq

HCOMB0002_ATTACTC-ATAGAGG_L002_R2.fastq

HCOMB0003_ATTACTC-CCTATCC_L001_R1.fastq

HCOMB0003_ATTACTC-CCTATCC_L002_R1.fastq

HCOMB0003_ATTACTC-CCTATCC_L002_R2.fastq

. . .

These are sequencing output files that I have already processed, so L001/L002 and R1/R2 no longer mean anything (I am not good at renaming files after processing). All I want is for all files starting with 'HCOMB0001' to be merged into one single file, and likewise for the other prefixes. I looked up many solutions, but most of them are based on read names, lane names, etc., and did not work for me since those parameters do not matter in my case. Any Ubuntu bash script to perform the operation would be really appreciated. Thanks for any help!

concatenate merge • 2.8k views

Try this (the echo in the first loop only passes each prefix down the pipeline, so leave it in place):

for i in *R1.fastq; do echo "${i%%_*}"; done | sort -u | while read -r line; do cat "$line"_*.fastq > "combined/$line.fastq"; done

Before running it, create a directory (e.g. combined) in the folder where the FASTQ files are located; the output is written to that directory.
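The same idea can be sketched as a small function that creates the output directory itself. This is only an illustration: the name merge_by_prefix and its two arguments are hypothetical, and it assumes the sample ID is everything before the first underscore and that file names contain no whitespace.

```bash
# Hypothetical helper: merge every <prefix>_*.fastq found in INDIR
# into OUTDIR/<prefix>.fastq. Assumes the prefix is the text before
# the first underscore and that names contain no whitespace.
merge_by_prefix() {
  local indir=$1 outdir=$2 f prefix
  mkdir -p "$outdir"
  # List the unique prefixes, then concatenate each group in glob (sorted) order.
  for f in "$indir"/*_*.fastq; do
    [ -e "$f" ] && basename "$f"      # skip the literal pattern if nothing matched
  done | cut -d_ -f1 | sort -u |
  while read -r prefix; do
    cat "$indir/$prefix"_*.fastq > "$outdir/$prefix.fastq"
  done
}

# Usage (from the folder holding the FASTQ files):
# merge_by_prefix . combined
```

Because the output is rewritten with > for each prefix, running it twice gives the same result instead of doubling the merged files.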

3.8 years ago

A little bash for loop. Someone will probably post something more clever than this later.

for id in $(find . -name "*_*.fastq" | cut -f1 -d_ | sort -u); do
  # match only inputs (they contain an underscore), so reruns skip earlier outputs
  cat $(find . -wholename "${id}_*.fastq" | sort) > "${id}.fastq"
done

Thanks, worked flawlessly!

3.8 years ago
Malcolm.Cook ★ 1.5k

If you have it installed already, GNU parallel is so useful for writing one-liners like this (untested)

parallel -j 1 cat {} '>>' '{= s/(HCO\w+).*\.fastq/$1.fq/ =}' ::: *.fastq

Note it is important to provide -j 1 to ensure you are in fact not running jobs in parallel but just using parallel for its implicit looping and syntax of job specification. Were you to do otherwise you would risk the lines of multiple fastqs becoming interwoven in the output, arguably not the desired result.

Also note the quoting of the output redirection operator, '>>', is needed to ensure the redirection occurs within each iteration of parallel's implicit loop.

Finally, note as written it uses .fq as output file extension just so your outputs and inputs don't get confused in the same directory should you run this more than once (which if for some reason you did, you would want to first rm *.fq to avoid appending new results to old).
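For readers without GNU parallel, the three points above (one file at a time, appending with >>, clearing old .fq outputs first) can be sketched as a plain bash loop. The function name append_by_prefix is hypothetical, and as a simplifying assumption it keys each output on the text before the first underscore rather than on the regular expression used above.

```bash
# Hypothetical helper: append each *.fastq in DIR to DIR/<prefix>.fq,
# strictly one file at a time. Assumes the prefix is the text before
# the first underscore.
append_by_prefix() {
  local dir=$1 f name
  rm -f "$dir"/*.fq                     # start clean so a rerun never appends to old results
  for f in "$dir"/*.fastq; do
    [ -e "$f" ] || continue             # nothing matched the glob
    name=${f##*/}                       # strip the directory part
    cat "$f" >> "$dir/${name%%_*}.fq"   # sequential append: records cannot interleave
  done
}

# Usage:
# append_by_prefix .
```

The loop is sequential by construction, which is the property that -j 1 enforces in the parallel version.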

(edit: incorporate @ole.tange's remarks)


If you omit -j, GNU Parallel defaults to running one job per CPU thread, similar to -j $(nproc). Also, you should probably use >> and not >.


agreed. thanks. edited accordingly.


Great idea and solution, thanks!

3.8 years ago
ole.tange ★ 4.5k

You are looking for https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Aggregating-content-of-files

parallel 'eval cat {= s/L001_R1/*/ =} > {= s/_L001_R1//=}' ::: *L001_R1.fastq

This will run:

cat HCOMB0001_ATTACTC-TATAGCC_*.fastq > HCOMB0001_ATTACTC-TATAGCC.fastq

Check that it does what you want before running it for real with:

parallel --dry-run ...

It will run one job per CPU thread. This is fine if your disk is fast. If not, lower the number of jobs run in parallel:

parallel -j1 ...
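After a real run, one possible sanity check is to compare the line count of a merged file with the total of its inputs. The sketch below is not part of GNU parallel; check_merge is a hypothetical helper, and the record count it prints assumes standard four-line FASTQ records.

```bash
# Hypothetical helper: check_merge MERGED INPUT...
# Succeeds when MERGED has exactly as many lines as all INPUT files together.
check_merge() {
  local merged=$1; shift
  local expected actual
  expected=$(cat "$@" | wc -l)
  actual=$(wc -l < "$merged")
  expected=$((expected)); actual=$((actual))   # normalise any padding from wc
  if [ "$expected" -eq "$actual" ]; then
    # Assumes 4 lines per FASTQ record.
    echo "OK: $merged has $actual lines ($((actual / 4)) FASTQ records)"
  else
    echo "MISMATCH: $merged has $actual lines, inputs have $expected" >&2
    return 1
  fi
}

# Usage, with the merged file written somewhere the input glob does not match:
# check_merge merged/HCOMB0001.fastq HCOMB0001_*.fastq
```

A matching count does not prove the content is identical, but it catches truncated or doubled outputs cheaply.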