Concatenating 4 files into 1
22 months ago
Roland ▴ 20

Hi.

I'm trying to concatenate 4 files into one. This is what my raw data looks like:

> S9_L001_R1_001_1P.fq.gz  

> S9_L001_R1_001_1U.fq.gz

> S9_L001_R1_001_2P.fq.gz 

> S9_L001_R1_001_2U.fq.gz

> S10_L001_R1_001_1P.fq.gz  

> S10_L001_R1_001_1U.fq.gz

> S10_L001_R1_001_2P.fq.gz 

> S10_L001_R1_001_2U.fq.gz

I have twenty samples (S1-S20), and each sample consists of four files (1P, 1U, 2P and 2U). The code I've come up with, which doesn't work, looks like this:

for i in {1..20}; 
do for j in 1 2; 
do cat S${i}_L001_R1_001_${j}*.fq.gz >S${i}_concatenate.fq.gz; done; done

It only concatenates two of the four files from each sample.

Any suggestions? Thanks.

Concatenate • 1.5k views

I hope there is a reason you are trying to cat these together. Based on the names it looks like these are properly paired and unpaired reads after trimming.

Your code is ignoring the 1P, 1U, 2P and 2U in the names. What order do you want to concatenate those pieces in?


Since I'm not mapping the reads to a reference genome or building my own, I figured I might as well treat them as single end reads.

I don't think it matters what order I map them in, but I guess 1P-1U-2P-2U.

22 months ago
DavidStreid ▴ 90

Change the > to >> in the inner loop:

  • > writes a new file, overwriting anything already there
  • >> creates the file if it does not exist, and appends to it if it does

Your code only writes the two S${i}_L001_R1_001_2*.fq.gz files for any given i, because the second pass through the inner loop overwrites the output of the S${i}_L001_R1_001_1*.fq.gz files.
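
To see the difference in isolation, here is a minimal sketch using a throwaway temporary file (the variable names are made up for the illustration and are not part of the original loop):

```shell
#!/usr/bin/env bash
# Throwaway demo file created by mktemp; nothing here touches real data.
demo=$(mktemp)

echo "pass 1" > "$demo"    # > truncates the file, then writes
echo "pass 2" > "$demo"    # > truncates again, so "pass 1" is lost
overwrite_result=$(cat "$demo")

echo "pass 1" > "$demo"    # start over with a fresh file
echo "pass 2" >> "$demo"   # >> appends, so both lines survive
append_result=$(cat "$demo")

printf 'with >:\n%s\n\nwith >>:\n%s\n' "$overwrite_result" "$append_result"
rm -f "$demo"
```

The first half mirrors what happens to the j=1 output when the j=2 pass runs with a plain >.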

for i in {1..20}; do 
  for j in 1 2; do
    # ONLY CHANGE: ">" => ">>"
    cat S${i}_L001_R1_001_${j}*.fq.gz >> S${i}_concatenate.fq.gz;
  done
done
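
One thing to watch with >>: if the loop is run a second time, it appends to the output left over from the first run. A possible guard, sketched below, is to empty the output file once per sample before the inner loop. The scratch directory, the stand-in files and the concat_sample helper are assumptions made for the demo, not part of the original answer.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Demo setup: scratch directory with tiny stand-in files for one sample (S1).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"
for part in 1P 1U 2P 2U; do
  echo "reads from $part" | gzip > "S1_L001_R1_001_${part}.fq.gz"
done

# Hypothetical helper: concatenate the four files of sample number $1.
concat_sample() {
  local i=$1
  local out="S${i}_concatenate.fq.gz"
  : > "$out"    # truncate once per sample, before the inner loop
  for j in 1 2; do
    cat "S${i}_L001_R1_001_${j}"*.fq.gz >> "$out"
  done
}

concat_sample 1
concat_sample 1    # a second run replaces the output instead of doubling it

# Concatenated gzip members decompress as one continuous stream.
gzip -dc S1_concatenate.fq.gz
```

On real data the setup block would be dropped and the helper called inside the existing for i in {1..20} loop.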

Thank you so much! This worked.


Good luck, np!

22 months ago
Mensur Dlakic ★ 28k

I am all for writing code to support tedious tasks, and I hope you get your answer. That said, it seems easier to type cat and paste 10 names, and do so twice, than to wait for responses here.

From what I can tell, the only thing that needs changing is * to ?

for i in {1..20};
do for j in 1 2;
do cat S${i}_L001_R1_001_${j}?.fq.gz > S${i}_concatenate.fq.gz; done; done

When in doubt, I suggest you put an echo command in front of your actual command. It will print everything on screen without executing it, so it may be easier to troubleshoot what is wrong.

for i in {1..20};
do for j in 1 2;
do echo "cat S${i}_L001_R1_001_${j}?.fq.gz > S${i}_concatenate.fq.gz" ; done; done
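
Because the whole command is inside double quotes, the echo above prints the ? pattern literally rather than the files it matches. If you also want to see which files the wildcard picks up, one option (a sketch, run here against empty stand-in files in a scratch directory; dry_run is a made-up helper name) is to leave the pattern unquoted and quote only the >:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Demo setup: scratch directory with empty stand-in files for one sample (S1).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"
touch S1_L001_R1_001_{1P,1U,2P,2U}.fq.gz

# Hypothetical helper: print the command for sample $1 and read number $2.
dry_run() {
  local i=$1 j=$2
  # The pattern is unquoted so the shell expands it to the matching files;
  # ">" is quoted so it is printed instead of acting as a redirection.
  echo cat S${i}_L001_R1_001_${j}?.fq.gz ">" "S${i}_concatenate.fq.gz"
}

dry_run 1 1
dry_run 1 2
```

If no file matches, bash leaves the pattern as typed by default, which is itself a useful hint that a name is off.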

Maybe this will do the trick:

for i in {1..20};
do cat S${i}_L001_R1_001_??.fq.gz > S${i}_concatenate.fq.gz; done
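
With a single loop the output is written once per sample, so nothing gets overwritten. The ?? pattern is expanded in sorted order, which for these names works out to 1P, 1U, 2P, 2U. If you would rather spell that order out, brace expansion is one way to do it. This is a sketch on stand-in files for the two samples from the question, made in a scratch directory:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Demo setup: scratch directory with tiny stand-in files for S9 and S10.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"
for i in 9 10; do
  for part in 1P 1U 2P 2U; do
    echo "S${i} ${part}" | gzip > "S${i}_L001_R1_001_${part}.fq.gz"
  done
done

for i in 9 10; do
  # Brace expansion lists the four files in exactly this order; unlike a
  # wildcard, cat reports an error if one of them is missing.
  cat "S${i}_L001_R1_001_"{1P,1U,2P,2U}.fq.gz > "S${i}_concatenate.fq.gz"
done

gzip -dc S10_concatenate.fq.gz
```

For the real data, the setup block goes away and the loop header becomes for i in {1..20}.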

What do ? vs. ?? do?

Ah, just tried it. The ? is very helpful as a wildcard, thank you.


Thank you for your help. I'm currently working with some "test" samples in preparation for my real data which consists of well over 200 samples, so that's why I'd like to have it automated!
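
For a larger batch, one option is to take the sample names from the files themselves instead of hard-coding a range like {1..20}. The sketch below assumes every sample follows the same naming pattern as the two shown in the question and has a 1P file; the S200 stand-in is invented for the demo.

```shell
#!/usr/bin/env bash
set -euo pipefail
shopt -s nullglob    # a pattern with no matches expands to nothing

# Demo setup: scratch directory with tiny stand-in files for three samples.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"
for s in S9 S10 S200; do
  for part in 1P 1U 2P 2U; do
    echo "$s $part" | gzip > "${s}_L001_R1_001_${part}.fq.gz"
  done
done

# Use each sample's 1P file to discover the sample name.
for first in S*_L001_R1_001_1P.fq.gz; do
  sample=${first%%_*}    # strip everything from the first "_": S9, S10, ...
  cat "${sample}_L001_R1_001_"??.fq.gz > "${sample}_concatenate.fq.gz"
  echo "wrote ${sample}_concatenate.fq.gz"
done
```

This way gaps in the numbering or extra samples do not need any change to the loop.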
