22 months ago
genomes_and_MGEs
Hi there,
I have multiple sequencing reads with the following structure (the below is just an example):
> MSP3_run719_TCATCCTA_S65_L004_R1_001.fastq.gz
> MSP3_run720_TCATCCTA_S45_L001_R1_001.fastq.gz
> MSP3_run720_TCATCCTA_S45_L002_R1_001.fastq.gz
> MSP3_run719_TCATCCTA_S65_L004_R2_001.fastq.gz
> MSP3_run720_TCATCCTA_S45_L001_R2_001.fastq.gz
> MSP3_run720_TCATCCTA_S45_L002_R2_001.fastq.gz
My goal is to combine reads that have the same names in columns 1 and 6 (delimiter _), and have an output as follows:
> MSP3_R1.fastq.gz
> MSP3_R2.fastq.gz
I tried to run the following command, but it didn't work:
for file in MSP3_*_R1_*.fastq.gz ; do cat "$file" >>"${file%_*}.comb" ; done
Can someone help me out? Thanks!
I think you just need `${file%%_*}`, with two `%`, to get the "MSP3" part. (Use whatever file extension you want on the output.)

With one `%` the shell strips the shortest match from the end (everything from the last `_` onward), while with two it strips the longest match (everything from the first `_` onward). Same idea in the other direction, working from the start of the string, for `#` and `##`.

To handle R1/R2 automatically, you'd probably just use some other string manipulation commands. (I end up abusing `cut` a lot for that sort of thing.) Just be careful, since that can be a brittle way of coming at it.
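Here's a rough sketch of both points, assuming bash and file names laid out exactly like the ones in the question (sample in field 1, R1/R2 in field 6, `_` as the delimiter). The function name `combine_reads` is made up for illustration, and so is the choice of `.fastq.gz` for the output. It also shows why the original command didn't do what was wanted: `${file%_*}` only drops the trailing `_001.fastq.gz`, so every input ended up with its own `.comb` file.

```shell
#!/usr/bin/env bash

# combine_reads: hypothetical helper (not from the thread). Appends every
# lane/run file in the current directory to <sample>_<R1|R2>.fastq.gz.
combine_reads() {
    local file sample mate
    for file in *_R[12]_*.fastq.gz; do
        [ -e "$file" ] || continue                   # nothing matched the glob
        sample=${file%%_*}                           # longest "_*" suffix removed
        mate=$(printf '%s\n' "$file" | cut -d_ -f6)  # 6th "_"-delimited field
        cat "$file" >> "${sample}_${mate}.fastq.gz"  # gzip members can be concatenated
    done
}

# What the different expansions give for one of the example names:
file=MSP3_run719_TCATCCTA_S65_L004_R1_001.fastq.gz
echo "${file%_*}"    # MSP3_run719_TCATCCTA_S65_L004_R1
echo "${file%%_*}"   # MSP3
echo "${file#*_}"    # run719_TCATCCTA_S65_L004_R1_001.fastq.gz
echo "${file##*_}"   # 001.fastq.gz
```

Run `combine_reads` from the directory holding the files. The glob expands in sorted order, so for names like these the R1 and R2 pieces should be appended in the same run/lane order, but it's worth checking that on your own data. Since it uses `>>`, running it twice appends everything again, so remove old outputs first. And the `cut -d_ -f6` part is the brittle bit: it breaks as soon as a sample name contains an extra `_`.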