Question

Combine files that have partial filename matches

0

2.3 years ago

genomes_and_MGEs ▴ 10

Hi there,

I have multiple sequencing reads with the following structure (the below is just an example):

> MSP3_run719_TCATCCTA_S65_L004_R1_001.fastq.gz
> MSP3_run720_TCATCCTA_S45_L001_R1_001.fastq.gz
> MSP3_run720_TCATCCTA_S45_L002_R1_001.fastq.gz
> MSP3_run719_TCATCCTA_S65_L004_R2_001.fastq.gz
> MSP3_run720_TCATCCTA_S45_L001_R2_001.fastq.gz
> MSP3_run720_TCATCCTA_S45_L002_R2_001.fastq.gz

My goal is to combine reads that have the same names in columns 1 and 6 (delimiter _), and have an output as follows:

> MSP3_R1.fastq.gz
> MSP3_R2.fastq.gz

I tried to run the following command, but it didn't work:

for file in MSP3_*_R1_*.fastq.gz ; do cat "$file" >>"${file%_*}.comb" ; done

Can someone help me out? Thanks!

Sequence • 775 views

ADD COMMENT • link updated 2.3 years ago by Pierre Lindenbaum 166k • written 2.3 years ago by genomes_and_MGEs ▴ 10

1

I think you just need ${file%%_*}, with two %, to get the "MSP3" part:

for file in MSP3_*_R1_*.fastq.gz ; do cat "$file" >>"${file%%_*}_R1.comb" ; done

(Or whatever file extension you want.)

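With one %, the shell removes the shortest suffix that matches the pattern (so it cuts at the last _), while with two it removes the longest matching suffix (so it cuts at the first _). The same idea applies at the front of the string with # and ##.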

$ x="one_two_three"
$ echo ${x%_*}
one_two
$ echo ${x%%_*}
one
$ echo ${x#*_}
two_three
$ echo ${x##*_}
three
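
Applied to one of the filenames from the question, the same expansions show why the original loop wrote a separate .comb file per input. This is only a sketch; the variable names short and sample are made up for illustration:

```shell
file="MSP3_run719_TCATCCTA_S65_L004_R1_001.fastq.gz"

# Single %: strips only the shortest suffix matching "_*", i.e. "_001.fastq.gz"
short="${file%_*}"
echo "$short"     # MSP3_run719_TCATCCTA_S65_L004_R1

# Double %%: strips the longest suffix matching "_*", leaving the sample name
sample="${file%%_*}"
echo "$sample"    # MSP3
```

Because the single-% result still contains the run, barcode and lane, every input file gets its own output name and nothing is actually merged.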

To handle R1/R2 automatically, you'd probably just use some other string manipulation commands. (I end up abusing cut a lot for that sort of thing.) Just be careful, since that can be a brittle way of coming at it. Like maybe:

for file in MSP3_*_R[12]_*.fastq.gz ; do cat "$file" >> "$(echo "$file" | cut -f 1,6 -d _).comb"; done
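
If you'd rather avoid cut, the same name can probably be built with the parameter expansions shown earlier. The sketch below assumes the R1/R2 tag is always the second-to-last underscore-separated field, as in the example filenames; combined_name is a hypothetical helper, not an existing command:

```shell
# Hypothetical helper: prints "<sample>_<read>", e.g. "MSP3_R2"
combined_name() (
    sample=${1%%_*}                # everything before the first "_"
    rest=${1%_*}                   # drop the trailing "_001.fastq.gz"
    echo "${sample}_${rest##*_}"   # last remaining field is R1 or R2
)

# Loop over both R1 and R2 files for the sample
for file in MSP3_*_R[12]_*.fastq.gz; do
    [ -e "$file" ] || continue     # skip the literal pattern if nothing matched
    cat "$file" >> "$(combined_name "$file").fastq.gz"
done
```

This is just as brittle as the cut version if the naming scheme changes, so it's worth checking a few names with echo before appending anything.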

ADD REPLY • link 2.3 years ago by Jesse ▴ 870

score 0 · Answer 1 · 2023-01-25

0

2.3 years ago

mohammadhassanj ▴ 260

Hi

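 for file in *_R[12]_*.fastq.gz; do
     # Output name from fields 1 (sample) and 6 (R1/R2), e.g. MSP3_R1
     newFile=$(echo "$file" | awk -F'_' '{print $1"_"$6}')
     # The narrower glob skips already-combined files if the loop is rerun
     cat "$file" >> "$newFile.fastq.gz"
 done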
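
Appending works because concatenated gzip files form a valid multi-member gzip stream. A quick way to convince yourself is to compare line counts before and after. The sketch below runs the same idea as the loop above on two tiny dummy files in a temporary directory (the read contents are invented, not real data):

```shell
# Work in a throwaway directory with dummy data (not real reads)
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '@r1\nACGT\n+\nIIII\n' | gzip > MSP3_run719_TCATCCTA_S65_L004_R1_001.fastq.gz
printf '@r2\nTTGA\n+\nIIII\n' | gzip > MSP3_run720_TCATCCTA_S45_L001_R1_001.fastq.gz

# Append each lane file to <field1>_<field6>.fastq.gz
for file in *_R[12]_*.fastq.gz; do
    newFile=$(echo "$file" | awk -F'_' '{print $1"_"$6}')
    cat "$file" >> "$newFile.fastq.gz"
done

# Line counts of the inputs and of the combined file should agree
in_lines=$(gzip -dc MSP3_run*_R1_*.fastq.gz | wc -l | tr -d ' ')
out_lines=$(gzip -dc MSP3_R1.fastq.gz | wc -l | tr -d ' ')
echo "input: $in_lines  combined: $out_lines"
```

On real data the same two gzip -dc ... | wc -l commands can be run against the original lane files and the combined file.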

ADD COMMENT • link 2.3 years ago by mohammadhassanj ▴ 260