Cat many paired fasta files in directory in terminal
1
1
Entering edit mode
2.8 years ago
Laura ▴ 50

Hello!

Today I am working on concatenating paired files. In a directory, I have 300 paired files. Here's an example of the file names:

flL1_5495_L1PA6_reactivating_recurring_12_2bit_c.bed.fa.revcom.fa
flL1_5495_L1PA6_reactivating_recurring_12_2bit_plus.bed.fa
flL1_5495_L1PA6_reactivating_recurring_13_2bit_c.bed.fa.revcom.fa
flL1_5495_L1PA6_reactivating_recurring_13_2bit_plus.bed.fa
flL1_5495_L1PA8A_reactivating_recurring_03_2bit_c.bed.fa.revcom.fa
flL1_5495_L1PA8A_reactivating_recurring_03_2bit_plus.bed.fa
flL1_5495_L1PA8A_reactivating_recurring_04_2bit_c.bed.fa.revcom.fa
flL1_5495_L1PA8A_reactivating_recurring_04_2bit_plus.bed.fa

The file names are identical except for a family name (L1PA6 and L1PA8A in this example, but there are a few more), a level (12, 13, 03, and 04 in this example, levels range from 03 to 38 ultimately), and whether they are "plus" or "revcom". There is a matching revcom file for each plus file which I would like to concatenate. I have been working with a nested bash loop like so:

#!/bin/bash

subfamilies=( \
  L1HS  L1PA2 L1PA3 L1PA4  L1PA5 \
  L1PA6 L1PA7 L1PA8 L1PA8A L1PA10 )

recurrence=( \
  03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 \
  19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 \
  35 36 37 38 ) 

for subfam in ${subfamilies[@]}; do
  for level in ${recurrence[@]}; do
  cat *${subfamilies[@]}_reactivating_recurring_${recurrence[@]}*.fa \
  > ${subfam}${recurrence}.fa
  done
done

I am trying to concatenate based on shared family names and levels, but my output is a mess. I get a lot of empty files. I'd like the output name to be simpler, something like "L1PA6_03.fa"

Maybe there's a better way to do this?

Thanks in advance!

bash linux cat terminal fasta • 1.2k views
ADD COMMENT
1
Entering edit mode

Don't hard code variables when you can use regex:

$ for i in *revcom*.fa; do echo ${i%%_2bit*}*.fa ${i%%_2bit*};done

flL1_5495_L1PA6_reactivating_recurring_12_2bit_c.bed.fa.revcom.fa flL1_5495_L1PA6_reactivating_recurring_12_2bit_plus.bed.fa flL1_5495_L1PA6_reactivating_recurring_12
flL1_5495_L1PA6_reactivating_recurring_13_2bit_c.bed.fa.revcom.fa flL1_5495_L1PA6_reactivating_recurring_13_2bit_plus.bed.fa flL1_5495_L1PA6_reactivating_recurring_13
flL1_5495_L1PA8A_reactivating_recurring_03_2bit_c.bed.fa.revcom.fa flL1_5495_L1PA8A_reactivating_recurring_03_2bit_plus.bed.fa flL1_5495_L1PA8A_reactivating_recurring_03
flL1_5495_L1PA8A_reactivating_recurring_04_2bit_c.bed.fa.revcom.fa flL1_5495_L1PA8A_reactivating_recurring_04_2bit_plus.bed.fa flL1_5495_L1PA8A_reactivating_recurring_04

with parallel (in bash):

$ parallel --plus echo {=s/_2bit.*//=}*.fa {=s/_2bit.*//=}.fasta ::: *revcom*.fa

flL1_5495_L1PA6_reactivating_recurring_12_2bit_c.bed.fa.revcom.fa flL1_5495_L1PA6_reactivating_recurring_12_2bit_plus.bed.fa flL1_5495_L1PA6_reactivating_recurring_12.fasta
flL1_5495_L1PA6_reactivating_recurring_13_2bit_c.bed.fa.revcom.fa flL1_5495_L1PA6_reactivating_recurring_13_2bit_plus.bed.fa flL1_5495_L1PA6_reactivating_recurring_13.fasta
flL1_5495_L1PA8A_reactivating_recurring_03_2bit_c.bed.fa.revcom.fa flL1_5495_L1PA8A_reactivating_recurring_03_2bit_plus.bed.fa flL1_5495_L1PA8A_reactivating_recurring_03.fasta
flL1_5495_L1PA8A_reactivating_recurring_04_2bit_c.bed.fa.revcom.fa flL1_5495_L1PA8A_reactivating_recurring_04_2bit_plus.bed.fa flL1_5495_L1PA8A_reactivating_recurring_04.fasta
ADD REPLY
0
Entering edit mode

Thank you, I appreciate the education on variables/regex.

ADD REPLY
0
Entering edit mode

your 'level' from the second loop is not used in your script?

(in stead you still use ${recurrence[@]} )

did not (yet) put much thought in it but shouldn't you need to use $subfam and $recurrence (which should be $level according to your variable names) in stead of ${subfamilies[@]} and ${recurrence[@]} )

Moreover, for for instance flL1_5495_L1PA6 , you only have _12 and _13 , so all other numbers from your recurrence will indeed give empty files.

ADD REPLY
3
Entering edit mode
2.8 years ago
Charles Plessy ★ 2.9k

How about something like:

for baseName in $(ls f* | sed 's/_2bit.*//' | uniq)
do
  cat ${baseName}* > $baseName.fa
done

The idea is that the common part in the name of the files that have to be paired is everything before _2bit. So we list all names, truncate them with sed, and remove redundant entries. Enclosing this command in $(), we pass it as a plain space-separated string to for, and assign a baseName variable to loop on. The expression ${baseName}* expand to the name of the two files to join together, and we reuse the $baseName to construct the output file name.

ADD COMMENT
0
Entering edit mode

indeed, that looks about right to me as well.

ADD REPLY
0
Entering edit mode

Thank you so much! This is such an improvement. I appreciate the description too :)

ADD REPLY

Login before adding your answer.

Traffic: 2556 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6