Question

Rename FASTA headers based on filename

0

5.6 years ago

SaltedPork ▴ 170

Hi

My FASTA headers look like:

>1570-13.segment.flu1_PB2
>1570-13.segment.flu2_PB1
>1570-13.segment.flu3_PA

etc

Filenames look like:

201301234.fasta

I want to have FASTA headers that look like:

>201301234_PB2
>201301234_PB1
>201301234_PA

I have seen this answer: "Change header of a Fasta file according to the file name". How can I modify it to preserve the _PB2... suffix?

bash • 3.6k views

ADD COMMENT • link updated 5.6 years ago by AK ★ 2.2k • written 5.6 years ago by SaltedPork ▴ 170

3

5.6 years ago

bari.ballew ▴ 470

Just using bash:

for i in *fasta; do n="${i%.fasta}"; sed -i.bak "s/>[^_]\+/>$n/" "$i"; done

This loops over all files in the current directory that end with "fasta". For each file:

n="${i%.fasta}" removes the .fasta file extension (can be generalized to any extension by using n="${i%.*}")
sed "s/>[^_]\+/>$n/" matches a ">" followed by one or more characters that are not underscores, and replaces that match with ">" plus the filename minus extension found in the previous step. Everything from the first underscore onwards (e.g. _PB2) is left untouched. Note that \+ is a GNU sed extension. Depending on your requirements, you may want to tighten up this regex.
The -i.bak part just tells sed to replace the string in the original file, but make a backup called <originalname>.bak.
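
The steps above can be tried on a throwaway file before touching real data. This is a minimal sketch, not the answer's exact command: it assumes a sed that accepts -E (GNU and BSD sed both do), anchors the pattern with ^, and uses a made-up two-record sample file named like the one in the question.

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample file, named as in the question.
printf '>1570-13.segment.flu1_PB2\nACGT\n>1570-13.segment.flu2_PB1\nTTGA\n' > 201301234.fasta

for i in *.fasta; do
    n="${i%.fasta}"   # strip the extension, leaving 201301234
    # -E turns on extended regex, so + needs no backslash;
    # ^ restricts the match to the start of a header line.
    sed -E -i.bak "s/^>[^_]+/>$n/" "$i"
done

# Show the rewritten headers.
grep '>' 201301234.fasta
```

The original file is kept as 201301234.fasta.bak, so the change is easy to undo. A filename containing / or & would need escaping before being used in the sed replacement.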

ADD COMMENT • link 5.6 years ago by bari.ballew ▴ 470

1

5.6 years ago

lakhujanivijay 5.9k

Using seqkit : replace

seqkit replace -p '.segment.flu1' -r '' <your_fasta_file>

Explanation

replace = replace name/sequence by regular expression.

-p, --pattern string         search regular expression
-r, --replacement string     replacement. supporting capture variables

ADD COMMENT • link 5.6 years ago by lakhujanivijay 5.9k

score 2 · Accepted Answer · 2019-05-22

2

5.6 years ago

AK ★ 2.2k

Hi SaltedPork,

Try:

for i in $(ls *.fasta); do fname=$(basename ${i} .fasta); perl -pe "s/^>.+_/>${fname}_/g" ${fname}.fasta > reID_${fname}.fasta; done

Edited (better to use *.fasta; see the response from RamRS):

for i in *.fasta; do fname=$(basename "${i}" .fasta); perl -pe "s/^>.+_/>${fname}_/g" "${fname}.fasta" > "reID_${fname}.fasta"; done
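
One difference between this pattern and the [^_] pattern in the sed answer only shows up when a header contains more than one underscore: .+_ is greedy and consumes everything up to the last underscore, while [^_]+ stops at the first. A small sketch of the difference follows; the header is hypothetical, and both patterns are run through sed -E here to keep the example to one tool (Perl's greedy match gives the same result on this input).

```shell
# Hypothetical header with two underscores; not taken from the question.
header='>1570-13.segment.flu4_HA_H3'
fname=201301234

# Stops at the first underscore: everything after it is preserved.
first=$(printf '%s\n' "$header" | sed -E "s/^>[^_]+/>$fname/")

# Greedy: matches up to and including the last underscore.
last=$(printf '%s\n' "$header" | sed -E "s/^>.+_/>${fname}_/")

printf 'first underscore: %s\n' "$first"
printf 'last underscore:  %s\n' "$last"
```

For the headers shown in the question, which contain a single underscore, both patterns give the same output.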

ADD COMMENT • link 5.6 years ago by AK ★ 2.2k

0

I'd recommend for i in *.fasta instead of for i in $(ls *.fasta) - the latter adds a sub-shell where a glob would suffice. Plus, ls can get unpredictable if customized and IIRC filenames can cause a problem with the ls sub-shell method too.

ADD REPLY • link 5.6 years ago by Ram 44k

0

Thanks, RamRS!

"ls can get unpredictable if customized and IIRC filenames can cause a problem with the ls sub-shell method too"

Can you give some examples of this?

ADD REPLY • link 5.6 years ago by AK ★ 2.2k

0

I have a heavily customized shell. My ls is an example. My LSCOLORS setting interferes with the filename here. See sample output:

➜ for f in $(ls *.gz)
> file $f

hs37d5_GRCm38p6.fasta.gz: cannot open `\033[0m\033[38;5;9mhs37d5_GRCm38p6.fasta.gz\033[0m' (No such file or directory)
hs37d5_GRCm38p6.reheader.fasta.gzip.gz: cannot open `\033[38;5;9mhs37d5_GRCm38p6.reheader.fasta.gzip.gz\033[0m' (No such file or directory)
hs37d5_GRCm38p6.reheader.fasta.gz: cannot open `\033[38;5;9mhs37d5_GRCm38p6.reheader.fasta.gz\033[0m' (No such file or directory)
: cannot open `\033[m' (No such file or directory)

➜ for f in $(/bin/ls *.gz)
> file $f

hs37d5_GRCm38p6.fasta.gz: gzip compressed data, extra field
hs37d5_GRCm38p6.reheader.fasta.gz: gzip compressed data, extra field
hs37d5_GRCm38p6.reheader.fasta.gzip.gz: gzip compressed data, from Unix, last modified: Mon May  6 16:15:43 2019

➜ for f in *.gz
> file $f

hs37d5_GRCm38p6.fasta.gz: gzip compressed data, extra field
hs37d5_GRCm38p6.reheader.fasta.gz: gzip compressed data, extra field
hs37d5_GRCm38p6.reheader.fasta.gzip.gz: gzip compressed data, from Unix, last modified: Mon May  6 16:15:43 2019

With respect to filenames causing a problem: if a filename contains whitespace, the output of $(ls) is word-split, so each piece is passed as a separate input, whereas * expands each filename as a single word with its spaces intact. See below:

➜ touch a "b c"

➜ for f in $(/bin/ls *)
> file $f

a: empty
b: cannot open `b' (No such file or directory)
c: cannot open `c' (No such file or directory)

➜ for f in *
> file $f

a: empty
b c: empty
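
The same comparison can be reproduced without relying on any particular shell setup. The sketch below is an illustration under a few assumptions: it runs in a scratch directory, uses `command ls` to bypass any alias or function named ls, and simply counts how many items each loop sees.

```shell
# Scratch directory with one plain name and one name containing a space.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch a "b c"

# Command substitution: the output of ls is split on whitespace.
ls_count=0
for f in $(command ls); do
    ls_count=$((ls_count + 1))
done

# Glob: each filename arrives as a single word.
glob_count=0
for f in *; do
    glob_count=$((glob_count + 1))
done

echo "ls loop saw $ls_count items, glob loop saw $glob_count items"
```

In bash, the loop variable should also be quoted when it is used ("$f"), otherwise it is split again at that point.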