Rename FASTA headers based on filename
5.6 years ago
SaltedPork ▴ 170

Hi

My FASTA headers look like:

>1570-13.segment.flu1_PB2
>1570-13.segment.flu2_PB1
>1570-13.segment.flu3_PA

etc

Filenames look like:

201301234.fasta

I want to have FASTA headers that look like:

>201301234_PB2
>201301234_PB1
>201301234_PA

I have seen this answer: "Change header of a Fasta file according to the file name". How can I modify it to preserve the trailing _PB2, _PB1, etc.?

bash • 3.6k views
5.6 years ago
AK ★ 2.2k

Hi SaltedPork,

Try:

for i in $(ls *.fasta); do fname=$(basename ${i} .fasta); perl -pe "s/^>.+_/>${fname}_/g" ${fname}.fasta > reID_${fname}.fasta; done

Edited (better to use *.fasta; see the response from RamRS):

for i in *.fasta; do fname=$(basename "${i}" .fasta); perl -pe "s/^>.+_/>${fname}_/" "${i}" > "reID_${fname}.fasta"; done

I'd recommend for i in *.fasta instead of for i in $(ls *.fasta) - the latter adds a sub-shell where a glob would suffice. Plus, ls can get unpredictable if customized and IIRC filenames can cause a problem with the ls sub-shell method too.


Thanks, RamRS!

ls can get unpredictable if customized and IIRC filenames can cause a problem with the ls sub-shell method too

Can you give some examples of this?


I have a heavily customized shell. My ls is an example. My LSCOLORS setting interferes with the filename here. See sample output:

➜ for f in $(ls *.gz)
> file $f

hs37d5_GRCm38p6.fasta.gz: cannot open `\033[0m\033[38;5;9mhs37d5_GRCm38p6.fasta.gz\033[0m' (No such file or directory)
hs37d5_GRCm38p6.reheader.fasta.gzip.gz: cannot open `\033[38;5;9mhs37d5_GRCm38p6.reheader.fasta.gzip.gz\033[0m' (No such file or directory)
hs37d5_GRCm38p6.reheader.fasta.gz: cannot open `\033[38;5;9mhs37d5_GRCm38p6.reheader.fasta.gz\033[0m' (No such file or directory)
: cannot open `\033[m' (No such file or directory)

➜ for f in $(/bin/ls *.gz)
> file $f

hs37d5_GRCm38p6.fasta.gz: gzip compressed data, extra field
hs37d5_GRCm38p6.reheader.fasta.gz: gzip compressed data, extra field
hs37d5_GRCm38p6.reheader.fasta.gzip.gz: gzip compressed data, from Unix, last modified: Mon May  6 16:15:43 2019

➜ for f in *.gz
> file $f

hs37d5_GRCm38p6.fasta.gz: gzip compressed data, extra field
hs37d5_GRCm38p6.reheader.fasta.gz: gzip compressed data, extra field
hs37d5_GRCm38p6.reheader.fasta.gzip.gz: gzip compressed data, from Unix, last modified: Mon May  6 16:15:43 2019

With respect to filenames causing a problem: if a filename contains whitespace, the unquoted output of $(ls) is word-split, so the loop sees each piece as a separate item, whereas the * glob yields each filename as a single word. See below:

➜ touch a "b c"

➜ for f in $(/bin/ls *)
> file $f

a: empty
b: cannot open `b' (No such file or directory)
c: cannot open `c' (No such file or directory)

➜ for f in *
> file $f

a: empty
b c: empty
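
The same difference can be checked non-interactively by counting loop iterations. This is a small sketch; the function names count_with_ls and count_with_glob are made up for illustration, and command ls is used to bypass any alias or shell function named ls.

```shell
# Hypothetical helpers: count how many items a for loop sees in directory $1.

# Unquoted command substitution: the ls output is split on whitespace.
count_with_ls() {
    n=0
    for f in $(command ls "$1"); do
        n=$((n + 1))
    done
    echo "$n"
}

# Glob: each filename arrives as one word, whitespace included.
count_with_glob() {
    n=0
    for f in "$1"/*; do
        [ -e "$f" ] || continue      # the glob matched nothing
        n=$((n + 1))
    done
    echo "$n"
}
```

For a directory holding the two files a and "b c" from the example, the first function counts three items and the second counts two.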

I see. Good points that I didn't think about. Thanks, RamRS.

5.6 years ago
bari.ballew ▴ 470

Just using bash:

for i in *fasta; do n="${i%.fasta}"; sed -i.bak "s/>[^_]\+/>$n/" "$i"; done

This loops over all files in the current directory that end with "fasta". For each file:

  1. n="${i%.fasta}" removes the .fasta file extension (can be generalized to any extension by using n="${i%.*}")
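  2. sed "s/>[^_]\+/>$n/" matches ">" followed by one or more characters that are not underscores (that is, everything up to the first underscore on the line), and replaces it with ">" plus the filename minus extension found in the previous step. Note that \+ is a GNU sed extension; [^_][^_]* is the portable equivalent. Depending on your requirements, you may want to tighten up this regex, for example by anchoring it with ^.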
  3. The -i.bak part just tells sed to replace the string in the original file, but make a backup called <originalname>.bak.
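
Since -i edits the files in place, it can help to preview the result first. The following is only a sketch: the function name preview_headers is made up, the regex is anchored and written in the portable form, and it assumes the filename contains no "/" or "&" characters, which would be interpreted by sed in the replacement.

```shell
# Hypothetical helper: print the rewritten header lines of one FASTA file
# without modifying the file.
preview_headers() {
    file="$1"
    base=$(basename "$file")
    n="${base%.fasta}"               # filename minus the .fasta extension
    # -n together with the p flag prints only lines where a substitution
    # happened; [^_][^_]* means "one or more non-underscore characters".
    sed -n "s/^>[^_][^_]*/>$n/p" "$file"
}
```

Running preview_headers 201301234.fasta on the file from the question would print the new headers only, leaving the file as it was.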
5.6 years ago

Using seqkit replace:

seqkit replace -p '\.segment\.flu\d+' -r '' <your_fasta_file>

Explanation

replace: replace name/sequence by regular expression.

-p, --pattern string         search regular expression
-r, --replacement string     replacement. supporting capture variables