Question

Fasta header trimming for multiple delimiters

7.1 years ago

kor272 • 0

I am relatively new to Linux, and I have read through this post: Fasta header trimming, but it does not quite solve my problem.

This is the format of the sequences in my file:

>sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=1 SV=1

.. followed by the amino acid sequence.

I would like the format to be:

>P48347

+ sequence

As you can see, there are multiple delimiters, and I'm struggling to extract the characters I want correctly.

So far, my code is:

$ cut -d ' ' -f 1 | cut -d '|' -f 2 example.fasta > out.fasta

Which outputs:

P48347

+ sequence

I considered using sed to add the ">" back, but this seems a bit messy. I have also tried awk, but I am confused by how to use it with multiple delimiters and fasta format.

How do I extract the unique identifier in the header (P48347), without losing the '>' at the beginning?

Thanks in advance.

fasta bash • 2.7k views

updated 7.1 years ago by cpad0112 21k • written 7.1 years ago by kor272 • 0

score 1 · Answer 1 · 2017-11-20

7.1 years ago

5heikki 11k

awk 'BEGIN{FS="|"}{if(/^>/){print ">"$2}else{print $0}}' input > output
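On the "multiple delimiters" worry in the question: the one-liner above only ever splits on "|", so on header lines the accession is simply field 2 and the spaces never matter. A minimal sketch on a throwaway sample file follows. The temporary file, the placeholder sequence, and the second variant with a bracket-expression field separator are illustrative assumptions, not part of the answer above.

```shell
# Build a small sample FASTA file (hypothetical test data, placeholder sequence).
tmp=$(mktemp)
printf '%s\n' \
  '>sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=1 SV=1' \
  'MENEREKQVY' > "$tmp"

# Header lines: split on "|" and keep field 2; all other lines pass through unchanged.
awk_out=$(awk -F'|' '/^>/ {print ">" $2; next} {print}' "$tmp")
printf '%s\n' "$awk_out"

# Variant: treat both "|" and space as delimiters, e.g. to keep the entry name (field 3).
awk_name=$(awk -F'[| ]' '/^>/ {print ">" $3; next} {print}' "$tmp")
printf '%s\n' "$awk_name"

rm -f "$tmp"
```

The bracket expression `[| ]` is how awk can be told to split on more than one delimiter at once; for the accession alone it is not needed.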

Thanks, this works perfectly!

Reply 7.1 years ago by kor272 • 0

GenoMax · Answer 2 · 2017-11-20

7.1 years ago

bioplanet ▴ 60

Also in perl (if you want):

perl -e 'while(<>) {if($_=~/^>.*?\|(.*?)\|/) {$id=$1; print ">$id\n";} else {print $_;}}' input > output

updated 7.1 years ago by GenoMax 148k • written 7.1 years ago by bioplanet ▴ 60

score 1 · Answer 3 · 2017-11-20

Pure bash alternative:

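#!/bin/bash
# usage:
# $ bash extract_header_field.sh seqs.fasta > out.fasta

while IFS= read -r line ; do
        if [[ ${line:0:1} == ">" ]] ; then
                # Header line: split on '|' and print only the second field
                IFS='|' read -r -a header <<< "$line"
                echo ">${header[1]}"
        else
                # Sequence line: print unchanged (works for multi-line sequences too)
                echo "$line"
        fi
done < "$1"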

As a more general note, you can change this script to split a fasta up and retrieve any field you like by changing the IFS='|' part to whatever "internal field separator" you like (e.g. IFS=',').

Then just change the number in the line ...${header[1]}... to whatever chunk you like.

In this case, the header >sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=1 SV=1 contains 2 | symbols, so the array header has 3 elements:

>sp   # "${header[0]}"
P48347   # "${header[1]}"
14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=1 SV=1   # "${header[2]}"

(remember that it's 0-based indexing)
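To make the "change the separator and the index" idea concrete, here is a minimal sketch of a parameterised variant of the loop. The script name extract_field.sh, its three arguments, the scratch directory, and the placeholder sequence are all hypothetical. One deliberate difference from the script above: the leading ">" is stripped before splitting, so element 0 is "sp" rather than ">sp".

```shell
# Work in a scratch directory (all file names here are hypothetical).
workdir=$(mktemp -d)

# Sample input: one record with a placeholder sequence.
printf '%s\n' \
  '>sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=1 SV=1' \
  'MENEREKQVY' > "$workdir/seqs.fasta"

# Parameterised variant of the loop: file, separator, and 0-based field index.
cat > "$workdir/extract_field.sh" <<'EOF'
#!/bin/bash
# usage: bash extract_field.sh seqs.fasta '|' 1
file=$1 delim=$2 idx=$3
while IFS= read -r line || [[ -n $line ]]; do
    if [[ $line == ">"* ]]; then
        # Drop the leading ">" so that element 0 is "sp" rather than ">sp".
        IFS=$delim read -r -a header <<< "${line#>}"
        printf '>%s\n' "${header[$idx]}"
    else
        # Sequence line: print unchanged.
        printf '%s\n' "$line"
    fi
done < "$file"
EOF

field1=$(bash "$workdir/extract_field.sh" "$workdir/seqs.fasta" '|' 1)
field2=$(bash "$workdir/extract_field.sh" "$workdir/seqs.fasta" '|' 2)
printf '%s\n' "$field1" "$field2"

rm -rf "$workdir"
```

With index 1 the header becomes >P48347; with index 2 it becomes everything after the second "|", spaces included, because only "|" is used as the separator.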

score 1 · Answer 4 · 2017-11-20

Command to output headers with sequences:

$ sed '/^>/ s/\(>\).*|\(P[0-9]\+\)|.*/\1\2/' test.fa

Output:

>P48347
atgc
>P48348
tgac

Command to output only headers:

$ sed -n '/^>/p' test.fa | sed 's/\(>\).*|\(P[0-9]\+\)|.*/\1\2/'

Output:

>P48347
>P48348

input:

>sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=1 SV=1
atgc
>sp|P48348|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=1 SV=1
tgac
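Note that the pattern P[0-9]\+ in the commands above only matches accessions that start with "P" followed by digits, and \+ is a GNU sed extension. A minimal sketch of a more permissive variant, which keeps whatever sits between the first two "|" characters, might look like this. The temporary file and the second accession (Q00000) are made-up test data, not real records.

```shell
# Sample input (hypothetical data; Q00000 is a made-up accession).
tmp=$(mktemp)
printf '%s\n' \
  '>sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon' \
  'atgc' \
  '>sp|Q00000|EXAMPLE_ARATH hypothetical entry' \
  'tgac' > "$tmp"

# On header lines only, keep ">" plus the text between the first two "|".
sed_out=$(sed '/^>/ s/^>[^|]*|\([^|]*\)|.*/>\1/' "$tmp")
printf '%s\n' "$sed_out"

# Same substitution, but print header lines only.
sed_headers=$(sed -n '/^>/ s/^>[^|]*|\([^|]*\)|.*/>\1/p' "$tmp")
printf '%s\n' "$sed_headers"

rm -f "$tmp"
```

Because it uses only basic regular expression syntax, this form should not depend on GNU-specific features.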