Fasta header trimming for multiple delimiters
7.1 years ago
kor272 • 0

I am relatively new to Linux. I have read through the post "Fasta header trimming", but it does not quite solve my problem.

This is the format of the sequences in my file:

>sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=1 SV=1

... followed by the amino acid sequence.

I would like the format to be:

>P48347

+ sequence

As you can see, there are multiple delimiters, and I'm struggling to extract the characters I want correctly.

So far, my code is:

$ cut -d ' ' -f 1 example.fasta | cut -d '|' -f 2 > out.fasta

Which outputs:

P48347

+ sequence

I considered using sed to add the ">" back, but this seems a bit messy. I have also tried awk, but I am confused by how to use it with multiple delimiters and fasta format.

How do I extract the unique identifier in the header (P48347), without losing the '>' at the beginning?

Thanks in advance.

fasta bash • 2.7k views
7.1 years ago
5heikki 11k
awk 'BEGIN{FS="|"}{if(/^>/){print ">"$2}else{print $0}}' input > output
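
On the multiple-delimiter question: with FS set to "|", awk splits each line on pipes only, so on a header line $1 is ">sp", $2 is the accession, and the space-separated description ends up in $3 and is never printed. Lines that do not start with ">" are printed unchanged. A small sketch to try this on a test file (demo.fasta, its contents, and the trim_headers name are made up for illustration):

```shell
# Hypothetical two-record test file, made up for this example
printf '%s\n' \
  '>sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon' \
  'MENEREKQVY' \
  '>sp|Q9XYZ1|DEMO_ARATH made-up second record' \
  'MATREEN' > demo.fasta

# Header lines: keep only the second |-separated field.
# All other lines (the sequence) pass through untouched.
trim_headers() {
    awk -F'|' '/^>/ { print ">" $2; next } { print }' "$1"
}

trim_headers demo.fasta
```

This is the same logic as the one-liner above, just with -F instead of a BEGIN block.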

Thanks, this works perfectly!

7.1 years ago
bioplanet ▴ 60

Also in Perl (if you want). Header lines are trimmed to the accession and sequence lines are printed unchanged:

perl -ne 'if (/^>[^|]*\|([^|]*)\|/) { print ">$1\n" } else { print }' input > output
7.1 years ago
Joe 21k

Pure bash alternative:

#!/bin/bash
# usage:
# $ bash extract_header_field.sh seqs.fasta

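while IFS= read -r line ; do
        if [[ ${line:0:1} == ">" ]] ; then
                # Header line: split on '|' and print only the second field
                IFS='|' read -r -a header <<< "$line"
                printf '>%s\n' "${header[1]}"
        else
                # Sequence line: print unchanged (works for multi-line sequences too)
                printf '%s\n' "$line"
        fi
done < "$1"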

As a more general note, you can change this script to split a fasta up and retrieve any field you like by changing the IFS='|' part to whatever "internal field separator" you like (e.g. IFS=',').

Then just change the number in the line ...${header[1]}... to whatever chunk you like.

In this case, >sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=1 SV=1 contains two | symbols, so the header splits into three elements of the array $header:

>sp   # "${header[0]}"
P48347   # "${header[1]}"
14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=1 SV=1   # "${header[2]}"

(remember that it's 0-based indexing)
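
To make the "change the separator, change the index" idea concrete, here is a rough sketch that takes both as arguments. The function name extract_field and the demo file are made up for illustration, and it needs bash (arrays). Note one difference from the script above: it strips the leading ">" before splitting, so element 0 is "sp" rather than ">sp".

```shell
#!/bin/bash
# Sketch only: extract_field is a hypothetical helper, not part of the script above.
# $1 = fasta file, $2 = delimiter (default '|'), $3 = 0-based field index (default 1)
extract_field() {
    local file=$1 delim=${2:-"|"} idx=${3:-1}
    local line
    local -a header
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line == ">"* ]]; then
            # Drop the leading '>' and split the rest on the chosen delimiter
            IFS=$delim read -r -a header <<< "${line#>}"
            printf '>%s\n' "${header[idx]}"
        else
            # Sequence lines are printed unchanged
            printf '%s\n' "$line"
        fi
    done < "$file"
}

# Made-up one-record test file
printf '%s\n' \
  '>sp|P48347|14310_ARATH 14-3-3-like protein' \
  'MENEREKQVY' > demo_field.fasta

extract_field demo_field.fasta '|' 1   # accession only
extract_field demo_field.fasta ' ' 0   # everything before the first space
```

If the index is larger than the number of fields, the header comes out as a bare ">", so it is worth checking the output before overwriting anything.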

7.1 years ago

Command (headers trimmed, sequences kept):

$ sed '/^>/ s/\(>\).*|\(P[0-9]\+\)|.*/\1\2/' test.fa

Output with sequence:

>P48347
atgc
>P48348
tgac

Command (headers only):

$ sed -n '/^>/p' test.fa | sed 's/\(>\).*|\(P[0-9]\+\)|.*/\1\2/'

Output only headers:

>P48347
>P48348

input:

>sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=1 SV=1
atgc
>sp|P48348|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=1 SV=1
tgac
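
One caveat: the pattern P[0-9]\+ only matches accessions that start with P followed by digits, and \+ is a GNU sed extension. A sketch that keeps whatever sits between the first two pipes instead, assuming every header has at least two | characters and that your sed accepts -E (GNU and BSD sed both do). The file test2.fa, its second record, and the function name are made up for illustration:

```shell
# Made-up test file; the second accession is invented and does not start with P
printf '%s\n' \
  '>sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon' \
  'atgc' \
  '>sp|Q9XYZ1|DEMO_ARATH made-up record' \
  'tgac' > test2.fa

# On header lines only, replace the whole line with '>' plus the second
# |-separated field. Headers without two pipes are left as they are.
trim_accession() {
    sed -E '/^>/ s/^>[^|]*\|([^|]*)\|.*/>\1/' "$1"
}

trim_accession test2.fa
```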