Question

Parsing a portion of a fasta file header

5.2 years ago

Frieda ▴ 60

I have a big fasta file, which contains multiple headers followed by sequences. A small part of the file is shown below:

>XP_024503355.1 Integrator complex subunit 11 [Strongyloides ratti] >CEF64154.1 Integrator complex subunit 11 [Strongyloides ratti]
MKIVCLGAGCDVGRSCILLKIGNKRVMLDCGIHMGYEDEQKFPDFSFIANGNSLTEYIDCVLISHFHTDHSAALPYMTEVIGYDGPIYMTQPTKAISAVLLEDFRKITTQQRGEKSFFTSEMIKSCLKKVHVIELHEIVHVDEDLTIQAFYAGHVIGAVMFLIKVGNE

>XP_021327764.1 GTPase IMAP family member 8-like, partial [Danio rerio]
MASALVTQVMVAILVAQPSQFSTEKAKVAFVVNLLSGNTALWGSTVRDQKLPCCESFTTFTEKLKKVFDRAASGRESADFFKKEEKVHSYQTSLVNLPALTRLSEDEVMNQTLNC

>XP_025088316.1 U2 small nuclear ribonucleoprotein B''-like [Pomacea canaliculata] >PVD33693.1 hypothetical protein C0Q70_04953 [Pomacea canaliculata] 
MSLQPSHTIYINNLNEKIKKDELKKSLYAIFSQFGQILDIVALKTLKMRGQAFVIFKEINSAANALRSMQGFPFYDKPMRIQFSKKDSDIIAKMKGTYVEGE

>XP_006213240.1 lecithin retinol acyltransferase isoform X1 [Vicugna pacos] >XP_031525980.1 lecithin retinol acyltransferase isoform X1 [Vicugna pacos] >XP_031525983.1 lecithin retinol acyltransferase isoform X1 [Vicugna pacos] 
MKNPMLEAVSLVLEKLLLISNFKLFSSGTPDENKARTTYSVNSFLRGDVLEVPRTNFTHYGIYLGDNRVAHMMPDILLALTDDKGLTQKVVSNKRLVLGVIVKVASIRVDTVEDFAYGADILVNHLDKSLKKKALLNEEVAQRAEKL

I want to extract all the accession numbers that are preceded by >, but I do not want the whole header. Also, as can be seen in the first header, a single header can contain a second accession number.

The output should look like this:

 >XP_024503355.1

 >CEF64154.1

 >XP_021327764.1

 >XP_025088316.1

 >PVD33693.1

 ....

I have already tried the following, but this command returns the whole header:

grep -e ">" filename.fas

Is there a bash command I can use to extract only the accession numbers preceded by >, without the rest of the header?

bash fasta parse linux terminal • 1.3k views

ADD COMMENT • link updated 5.2 years ago by Mensur Dlakic ★ 29k • written 5.2 years ago by Frieda ▴ 60

If you can use GUI-based software, I would recommend having a look at SEDA (https://www.sing-group.org/seda/), our open source application for processing FASTA files containing DNA and protein sequences. The online manual (https://www.sing-group.org/seda/manual/index.html) provides detailed descriptions of all operations, including 'Rename header' (https://www.sing-group.org/seda/manual/operations.html#rename-header), which is specifically designed to rearrange and extract information from sequence headers.

ADD REPLY • link 5.2 years ago by Hugo ▴ 400

score 3 · Accepted Answer · 2020-04-29

5.2 years ago

GenoMax 152k

$ grep ">" filename.fas | tr " " "\n" | grep ">"
>XP_024503355.1
>CEF64154.1
>XP_021327764.1
>XP_025088316.1
>PVD33693.1

Another way would be grep -Eo ">[^ ]+" file, although I'm not sure how that will respond to multi-matches. On second thought, yours is better and simpler.

ADD REPLY • link 5.2 years ago by Ram 45k
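
On the multi-match question: with -o, grep prints every non-overlapping match on its own line, so a header carrying two accessions yields two output lines. A minimal sketch to check this, assuming a throwaway file created with mktemp (the demo_fasta and accessions names are only for illustration) that holds two shortened records from the question:

```shell
# Write a small test file; the sequences are truncated for brevity.
demo_fasta=$(mktemp)
cat > "$demo_fasta" <<'EOF'
>XP_024503355.1 Integrator complex subunit 11 [Strongyloides ratti] >CEF64154.1 Integrator complex subunit 11 [Strongyloides ratti]
MKIVCLGAGCDVGRSCILLK
>XP_021327764.1 GTPase IMAP family member 8-like, partial [Danio rerio]
MASALVTQVMVAILVAQPSQ
EOF

# -E: extended regular expressions; -o: print only the matched text,
# one match per output line.
# '>[^ ]+' matches '>' followed by one or more non-space characters.
accessions=$(grep -Eo '>[^ ]+' "$demo_fasta")
printf '%s\n' "$accessions"
# >XP_024503355.1
# >CEF64154.1
# >XP_021327764.1
```

Because the pattern is not anchored with ^, the second accession on the first header line is picked up as well.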

score 3 · Accepted Answer · 2020-04-29

5.2 years ago

Pierre Lindenbaum 166k

grep -o '^>[^[:space:]]*' file.fasta

score 3 · Accepted Answer · 2020-04-29

5.2 years ago

Mensur Dlakic ★ 29k

grep ">" file.fas | awk '{print $1}'

>XP_024503355.1
>XP_021327764.1
>XP_025088316.1
>XP_006213240.1

Are you sure that you need the > characters? If not:

grep ">" file.fas | perl -p -e 's/\>//g' | awk '{print $1}'

XP_024503355.1
XP_021327764.1
XP_025088316.1
XP_006213240.1
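
Both commands above print only the first field of each header, so secondary accessions such as CEF64154.1 are dropped. If every accession is wanted without the leading >, one possible awk-only variant is sketched below; the temporary file and the sample and ids names are assumptions made for the example, and it relies on each accession being a space-separated field that starts with >:

```shell
# Write a small test file; the sequences are truncated for brevity.
sample=$(mktemp)
cat > "$sample" <<'EOF'
>XP_024503355.1 Integrator complex subunit 11 [Strongyloides ratti] >CEF64154.1 Integrator complex subunit 11 [Strongyloides ratti]
MKIVCLGAGCDVGRSCILLK
>XP_021327764.1 GTPase IMAP family member 8-like, partial [Danio rerio]
MASALVTQVMVAILVAQPSQ
EOF

# On header lines only (/^>/), loop over all whitespace-separated fields
# and print each field that starts with '>', dropping the '>' itself.
ids=$(awk '/^>/ { for (i = 1; i <= NF; i++) if ($i ~ /^>/) print substr($i, 2) }' "$sample")
printf '%s\n' "$ids"
# XP_024503355.1
# CEF64154.1
# XP_021327764.1
```

Printing $i instead of substr($i, 2) keeps the > characters.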