Parsing a portion of a fasta file header
4.6 years ago
Frieda ▴ 60

I have a large FASTA file containing multiple headers, each followed by its sequence. A small part of the file is shown below:

>XP_024503355.1 Integrator complex subunit 11 [Strongyloides ratti] >CEF64154.1 Integrator complex subunit 11 [Strongyloides ratti]
MKIVCLGAGCDVGRSCILLKIGNKRVMLDCGIHMGYEDEQKFPDFSFIANGNSLTEYIDCVLISHFHTDHSAALPYMTEVIGYDGPIYMTQPTKAISAVLLEDFRKITTQQRGEKSFFTSEMIKSCLKKVHVIELHEIVHVDEDLTIQAFYAGHVIGAVMFLIKVGNE

>XP_021327764.1 GTPase IMAP family member 8-like, partial [Danio rerio]
MASALVTQVMVAILVAQPSQFSTEKAKVAFVVNLLSGNTALWGSTVRDQKLPCCESFTTFTEKLKKVFDRAASGRESADFFKKEEKVHSYQTSLVNLPALTRLSEDEVMNQTLNC

>XP_025088316.1 U2 small nuclear ribonucleoprotein B''-like [Pomacea canaliculata] >PVD33693.1 hypothetical protein C0Q70_04953 [Pomacea canaliculata] 
MSLQPSHTIYINNLNEKIKKDELKKSLYAIFSQFGQILDIVALKTLKMRGQAFVIFKEINSAANALRSMQGFPFYDKPMRIQFSKKDSDIIAKMKGTYVEGE

>XP_006213240.1 lecithin retinol acyltransferase isoform X1 [Vicugna pacos] >XP_031525980.1 lecithin retinol acyltransferase isoform X1 [Vicugna pacos] >XP_031525983.1 lecithin retinol acyltransferase isoform X1 [Vicugna pacos] 
MKNPMLEAVSLVLEKLLLISNFKLFSSGTPDENKARTTYSVNSFLRGDVLEVPRTNFTHYGIYLGDNRVAHMMPDILLALTDDKGLTQKVVSNKRLVLGVIVKVASIRVDTVEDFAYGADILVNHLDKSLKKKALLNEEVAQRAEKL

I want to extract all the accession numbers that are preceded by >, but I do not want the whole header. Also, as can be seen, a single header can contain a second accession number (as in the first header).

The output should look like this:

 >XP_024503355.1

 >CEF64154.1

 >XP_021327764.1

 >XP_025088316.1

 >PVD33693.1

 ....

I have already tried the following, but this command returns the whole header:

grep -e ">" filename.fas

Is there a bash command I can use to extract only the accession numbers preceded by >, without the rest of the header?

bash fasta parse linux terminal • 1.0k views

If you can use GUI-based software, I recommend having a look at SEDA (https://www.sing-group.org/seda/), our open source application for processing FASTA files containing DNA and protein sequences. The online manual (https://www.sing-group.org/seda/manual/index.html) provides detailed descriptions of all operations, including 'Rename header' (https://www.sing-group.org/seda/manual/operations.html#rename-header), which is specifically designed to rearrange and extract information from sequence headers.

4.6 years ago
GenoMax 147k
$ grep ">" filename.fas | tr " " "\n" | grep ">"
>XP_024503355.1
>CEF64154.1
>XP_021327764.1
>XP_025088316.1
>PVD33693.1

Another way would be grep -Eo ">[^ ]+" file, although I'm not sure how that will respond to multi-matches. On second thought, yours is better and simpler.
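
On the multi-match question: with -o, grep prints each non-overlapping match on its own line, so a header holding several IDs yields one output line per ID. A minimal sketch on a made-up two-record sample built from the question's headers (the file name sample.fas is hypothetical):

```shell
# Build a tiny sample; sample.fas is a hypothetical file name.
cat > sample.fas <<'EOF'
>XP_024503355.1 Integrator complex subunit 11 [Strongyloides ratti] >CEF64154.1 Integrator complex subunit 11 [Strongyloides ratti]
MKIVCLGAGCDVGRSC
>XP_021327764.1 GTPase IMAP family member 8-like, partial [Danio rerio]
MASALVTQVMVAILVA
EOF

# -o prints every match on its own line, so both IDs of the first header appear.
grep -Eo '>[^ ]+' sample.fas
```

This assumes that > never occurs inside the descriptive part of a header; if it does, those tokens would be printed as well.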

4.6 years ago
grep -o '>[^[:space:]]*' file.fasta
4.6 years ago
Mensur Dlakic ★ 28k

grep ">" file.fas | awk '{print $1}'

>XP_024503355.1
>XP_021327764.1
>XP_025088316.1
>XP_006213240.1

Are you sure that you need the > characters? If not:

grep ">" file.fas | perl -p -e 's/\>//g' | awk '{print $1}'

XP_024503355.1
XP_021327764.1
XP_025088316.1
XP_006213240.1

Some headers contain more than one FASTA ID, and OP wants those too.
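
One possible way to adapt the awk approach so that every ID in a header is printed, sketched under the assumption that IDs are whitespace-delimited tokens starting with > (the function name fasta_ids is made up for illustration):

```shell
# fasta_ids is a hypothetical helper name, not an existing tool.
# For each header line, print every whitespace-delimited token that
# starts with ">", with the leading ">" stripped.
fasta_ids() {
  awk '/^>/ { for (i = 1; i <= NF; i++) if ($i ~ /^>/) print substr($i, 2) }' "$@"
}

# Made-up two-record sample based on headers from the question, read from stdin.
printf '%s\n' \
  ">XP_025088316.1 U2 small nuclear ribonucleoprotein B''-like [Pomacea canaliculata] >PVD33693.1 hypothetical protein C0Q70_04953 [Pomacea canaliculata]" \
  "MSLQPSHTIYINNLNEK" \
  ">XP_021327764.1 GTPase IMAP family member 8-like, partial [Danio rerio]" \
  "MASALVTQVMVAILVA" | fasta_ids
```

On a real file this would be called as fasta_ids file.fas. To keep the > characters, print $i instead of substr($i, 2).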
