I have a large FASTA file containing many headers, each followed by its sequence. A small part of the file is shown below:
>XP_024503355.1 Integrator complex subunit 11 [Strongyloides ratti] >CEF64154.1 Integrator complex subunit 11 [Strongyloides ratti]
MKIVCLGAGCDVGRSCILLKIGNKRVMLDCGIHMGYEDEQKFPDFSFIANGNSLTEYIDCVLISHFHTDHSAALPYMTEVIGYDGPIYMTQPTKAISAVLLEDFRKITTQQRGEKSFFTSEMIKSCLKKVHVIELHEIVHVDEDLTIQAFYAGHVIGAVMFLIKVGNE
>XP_021327764.1 GTPase IMAP family member 8-like, partial [Danio rerio]
MASALVTQVMVAILVAQPSQFSTEKAKVAFVVNLLSGNTALWGSTVRDQKLPCCESFTTFTEKLKKVFDRAASGRESADFFKKEEKVHSYQTSLVNLPALTRLSEDEVMNQTLNC
>XP_025088316.1 U2 small nuclear ribonucleoprotein B''-like [Pomacea canaliculata] >PVD33693.1 hypothetical protein C0Q70_04953 [Pomacea canaliculata]
MSLQPSHTIYINNLNEKIKKDELKKSLYAIFSQFGQILDIVALKTLKMRGQAFVIFKEINSAANALRSMQGFPFYDKPMRIQFSKKDSDIIAKMKGTYVEGE
>XP_006213240.1 lecithin retinol acyltransferase isoform X1 [Vicugna pacos] >XP_031525980.1 lecithin retinol acyltransferase isoform X1 [Vicugna pacos] >XP_031525983.1 lecithin retinol acyltransferase isoform X1 [Vicugna pacos]
MKNPMLEAVSLVLEKLLLISNFKLFSSGTPDENKARTTYSVNSFLRGDVLEVPRTNFTHYGIYLGDNRVAHMMPDILLALTDDKGLTQKVVSNKRLVLGVIVKVASIRVDTVEDFAYGADILVNHLDKSLKKKALLNEEVAQRAEKL
I want to extract all the accession numbers that are preceded by >, but I do not want the rest of the header. Also, as the first line shows, a single header can contain more than one accession number.
The output should look like this:
>XP_024503355.1
>CEF64154.1
>XP_021327764.1
>XP_025088316.1
>PVD33693.1
....
I have already tried the following, but this command returns the whole header:
grep -e ">" filename.fas
Is there a bash command I can use to extract only the accession numbers preceded by >, without the rest of the header?
If you can use GUI-based software, I would recommend having a look at SEDA (https://www.sing-group.org/seda/), our open-source application for processing FASTA files containing DNA and protein sequences. The online manual (https://www.sing-group.org/seda/manual/index.html) provides detailed descriptions of all operations, including 'Rename header' (https://www.sing-group.org/seda/manual/operations.html#rename-header), which is specifically designed to rearrange and extract information from sequence headers.
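If a command-line route is preferred, as the question asks, something along these lines may work. It is only a sketch and rests on two assumptions: every accession starts right after a > with no space in between, and it ends at the first whitespace character. The function name extract_accessions is made up for illustration; the grep options used (-h, -o, -E) are available in both GNU and BSD grep.

```bash
# Hypothetical helper: print every ">ACCESSION" token found in FASTA headers.
# Reads the files given as arguments, or standard input if none are given.
extract_accessions() {
    # 1) keep only header lines (those starting with ">"), without file names
    # 2) -o prints each match on its own line: ">" plus the run of
    #    non-whitespace characters that follows it
    grep -h '^>' "$@" | grep -oE '>[^[:space:]]+'
}

# Usage (filename.fas is the file name from the question):
#   extract_accessions filename.fas > accessions.txt
```

For the first header in the question this prints >XP_024503355.1 and >CEF64154.1 on separate lines. The reason the original command returns whole headers is that grep prints the entire matching line unless -o is given.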