Try SeqKit, a cross-platform and ultrafast toolkit for FASTA/Q file manipulation.
Just download the binary executable files and run.
Fake sequences:
$ cat seqs.fa
>lcl|DS180868.1_cds_EDM56185.1_1 [protein=putative lipopolysaccharide biosynthesis glycosyltransferase] [protein_id=EDM56185.1] [location=complement(116..658)]
TCAGCAAACGGTCTCAGACAATGCGATTATCACCATTAAAGGCACACCGACAAAAGCTGGTGAATCTGGTGCAGCACTTA
>lcl|DS000000.1_cds_EDM00000.1_1 [protein=another protein] [protein_id=EDM56185.1] [location=complement(116..658)]
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
>lcl|DS011111.2_cds_EDM01111.1_1 [protein=putative another protein 2] [protein_id=EDM56185.1] [location=complement(116..658)]
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
Just ONE command with SeqKit.
By default, seqkit sort
sorts sequences by sequence identifier.
For this case, we use --id-regexp "protein=(.+?)\]"
to specify sequence IDs.
$ seqkit sort --id-regexp "protein=(.+?)\]" seqs.fa
[INFO] read sequences ...
[INFO] 3 sequences loaded
[INFO] sorting ...
[INFO] output ...
>lcl|DS000000.1_cds_EDM00000.1_1 [protein=another protein] [protein_id=EDM56185.1] [location=complement(116..658)]
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
cccccccccccccccccccc
>lcl|DS011111.2_cds_EDM01111.1_1 [protein=putative another protein 2] [protein_id=EDM56185.1] [location=complement(116..658)]
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttt
>lcl|DS180868.1_cds_EDM56185.1_1 [protein=putative lipopolysaccharide biosynthesis glycosyltransferase] [protein_id=EDM56185.1] [location=complement(116..658)]
TCAGCAAACGGTCTCAGACAATGCGATTATCACCATTAAAGGCACACCGACAAAAGCTGG
TGAATCTGGTGCAGCACTTA
If you have large number of sequences, use --two-pass
mode which has lower memory usage:
$ seqkit sort --id-regexp "protein=(.+?)\]" --two-pass seqs.fa
See more detail of seqkit sort
.
------------[update]---------------
Protein names could by duplicated, for example, "hypothetical protein".
Therefore, storing records with map
/hash
/dict
is not the best way,
which needs unique keys. You need to create new unique names for sorting
and keep the original names for output.
A better method is:
- Converting FASTA format to tab-delimited table.
- Creating a new column containing the protein names.
- Sorting by the new column with
sort
or other command
- Converting back to FASTA format.
Here's an implemention combining SeqKit and csvtk.
FASTA format was 1) converted to 3-column tabular format (header, sequence, quality (blank for FASTA)), and 2) protein names in FASTA header column (1th) was extracted as a new (4th) column. The table was then 3) sorted (in memory) by 4th column and 4) converted back to FASTA format.
cat seqs.fa | seqkit fx2tab | csvtk -H -t mutate -p "protein=(.+?)\]" -n protein | csvtk -H -t sort -k 4 | seqkit tab2fx > sorted_by_protein.fa
Fake sequences (with duplicated protein names)
$ cat seqs.fa
>id1 [protein=putative glycosyltransferase]
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
>id2 [protein=hypothetical protein]
ccccccccccccccccccccccccccccccccccccccccccccc
>id3 [protein=putative another protein 2]
ttttttttttttttttttttttttttttttttttttttttttttt
>id4 [protein=hypothetical protein]
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Result
$ head sorted_by_protein.fa
>id4 [protein=hypothetical protein]
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>id2 [protein=hypothetical protein]
ccccccccccccccccccccccccccccccccccccccccccccc
>id3 [protein=putative another protein 2]
ttttttttttttttttttttttttttttttttttttttttttttt
>id1 [protein=putative glycosyltransferase]
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
it is running perfectly but out put is coming with out header
Are you working with a Windows-sourced text file with weird line endings, maybe? Is your input a true single-line FASTA file? Might be worth it to double-check your inputs. Perhaps troubleshoot the value of
header
in theelse
block with aprint header;
statement.After changing header into single-line form the code is running perfectly
thank you Alex
Just curious, do you have FASTA records where there could be duplicates of protein names?
yes, obviously. therefore storing records with map/hash/dict is not the best way.
There are a number of ways to deal with collisions in hash tables that all work, but
awk
is probably not the best tool for that, which is why I ask.