I make pairwise alignment with ./water
form the emboss package.
I want all fasta identifier of the sequences which align with a similarity of > 60%. I can find no option to set a threshold for the identity, similarity or the alignment score. I can parse the output, but maybe you have any tips here, or I have overlooked something? Additional, the output does not contain the whole .fasta names of the match, only a part of it like:
output from ./water
:
# Aligned_sequences: 2
# 1: 344TS28_contig130193
# 2: 1648TS28_contig03869
# Matrix: EDNAFULL
# Gap_penalty: 5.0
# Extend_penalty: 0.5
#
# Length: 732
# Identity: 287/732 (39.2%)
# Similarity: 287/732 (39.2%)
# Gaps: 424/732 (57.9%)
# Score: 765.5
fasta identifier in sequence:
>gene_1|GeneMark.hmm|344TS28_contig130193
...
Is this normal, or a bug (maybe it will be correct, when I replace the |
)? Thank you in advance!
Edit: If I increase the costs for gap, the output will contain only sequences with a higher similarity, but I can't detect any fixed score-threshold.
Edit2: The problems with the fasta identifiers will go away when I replace all |
with _
(sed -i 's/|/_/g' seq.fasta
)
For 'fasta' format input EMBOSS will detect and parse a range of identifier formats, for example:
to extract the entry identifier. In these examples 'db' or 'DB' refers to the database name, 'acc' or 'ACC' is the entry accession, 'id' or 'ID' is the entry identifier or name, and 'Des' is the description.
If you do not want the fasta format header to be parsed to extract this information, but instead to use basic fasta format parsing where the first token on the header line is the identifier, in all cases then force the use of the 'pearson' format handling for the input by using '-sformat pearson' in your command line.
For more information about the handling of inputs in EMBOSS see the EMBOSS User Manual.