Question: What are the recommended parameters for local protein alignment, and how important are they?
As far as I understand, BLOSUM62 is the typical choice for the substitution matrix.
The example below uses "gap opening penalty -4, extension penalty -1". Is that a typical choice? (For example, ChatGPT suggests 11, 1, which is indeed what NCBI blastp uses: see the bottom of the page https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins and open "Algorithm parameters".)
Link to the example: https://www.kaggle.com/code/shtrausslearning/biopython-bioinformatics-basics?scriptVersionId=128567832&cellId=59
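To get a feel for how much the gap parameters matter, here is a minimal sketch in plain Python of a local (Smith-Waterman) alignment score with affine gaps. It is not the code from the linked notebook. The sequences are made up, and a flat match/mismatch score (+5/-4) stands in for BLOSUM62 to keep it short; both are assumptions for illustration only. In this sketch a gap of length k costs open_penalty + (k - 1) * extend_penalty.

```python
def local_affine_score(a, b, open_penalty, extend_penalty, match=5, mismatch=-4):
    """Best local alignment score with affine gaps (Gotoh-style recurrences).

    open_penalty is charged for the first gap position,
    extend_penalty for every additional position of the same gap.
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # H: best score of a local alignment ending at (i, j)
    # E: same, but ending with a gap in `a` (a letter of `b` is unpaired)
    # F: same, but ending with a gap in `b` (a letter of `a` is unpaired)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(E[i][j - 1] - extend_penalty, H[i][j - 1] - open_penalty)
            F[i][j] = max(F[i - 1][j] - extend_penalty, H[i - 1][j] - open_penalty)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets a local alignment restart anywhere.
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


# Toy pair: seq_b is seq_a with the two residues "HI" deleted.
seq_a = "ACDEFGHIKLMNPQ"
seq_b = "ACDEFGKLMNPQ"

for open_penalty in (4, 12, 40):
    score = local_affine_score(seq_a, seq_b, open_penalty, 1)
    print(open_penalty, score)
```

With cheap gaps (4 or 12) the best local alignment bridges the deletion and covers all 12 shared residues; with a very expensive gap (40) it falls back to one ungapped 6-residue block. So for this toy pair the gap penalty decides whether the two halves are reported as one alignment or not.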
PS (from ChatGPT)
Question: What are the recommended parameters for local protein alignment?
ChatGPT
The choice of parameters for local protein alignment depends on the specific algorithm or tool being used. However, I can provide you with some commonly recommended parameters for local protein alignment using the popular tool BLAST (Basic Local Alignment Search Tool) as an example:
Word Size: This parameter specifies the length of the word used as a basis for the initial seed match. The default value is typically 3, but increasing it to 4 or 5 can improve sensitivity at the cost of slower performance.
Expectation (E) Value: The E-value threshold determines the significance level at which alignments are reported. Lower E-values indicate more significant alignments. Commonly used values for local protein alignments range from 0.01 to 0.001, depending on the desired stringency.
Gap Costs: Gap costs determine the penalties assigned for gap opening and gap extension. Typically, there are two types of gap costs: a gap opening cost and a gap extension cost. The values for these parameters depend on the scoring system used and the characteristics of the sequences being aligned. Default values are commonly used, but they can be adjusted based on specific needs.
Scoring Matrix: The scoring matrix is used to assign scores to different residue substitutions during the alignment process. Popular choices include the BLOSUM (Blocks Substitution Matrix) and PAM (Point Accepted Mutation) matrices. The specific matrix and its version depend on the nature of the sequences being aligned.
Question: What are typical choices for the gap costs?
ChatGPT
The choice of gap costs for local protein alignment depends on the specific alignment algorithm or tool being used. However, two common types of gap costs are the gap opening cost and the gap extension cost.
In general, the gap opening cost is set higher than the gap extension cost because opening a gap is considered a more significant event than extending an existing gap. This reflects the biological expectation that introducing a gap requires more energy or evolutionary pressure.
For local protein alignment using tools like BLAST or BLASTP, the default gap opening and extension costs are often as follows:
Gap Opening Cost: 11
Gap Extension Cost: 1
These default values work well for many applications, but they can be adjusted based on specific requirements and the nature of the sequences being aligned. Increasing the gap opening cost relative to the extension cost makes the algorithm less likely to introduce long gaps, which can be useful when aligning closely related sequences or when looking for short conserved motifs. Conversely, decreasing the gap opening cost relative to the extension cost allows for longer gaps and can be helpful when aligning more distantly related sequences.
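One caveat when comparing "11, 1" with "-4, -1": the numbers are only comparable once the gap-cost convention is fixed. As far as I know, BLAST charges existence + k * extension for a gap of length k, whereas Biopython's pairwise2 by default charges the open penalty for the first gap position and the extension penalty only for the following ones. The sketch below just does that arithmetic; the function names are mine, and whether the linked notebook uses the second convention is an assumption.

```python
def gap_cost_blast(k, existence, extension):
    """Cost of a gap of length k when every position pays the extension cost."""
    return existence + k * extension


def gap_cost_first_position(k, open_penalty, extend_penalty):
    """Cost of a gap of length k when the first position pays only open_penalty."""
    return open_penalty + (k - 1) * extend_penalty


for k in (1, 2, 5, 10):
    blast_11_1 = gap_cost_blast(k, 11, 1)
    first_pos_4_1 = gap_cost_first_position(k, 4, 1)
    # Under the first-position convention, open=12, extend=1 reproduces BLAST's 11/1.
    first_pos_12_1 = gap_cost_first_position(k, 12, 1)
    print(k, blast_11_1, first_pos_4_1, first_pos_12_1)
```

Under these assumptions a single-residue gap costs 12 with the blastp defaults but only 4 with "-4, -1", so the example's setting is considerably more permissive toward gaps than blastp.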
PPS
Some other experiments with alignments:
BioPython: https://www.kaggle.com/code/alexandervc/cafa5-18-alignments-biopython-compare
Skbio: https://www.kaggle.com/code/alexandervc/cafa5-19-alignments-skbio
Does this mean you understand pident now and are moving forward?
Trying to do so ))
Did you see my updated post? Once we get past that, I'd be willing to tackle this.
Yes, thank you. Biostars does not send notifications on updates, so I found it later. So basically we need to take minimum - mismatch; there is a small discrepancy from gapopen (if it is non-zero), but for the moment it can be neglected.