How to compare protein sequences?

In the last post I described an analysis that depends on comparing protein sequences. There are two different ways to do the comparison, and I need to decide which is more appropriate for our analysis.

Both methods rely on first aligning the amino acids in the proteins to be compared (here we'll only be comparing two proteins at a time), and then comparing the amino acids at each aligned position. The goal of the alignment is to pair up amino acids that are homologous, that is, those that are similar because they descend from the same position in the ancestral sequence both proteins evolved from. (If the proteins are not themselves homologous, the analysis can't be done.)

In studies of evolutionary relationships, the usual method of comparison is to simply count how many of the aligned positions have identical amino acids, giving a "% identity" score. In studies of protein function, scores based on the functional similarity of the aligned amino acids are often used instead. These rely on a matrix that gives a similarity score for every pairwise combination of amino acids. The matrix scores are themselves derived from comparisons of large numbers of aligned amino acids, with the highest scores going to pairs of amino acids that have most often evolved to perform the same role in a protein. For example, valine and leucine are commonly found in homologous positions, and matrices give this pair a high score. Wikipedia gives a good explanation of the use of matrices to compare protein sequences.
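To make the difference between the two scores concrete, here is a minimal Python sketch that computes both for a pair of already-aligned sequences. Everything in it is an assumption made for illustration: the function names are hypothetical, the score table is a toy with made-up values (it is not a published matrix), gaps are written as "-", and gapped positions are simply skipped, which is only one of several possible conventions.

```python
# Toy similarity scores, invented for illustration only (NOT a published matrix).
# Identical pairs score highest; the similar pair V/L scores positive;
# the dissimilar pair K/D scores negative.
TOY_SCORES = {
    ("V", "V"): 4, ("L", "L"): 4, ("K", "K"): 5, ("D", "D"): 6,
    ("V", "L"): 2,
    ("K", "D"): -1,
}
DEFAULT_SCORE = -2  # assumed score for any pair missing from the toy table
GAP = "-"


def pair_score(a, b):
    """Look up the score for a pair of residues, in either order."""
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    return TOY_SCORES.get((b, a), DEFAULT_SCORE)


def aligned_pairs(seq_a, seq_b):
    """Return the residue pairs at aligned positions, skipping any gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return [(a, b) for a, b in zip(seq_a, seq_b) if a != GAP and b != GAP]


def percent_identity(seq_a, seq_b):
    """Percentage of ungapped aligned positions with identical residues."""
    pairs = aligned_pairs(seq_a, seq_b)
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def similarity_score(seq_a, seq_b):
    """Sum of the toy matrix scores over ungapped aligned positions."""
    return sum(pair_score(a, b) for a, b in aligned_pairs(seq_a, seq_b))


if __name__ == "__main__":
    a, b = "VKLD", "LKVD"
    print(percent_identity(a, b))   # 50.0
    print(similarity_score(a, b))   # 15
```

In this example only two of the four positions are identical, so the identity is 50%, but the two mismatches are both valine/leucine pairs, which the toy matrix counts as similar, so they still add to the similarity score rather than counting for nothing.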

I'll post later about the issues raised by our analysis problem.

