Sequence Alignment

Lining up with scores instead of counts

Sequence alignment generalises edit distance and LCS. Instead of counting operations, it assigns a score to every column of an alignment: a reward for a match, a penalty for a mismatch, and a penalty for a gap where one sequence has a character and the other has a blank. The goal is the alignment with the best total score. This framing is central in computational biology for comparing DNA and proteins.

The dynamic program

Like edit distance, alignment fills a table over prefixes. Each cell chooses the best of three moves:

Align the two current characters, adding a match or mismatch score from the diagonal.
Gap in one sequence, coming from the left with a gap penalty.
Gap in the other sequence, coming from above with a gap penalty.

The global version, known as Needleman Wunsch, optimises over the whole strings. A local variant, Smith Waterman, adds a zero floor so the alignment can start fresh anywhere, finding the best matching region rather than forcing the ends to line up.

Gap penalties

Real biology often uses an affine gap model, charging more to open a gap than to extend one, since a single long insertion is more plausible than many scattered ones. This needs a slightly richer recurrence that tracks whether a gap is currently open.

Key idea

Sequence alignment scores matches, mismatches, and gaps and fills a prefix table choosing the best of align, gap left, or gap up; global alignment spans whole strings while a zero floor yields the best local region.

Sequence Alignment

Lining up with scores instead of counts

The dynamic program

Gap penalties

Key idea

Check yourself