Introduction
A subsequence is a sequence you can derive from another by deleting zero or more elements without changing the order of remaining elements. For example, "ACE" is a subsequence of "ABCDE".
The Longest Common Subsequence (LCS) problem asks: given two strings, find the longest subsequence present in both.
For strings "ABCDE" and "ACE", the LCS is "ACE" with length .
You'll encounter LCS in:
- Diff tools: Git uses LCS to show file changes
- Bioinformatics: Comparing DNA sequences
- Spell checkers: Measuring string similarity
Intuition
Let's trace through strings "ABCD" and "AEBD".
You start by comparing characters from the end. If the last characters match, they must be part of the LCS. If they don't, you try two options: skip one character from the first string, or skip one from the second.
Compare 'D' and 'D'. They match. Include 'D' in the LCS. Now solve for "ABC" and "AEB".
Compare 'C' and 'B'. They don't match. You check both: LCS("AB", "AEB") and LCS("ABC", "AE"). Take the longer result.
Continue recursively until you reach empty strings.
The key insight: when characters match, include them. When they don't, branch into two subproblems and pick the better one.
Algorithm
Define as the length of the LCS for the first characters of string and the first characters of string .
Base cases:
- for all (empty first string)
- for all (empty second string)
Recurrence relation:
If :
Otherwise:
When characters match, you extend the previous LCS by . When they don't, you take the maximum of excluding either character.
Implementation
function lcs(s1, s2):
m = length(s1)
n = length(s2)
dp = array of size (m+1) x (n+1), initialized to 0
for i from 1 to m:
for j from 1 to n:
if s1[i-1] == s2[j-1]:
dp[i][j] = dp[i-1][j-1] + 1
else:
dp[i][j] = max(dp[i-1][j], dp[i][j-1])
return dp[m][n]
Reconstructing the LCS string:
Start at . If characters match, prepend that character and move diagonally to . If they don't match, move to whichever neighbor has the larger value. Stop when you reach or .