How Diff Works — Finding Changes with the Longest Common Subsequence

Diff is the mechanism that compares two texts and shows "what was added and what was deleted." We look at diffs almost every day — in Git commits, code reviews, and proofreading prose. Behind the scenes, many implementations rely on an idea called the Longest Common Subsequence (LCS). This article walks through the LCS that underlies diff with a small dynamic-programming example, and lays out precisely how additions and deletions are derived from it.

The bottom line first: most diffs first compute the longest sequence that can remain common to both inputs (the LCS). If the common part is treated as "unchanged," then the elements that remain only on one side are deletions, and the elements that appear only on the other side are additions — they fall out naturally. The LCS can be computed as an O(mn) table with dynamic programming.

1. What a diff is — line-level and character-level differences

A diff compares two texts, the before (old) and the after (new), and expresses their differences as additions and deletions (and unchanged parts). The word can refer to the "process" or to its "output."

There are mainly two choices for the smallest unit of comparison.

In both cases, the question at the root is the same: "which parts can be kept common across the two sequences?" Maximizing the common part minimizes what must be displayed as a change, producing a diff that is easy for humans to read. Finding this "longest part that can remain common" is exactly what the LCS, described next, does.

2. The idea behind the Longest Common Subsequence (LCS)

The LCS (Longest Common Subsequence) is the longest sequence you can form by choosing elements that appear in the same order in both sequences. The key point is that it need not be contiguous.

In diff, a text is regarded as a "sequence of elements (lines or characters)," and its LCS is used as "the largest skeleton that can remain common to both." For instance, the longer the LCS of two sequences of lines, the more lines can stay common without moving, and the smaller the diff becomes.

Through "maximizing the common part," the LCS leads to "minimizing the change." The basic idea is that if you maximize the elements that can remain common, the remainder (the additions and deletions) is minimized, giving the most concise diff.

3. Computing the LCS with dynamic programming

The LCS can be found with dynamic programming (DP). For two sequences of length m and n, build an (m+1)×(n+1) table L, and define L[i][j] as "the length of the LCS of the first i elements of the first sequence and the first j elements of the second." The recurrence is as follows.

Let us check with a small example. We compute the LCS of the strings X = ABCBDAB (length 7) and Y = BDCAB (length 5). Filling the table with the recurrence above, the bottom-right cell L[7][5] gives the LCS length.

 BDCAB
000000
A000011
B011112
C011222
B011223
D012223
A012233
B012234

The bottom-right value is 4. That is, the LCS length of ABCBDAB and BDCAB is 4. Actual sequences include BCAB and BDAB (both appear in the same order in both strings). The time complexity is the table size itself, O(mn), and the memory of a naive implementation is also O(mn).

The LCS is not necessarily unique. As in the example above, it is common for there to be more than one common subsequence of length 4. The reason diff tools can "show the same change differently depending on the implementation" stems from this freedom in how the LCS is chosen.

4. Deriving additions and deletions from the LCS

From the table that gives the LCS length, you can construct the actual diff by tracing the path back (backtracking). Start at the bottom-right cell and move toward the top-left with the following rules.

  1. If the trailing elements at the current cell are equal, that element is part of the LCS (i.e., unchanged). Move to the upper-left cell (i-1, j-1).
  2. If they are not equal, move toward the larger of L[i-1][j] and L[i][j-1]. Moving up means the element on the X side is a deletion; moving left means the element on the Y side is an addition.
  3. Stop when you reach an edge of the table (i=0 or j=0). Reversing the order you traced gives the diff sequence.

In other words, elements in the LCS = "unchanged," kept on both sides; elements only on the X side = deletions; elements only on the Y side = additions. The small example below shows the correspondence.

Old (X)New (Y)Diff marker
appleapple (unchanged)
banana- deletion
blueberry+ addition
cherrycherry (unchanged)

Here apple and cherry are the LCS (the lines that remain common); banana exists only in the old, so it is a deletion, and blueberry exists only in the new, so it is an addition. This is the essence of the familiar - / + diff display.

5. Line-level diff and the Myers algorithm

The naive LCS DP uses O(mn) in both time and memory. For files with thousands or tens of thousands of lines, this tends to become heavy. So in practical diff, algorithms that are fast when the size of the difference (the edit distance) is small are preferred.

A representative one is the Myers algorithm. It solves the problem of finding the "shortest edit script (the minimum number of additions and deletions)" as a shortest-path search on an edit graph. With the amount of difference denoted D (the number of additions plus deletions), it runs in roughly O((m+n)·D), and its strength is being very fast when there are few changes. Git and many other tools adopt algorithms of this family as the default for line-level diff.

Myers and LCS are not different problems but the same problem viewed from different angles. "Maximizing the common part (LCS)" and "minimizing the number of edits (shortest edit script)" are two sides of the same coin, and the optimal solutions coincide. It helps to see Myers as a technique that finds that optimal solution efficiently on real data where the difference is small.

6. Uses — version control, code review, and proofreading

Diff is used widely, from software development to everyday writing.

In every case, the root is the LCS idea of "maximizing the part that can remain common and surfacing only what changed." Knowing the mechanism also makes it clearer why a diff display sometimes is not what you expected (the freedom in how the LCS is chosen, or the difference between line-level and character-level).

Free Tool Compare for real with the Text Diff tool Just paste two texts and additions and deletions are highlighted on the spot. It runs entirely in your browser, and your input is not sent anywhere.

Frequently Asked Questions (FAQ)

What is a diff?

A diff is the process of comparing two texts to show what was added and what was deleted, or the output of that process. Many implementations compute the Longest Common Subsequence (LCS) and, using the parts that remain common to both as a baseline, mark lines or characters that exist only on one side as deletions and those that exist only on the other side as additions. Diffs are used every day in version control and code review.

What is the LCS (Longest Common Subsequence)?

The LCS (Longest Common Subsequence) is the longest sequence of elements that appears in the same order in both sequences, chosen in a form that need not be contiguous. Unlike a substring, which requires contiguity, you may skip over gaps. With dynamic programming it can be computed in an O(mn) table for sequences of length m and n, and diff builds its output on top of this LCS as the common base.

How do line-level and character-level diff differ?

They differ in the smallest unit of comparison. Line-level diff treats one line as a single element and suits text where lines carry meaning, such as source code and configuration files. Character-level diff treats one character as an element and suits small changes within a single line or word-level differences in prose with few line breaks. Many tools use a two-stage approach: compare by line first, then highlight only within changed lines at the character level.

← Back to the Tech Blog list