Text Diff Algorithms: How Diff Tools Actually Work

Why Diff Matters

Every time you open a pull request, push a commit, or resolve a merge conflict, you are reading diff output. The colored lines — green for additions, red for deletions — shape how reviewers understand your changes. A good diff makes a one-line bug fix obvious. A bad diff makes a simple refactor look like you rewrote the entire file. The algorithm behind that output determines which interpretation your reviewers see.

Version control systems depend entirely on diff. Git does not store full copies of every file version — it stores the differences between versions and reconstructs files on demand. When you run git blame, the system traces diffs backward to attribute each line. When you run git merge, it applies diffs from one branch onto another. The quality of these operations depends directly on how accurately the diff algorithm identifies what actually changed.

Beyond code, diff algorithms power collaboration tools everywhere. Google Docs tracks changes between document versions. Wikipedia shows edit histories as diffs. Legal contract review tools highlight modifications between draft versions. Even DNA sequence alignment uses a variant of the same algorithms. Understanding how diff works gives you insight into all these systems and helps you write content that produces cleaner, more readable change histories.

The Longest Common Subsequence Problem

At its core, finding the difference between two texts is a Longest Common Subsequence (LCS) problem. Given two sequences, find the longest subsequence present in both. A subsequence does not need to be contiguous — it just needs to preserve the relative order. Once you find the LCS, everything not in it is either an insertion or a deletion. The diff output is essentially a visualization of what falls outside the LCS.

Computing the LCS exactly is expensive. The classic dynamic programming solution requires O(n*m) time and space, where n and m are the lengths of the two sequences. For two files with 10,000 lines each, that is 100 million cells in the comparison matrix. This works but gets slow for large files. Early diff tools used this approach directly, but modern ones optimize it heavily or use heuristics to avoid computing the full matrix.

The key insight that makes practical diff tools possible: most changes in real files are small relative to the file size. A 10,000-line file with a 5-line edit shares 9,995 lines in common. The algorithm does not need to solve the general LCS problem — it just needs to find the minimal edit script that transforms one version into the other. This reframing from "find what is the same" to "find the smallest set of changes" is what makes Myers and similar algorithms efficient in practice.

One important subtlety: the LCS is often not unique. Two files might have multiple longest common subsequences of the same length but with different alignments. This means there are multiple valid diffs for the same pair of inputs, and different algorithms may produce different (but equally correct) output. The choice of which valid diff to show is an aesthetic decision, not a mathematical one — and it matters for readability.

Myers Diff Algorithm: The Standard

Eugene Myers published "An O(ND) Difference Algorithm" in 1986, and it became the default algorithm for Unix diff, Git, and most modern diff tools. The key idea: instead of computing the full O(n*m) matrix, Myers only explores paths proportional to the number of differences (D) between the files. When files are similar (small D), the algorithm runs in nearly linear time. It only approaches quadratic time when the files are completely different.

Myers works by searching for the shortest edit script — the minimum number of insertions and deletions needed to transform one file into the other. It explores the edit graph (a grid where horizontal moves are deletions, vertical moves are insertions, and diagonal moves are matches) using a breadth-first approach. Each "round" finds all paths with exactly k edits, starting from k=0 and incrementing until a path reaches the end of both files.

The algorithm uses a clever trick called "snaking" — after each edit, it follows diagonal moves (matches) as far as possible for free. This means equal lines never count toward the edit distance. In practice, most file comparisons have long runs of identical lines with small clusters of changes, so the snake step does most of the work and the algorithm terminates quickly.

Git uses Myers as its default diff algorithm (git diff --diff-algorithm=myers). The output minimizes the total number of changed lines, which usually produces the most compact diff. However, minimal does not always mean readable — sometimes a slightly longer diff that preserves the logical structure of the code is easier to review. This is where alternative algorithms come in.

Patience Diff and Histogram Diff

Patience diff, invented by Bram Cohen (the creator of BitTorrent), takes a different approach. Instead of minimizing edit distance, it prioritizes anchoring on unique lines — lines that appear exactly once in each file. The idea is that unique lines are reliable landmarks (function signatures, class declarations) while repeated lines (empty lines, closing braces) are not. By building the alignment around landmarks first, patience diff often produces output that better matches the logical structure of the code.

The algorithm works in three steps. First, find all lines that appear exactly once in both files and match them. Second, use these matches as fixed anchor points and recursively solve the LCS problem in the gaps between anchors. Third, produce the diff output. Because unique lines tend to be semantically meaningful (a function header is unique, a closing brace is not), the alignment usually respects code structure even when Myers would not.

Histogram diff is a performance optimization of patience diff used in Git (git diff --diff-algorithm=histogram). It extends the patience approach to handle low-occurrence lines (not just unique ones) by counting line frequencies with a histogram. Lines that occur rarely in both files become candidate anchors, ordered by frequency. This handles real-world code better — where truly unique lines are rare but low-frequency lines (appearing 2-3 times) are plentiful.

When should you switch algorithms? If your diffs show function bodies "sliding" — where a deletion and insertion of similar functions causes the diff to misalign the opening and closing braces — try patience or histogram. You can set your default in Git with git config --global diff.algorithm histogram. Many developers find histogram produces more readable output for code changes, while Myers is better for prose or data files where structural landmarks are less clear.

Line-Level vs Character-Level vs Word-Level Diffs

Traditional diff operates on lines. Each line is treated as an atomic unit — either the whole line matches or the whole line is marked as changed. This works well for code where one logical change usually affects a full line. But for prose, renaming a single variable across a long line, or comparing JSON with reformatted whitespace, line-level diff produces noisy output that highlights entire lines when only a few characters changed.

Character-level diff treats each character as a unit of comparison. It can pinpoint the exact characters that were modified within a line. Most diff viewers use a two-stage approach: first compute the line-level diff to identify changed lines, then run a character-level diff within those lines to highlight the specific changes. This gives you both the big picture (which lines changed) and the fine detail (what exactly changed on each line).

Word-level diff splits text on word boundaries (spaces and punctuation) and compares word by word. This is particularly useful for prose, documentation, and natural language text where changes involve rewording sentences rather than changing individual characters. Git supports this with git diff --word-diff, which shows changes inline with {+added+} and [-removed-] markers instead of full line replacements.

The choice of granularity affects both readability and performance. Character-level diff on large files is expensive because the comparison units are much smaller (thousands of characters vs hundreds of lines). Word-level diff is a middle ground that works well for text but poorly for code (where identifiers are single "words" and whitespace is significant). Most modern diff viewers default to line-level with character-level highlighting within changed lines — a practical compromise.

Semantic Diffs and Structural Awareness

Standard diff tools treat files as sequences of text lines. They have no understanding of programming language syntax, so moving a function to a different position in the file looks like deleting it in one place and adding it in another — even though the function itself did not change at all. Semantic diff tools parse the AST (Abstract Syntax Tree) and compare program structure rather than text, which means they can detect moves, renames, and formatting changes as distinct from logical modifications.

Tools like difftastic, GumTree, and SemanticMerge parse both file versions into ASTs and then compute the minimum tree edit distance. This catches things text diff cannot: a function was moved but not changed, a variable was renamed throughout a method, parentheses were added without changing behavior, or code was reformatted. Each of these would show as noise in a text diff but as a single named operation in a semantic diff.

The tradeoff is speed and generality. Text diff works on any file — log files, configuration, prose, code in any language. Semantic diff requires a parser for each language and fails on files with syntax errors. It is also slower because parsing and tree comparison are more expensive than line comparison. For daily code review, most teams stick with text diff tools that have good heuristics (patience/histogram algorithm, syntax-aware highlighting) rather than full semantic analysis.

A practical middle ground gaining popularity: syntax-aware highlighting in traditional diffs. Tools like delta, diff-so-fancy, and GitHub's diff view parse the language to color the diff output with syntax highlighting, making it easier to read without actually changing the comparison algorithm. Some tools also detect that a change is "only whitespace" or "only a move" and annotate it, giving some semantic information without the cost of full AST comparison.

// Simple line-level diff implementation (Myers-inspired)
// Returns operations to transform 'original' into 'modified'

type DiffOp =
  | { type: 'equal'; line: string }
  | { type: 'insert'; line: string }
  | { type: 'delete'; line: string };

function lineDiff(original: string[], modified: string[]): DiffOp[] {
  const n = original.length;
  const m = modified.length;
  
  // Build the edit graph using dynamic programming
  const dp: number[][] = Array.from({ length: n + 1 }, () =>
    Array(m + 1).fill(0)
  );

  // Fill LCS table
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      if (original[i - 1] === modified[j - 1]) {
        dp[i][j] = dp[i - 1][j - 1] + 1;
      } else {
        dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
      }
    }
  }

  // Backtrack to produce diff operations
  const ops: DiffOp[] = [];
  let i = n, j = m;

  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 && original[i - 1] === modified[j - 1]) {
      ops.unshift({ type: 'equal', line: original[i - 1] });
      i--; j--;
    } else if (j > 0 && (i === 0 || dp[i][j - 1] >= dp[i - 1][j])) {
      ops.unshift({ type: 'insert', line: modified[j - 1] });
      j--;
    } else {
      ops.unshift({ type: 'delete', line: original[i - 1] });
      i--;
    }
  }

  return ops;
}

// Usage example
const before = ['function hello() {', '  console.log("hi");', '}'];
const after = ['function hello() {', '  console.log("hello world");', '}'];
const diff = lineDiff(before, after);
// Result: equal, delete "console.log("hi")", insert "console.log("hello world")", equal

Tips for Writing Diff-Friendly Code

Put each item on its own line when working with lists, arrays, imports, and function parameters. A trailing comma on the last item means adding a new entry only shows one green line instead of modifying the existing last line (to add a comma) plus adding the new one. This applies to array literals, object properties, function arguments, and import statements. Most formatters (Prettier, Black) support trailing commas for exactly this reason.

Keep lines short and focused on one statement. Long lines that combine multiple operations produce diffs where the entire line is highlighted even if only one part changed. If a line has a condition, an assignment, and a function call, any change to one part marks the whole line as modified. Splitting into multiple lines means only the relevant line lights up in the diff.

Avoid reformatting unrelated code in the same commit. If you change indentation across an entire file while also fixing a bug, the diff becomes nearly unreadable — every line appears changed. Use separate commits for formatting changes and logical changes. Many teams enforce this with commit hooks: run the formatter first and commit, then make your logical change and commit.

Use meaningful variable and function names that are stable across refactors. If you rename a function that is called in 50 places, the diff for each calling file will show a change. Diff tools cannot distinguish "the name changed but the behavior is identical" from "this is a completely different function." Stable, well-chosen names from the start reduce the noise of rename-heavy refactors in your version history.