How to Compare Two Texts
How text comparison and diffing work, line-by-line vs word-by-word modes, and when to use each type of diff.
What is a text diff?
A text diff is a compact representation of what changed between two versions of a text. Rather than showing both versions in full, a diff highlights only the parts that differ — what was added, what was removed, and what stayed the same. The unchanged portions are shown as context, giving you enough surrounding text to understand where each change sits.
The concept originated with the Unix diff utility, created in 1974. It became the backbone of version control: Git stores each commit as a snapshot of the project, but presents it to you as a diff against its parent commit. Today, text diffing appears in code review tools, collaborative documents, legal redlining workflows, content management systems, and countless other places where tracking what changed matters.
UtilYard has two complementary diff tools: Diff Checker for line-by-line comparison (the classic Git-style view) and Text Diff for word-by-word comparison (more like Track Changes in a document editor). Which one to use depends on what you're comparing.
Line-by-line vs word-by-word
These two modes answer slightly different questions.
Line-by-line diffing treats each line as an atomic unit. A line is either present in both versions (unchanged), present only in the old version (removed), or present only in the new version (added). This is how git diff works, and how most code review interfaces show changes. It's excellent for structured text — source code, configuration files, CSV data, log files — where lines have inherent meaning and the structure of the file matters as much as the content.
The downside of line-by-line diffing is that it can be misleading for prose. If a paragraph is edited to fix one word, the entire line (or multiple lines, if the paragraph was rewrapped) shows as removed and re-added. The signal-to-noise ratio is poor when the actual change was three characters in the middle of a sentence.
Word-by-word diffing (sometimes called inline diff) breaks the text into individual words and compares those. This is analogous to "Track Changes" in Microsoft Word or the "Suggesting" mode in Google Docs. A word-by-word diff shows you exactly which words were deleted and which were inserted, without flagging the surrounding unchanged words. For prose (blog posts, documentation, legal copy, marketing text) this mode is dramatically more readable.
The right choice depends on your content: use line-by-line for code and structured data, word-by-word for natural language and documents.
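The difference between the two modes comes down to how the text is split into tokens before comparison. The sketch below illustrates this with Python's standard difflib module; the helper name diff_tokens is made up for this example, and difflib's SequenceMatcher uses its own matching heuristic rather than the algorithm of any particular diff tool, so treat it as an illustration only.

```python
import difflib

def diff_tokens(old_tokens, new_tokens):
    """Return (tag, token) pairs: ' ' unchanged, '-' removed, '+' added."""
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    out = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend((" ", tok) for tok in old_tokens[i1:i2])
        else:
            # 'replace', 'delete' and 'insert' all reduce to removals plus additions
            out.extend(("-", tok) for tok in old_tokens[i1:i2])
            out.extend(("+", tok) for tok in new_tokens[j1:j2])
    return out

old = "The quick brown fox jumps over the lazy dog"
new = "The quick red fox jumps over the lazy dog"

# Line mode: each line is one token, so the whole sentence is flagged.
line_view = diff_tokens(old.splitlines(), new.splitlines())

# Word mode: each word is one token, so only 'brown' -> 'red' is flagged.
word_view = diff_tokens(old.split(), new.split())

for tag, tok in word_view:
    print(tag, tok)
```

The same function serves both modes; only the tokenisation (splitlines versus split) changes.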
How diff algorithms work
The engine behind most diff tools is the Longest Common Subsequence (LCS) algorithm. The intuition is straightforward: given two sequences (of lines or words), find the longest subsequence of elements that appear in both, in the same order. Everything in the LCS is "unchanged." Everything in the first sequence that isn't in the LCS was removed. Everything in the second sequence that isn't in the LCS was added.
Consider a tiny example comparing two word sequences:
Old:     The quick brown fox jumps over the lazy dog
New:     The quick red fox leaps over the sleeping dog
LCS:     The quick fox over the dog
Removed: brown, jumps, lazy
Added:   red, leaps, sleeping
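The example above can be reproduced with a textbook dynamic-programming LCS. This is a minimal sketch for illustration, not the implementation used by any specific tool; the function names lcs and leftover are chosen for this example.

```python
def lcs(a, b):
    """Return one longest common subsequence of sequences a and b."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to recover one LCS.
    i = j = 0
    common = []
    while i < n and j < m:
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return common

def leftover(seq, common):
    """Items of seq that remain after matching common in order."""
    rest, k = [], 0
    for item in seq:
        if k < len(common) and item == common[k]:
            k += 1
        else:
            rest.append(item)
    return rest

old = "The quick brown fox jumps over the lazy dog".split()
new = "The quick red fox leaps over the sleeping dog".split()

common = lcs(old, new)
removed = leftover(old, common)   # in old but not in the LCS
added = leftover(new, common)     # in new but not in the LCS
print(common, removed, added)
```

The table costs time and memory proportional to the product of the two lengths, which is one reason practical tools prefer more economical algorithms.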
The original Unix diff used an LCS-based algorithm. In practice, most modern tools use Myers' diff algorithm, published in 1986, which is equivalent to LCS but expressed as finding the shortest edit script — the smallest number of additions and deletions needed to transform one sequence into the other. Git uses a variation of Myers' algorithm by default, with an optional patience diff or histogram diff mode that can produce cleaner output for certain types of code changes.
One subtlety: the LCS is not always unique. Different valid LCS solutions produce different diffs, which is why two diff tools can compare the same inputs and produce output that looks different even though both are technically correct. Myers' algorithm tends to produce diffs that humans find natural and easy to read, which is one reason it became the standard.
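To make the "shortest edit script" idea concrete, here is a sketch of the greedy forward search at the core of Myers' algorithm. It returns only the length of the shortest edit script (the number of additions plus deletions), not the script itself, and it is a simplified illustration rather than Git's implementation; the function name myers_distance is an assumption made for this example.

```python
def myers_distance(a, b):
    """Length of the shortest edit script (insertions + deletions) from a to b."""
    n, m = len(a), len(b)
    # furthest[k] = furthest x reached so far on diagonal k, where k = x - y
    furthest = {1: 0}
    for d in range(n + m + 1):            # d = number of edits tried so far
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and furthest[k - 1] < furthest[k + 1]):
                x = furthest[k + 1]       # step down: insert b[y]
            else:
                x = furthest[k - 1] + 1   # step right: delete a[x]
            y = x - k
            # Follow the "snake": free diagonal moves over matching items.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            furthest[k] = x
            if x >= n and y >= m:
                return d
    return n + m

old = "The quick brown fox jumps over the lazy dog".split()
new = "The quick red fox leaps over the sleeping dog".split()
print(myers_distance(old, new))
```

For the fox example the result equals the three removals plus three additions listed earlier, which matches the relationship between the two views: edits = len(old) + len(new) - 2 × len(LCS).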
Reading a diff
Diff output follows a consistent visual language regardless of the tool:
- —Unchanged lines — shown without any prefix or highlighting. These are the context lines that help you orient yourself within the document.
- —Removed content — typically shown in red, often with a minus (−) marker. This content existed in the original text and is absent from the new version.
- —Added content — typically shown in green, often with a plus (+) marker. This content is new in the revised version and was not in the original.
In a line-by-line diff, a "modification" to a line always appears as a removal followed by an addition — there is no "changed" concept at the line level. This means a single-character fix to a long line shows the entire old line as red and the entire new line as green. Word-by-word diffs improve on this by highlighting only the changed tokens within the line, leaving the surrounding words in their neutral colour.
In unified diff format (used by Git and the terminal diff command), changes are grouped into hunks — blocks of nearby changes with a few lines of surrounding context. The hunk header looks like @@ -12,7 +12,9 @@, telling you which line in the old file the hunk starts at and how many lines it spans, and the equivalent for the new file.
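As an illustration of how the hunk header is structured, the following sketch parses it with a regular expression. The function name parse_hunk_header is made up for this example. In unified format the line count may be omitted when a hunk covers exactly one line, so the sketch treats a missing count as 1.

```python
import re

# @@ -old_start[,old_count] +new_start[,new_count] @@ optional section text
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")

def parse_hunk_header(line):
    """Return (old_start, old_count, new_start, new_count) for a hunk header."""
    match = HUNK_HEADER.match(line)
    if match is None:
        raise ValueError(f"not a hunk header: {line!r}")
    old_start, old_count, new_start, new_count = match.groups()
    return (
        int(old_start),
        int(old_count) if old_count is not None else 1,  # count omitted means 1
        int(new_start),
        int(new_count) if new_count is not None else 1,
    )

print(parse_hunk_header("@@ -12,7 +12,9 @@"))
```

For the header shown above, the hunk starts at line 12 in both files, covers 7 lines of the old file and 9 lines of the new one, so the change added a net two lines.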
Common use cases
Text diffing is useful well beyond software development:
- —Comparing document versions — when a client returns an edited contract or brief without tracked changes enabled, a word-by-word diff immediately shows you every word that changed. This is much faster than reading both documents in parallel.
- —Reviewing edited copy — copyeditors, content managers, and translators can use a diff to verify that revisions touched only the intended passages and didn't accidentally alter other sections of a long document.
- —Checking configuration files — comparing a server's live config against a known-good baseline, or comparing staging vs production settings, is a common ops task where a line-by-line diff is invaluable.
- —Spotting accidental changes — a diff quickly reveals whether a "find and replace" changed things it shouldn't have, or whether a template substitution left placeholder text intact somewhere.
- —Verifying data exports — comparing two exports of the same dataset from different dates (or different systems) shows exactly which rows were added, removed, or modified between them.
Frequently asked questions
- Which diff mode should I use for source code?
- Line-by-line. Code has meaningful line structure — a single line usually corresponds to a statement, declaration, or clause — so line-level granularity maps well to how programmers think about changes. Word-by-word can be useful for reviewing a specific changed line in detail, but as a primary view for code it tends to produce noisy output because code tokens are short and numerous.
- Which diff mode should I use for prose or documents?
- Word-by-word. Natural language sentences span entire lines, and most edits change only a few words within a sentence. A line-by-line diff of prose will show entire paragraphs as removed and re-added for what was actually a one-word substitution. Word-by-word diffing gives you the same clarity that Track Changes provides in a word processor.
- Why do two diff tools sometimes show different results for the same input?
- When there are multiple valid ways to describe the differences between two texts, different algorithms make different choices about which common subsequence to use as the baseline. Algorithms that guarantee a minimal diff report the same total number of additions and deletions, but which lines are labelled as added vs removed can vary; heuristic modes such as patience or histogram diff may accept a slightly larger diff in exchange for a more readable one. This is most noticeable when large blocks of text are moved rather than just edited: most diff algorithms show a move as a deletion in one place and an insertion in another, and only some tools add move detection on top.
- What is a context line in a diff?
- Context lines are unchanged lines that appear around a changed region to help you understand where in the document the change occurs. The default in Git is three context lines on each side of a change. You can increase this with git diff -U10 (10 lines of context) or suppress it entirely with git diff -U0. In web-based diff tools, context is usually shown automatically.
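The effect of the context setting can be seen in a small sketch using Python's standard difflib.unified_diff, whose n parameter plays the same role as Git's -U option. The sample data and the helper name context_line_count are assumptions made for this example.

```python
import difflib

# A 20-line file with a single edit on line 10.
old = [f"line {i}" for i in range(1, 21)]
new = list(old)
new[9] = "line 10 (edited)"

def context_line_count(n):
    """Count the context lines in a unified diff with n lines of context."""
    diff = difflib.unified_diff(
        old, new, fromfile="old", tofile="new", n=n, lineterm=""
    )
    # In unified format, context lines start with a single space.
    return sum(1 for line in diff if line.startswith(" "))

for n in (0, 3, 10):
    print(n, context_line_count(n))
```

With n=3 the hunk carries three unchanged lines on each side of the edit; with n=0 it carries none; with n=10 the context is capped by the start and end of the file.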