Reinventing wheels: A straightforward, two-pass diff algorithm

For over two decades of writing software, I was scared of writing a diff algorithm. It seemed so formidable, like something an impressively bearded inventor of Unix might do, while the rest of us are doomed to fail and suffer in our cluelessness.

If you want to be discouraged, just look it up! Instead of finding a simple algorithm anyone could implement, you'll find that diffs are a special case of the Longest Common Subsequence problem. You'll learn this is a difficult problem. It's NP-hard. You'll find solutions that are mathematical and arcane and recursive, with horrible worst-case performance. There will be suggestions of wisdom along these lines:
How to put N grains of sugar in coffee:
- If N=0, stop algorithm.
- Otherwise, pick first grain of sugar. Mix it with coffee.
- Repeat for remaining N-1 grains.
When I needed a diff algorithm, on two separate occasions, in two separate decades, I hired two separate people to do it. They studied the same literature, found the same theory about how diffs are super hard, and implemented algorithms that kinda worked on small inputs, with catastrophic worst-case performance. The algorithms took 30+ seconds where WinDiff miraculously takes a split-second to produce perfectly good results, like it was written by a Wizard of Redmond.

We deployed one of these algorithms and the worst-case performance affected our software in practice. So I decided to throw off my self-imposed limits, delve into it, and write a simple and decent diff. I came up with one, and it works.

Since I didn't actually study diff algorithms, I don't know how naive or well-known my solution is, or how it compares to the state of the art. What I know is that it works consistently fast, produces a good diff, and can work with all types of input.

Since it seems hard to find a useful description of a decent algorithm – this is the diff à la denis:
  1. Preliminaries:
    1. Define your unit of input. This is usually a line. The algorithm is the same if the unit is a byte or a token.
    2. If your input unit is a line, avoid doing a lot of copying. In C++, you might convert a whole text file into std::vector<std::string_view>, one entry per line. In other languages, you might have similar ways to represent substrings without copying them.
    3. If you want the algorithm to be able to ignore character case, or to ignore leading and trailing space, then in addition to a view of each line, also store for each line its canonical representation (e.g. lowercase or with whitespace removed). Then do the diff over the canonical representation instead of the source text.
  2. Build a unique unit map:
    1. For each input unit (e.g. a line) in the old and new files, create or update an entry in the unique unit map. In C++, this might be an std::map (not multimap). In .NET, it might be a Dictionary.
    2. The keys of the unique unit map are the values of each unique unit (e.g. a line or its canonization).
    3. The values of the unique unit map contain two ordered sets for each unique unit:
      • The positions in the old file where this unit (line) appears. We will remove them as we go.
      • The positions in the new file where this unit (line) appears. We will remove them as we go.
  3. You'll need tracker objects to track your current position in the old and new files:
    1. A tracker will have information such as your current position in the file (e.g. offset or line number), and the entry in the unique unit map associated with that line.
    2. A tracker needs to support an advance operation which moves it to the next unit. Just before the tracker moves, it needs to update the unique unit map entry associated with the unit it was just at, and remove that unit's position (in either the old or new file, whichever is advancing) from the entry in the unique unit map.
    3. This way, the positions the unique unit map keeps track of are only ever positions we have not yet processed. We don't need old ones, since after building the unique unit map, we only make one pass. We never backtrack.
  4. Initialize one tracker for the old file and one for the new file, each pointing to the first line. Now scan the files:
    1. As long as the unique unit map entry for the current unit in the old file shows that unit (line) does not appear in the new file: add a diff entry indicating that unit is removed, then advance the old file tracker to the next unit.
    2. Check the new file tracker. If it's at end of file, then the old file tracker is also at end. If so, we are done.
    3. Otherwise, check the unique unit map entry for the current unit in the new file. If the entry shows that unit (line) does not appear in the old file: add a diff entry indicating that unit is added, then advance the new file tracker.
    4. Otherwise, check if the new file tracker and the old file tracker point to the same entry in the unique unit map. If so: add a diff entry indicating the unit is unchanged, then advance both the old and new file tracker.
    5. Otherwise, the current unit (line) in the old and new files are different. In this case:
      1. The current unit in the new file also appears in the old file, but at a later position than we're currently at. We'll call this the old file target position. This position can be readily read in the unique unit map.
      2. Also, the current unit in the old file appears in the new file, but at a later position than we're currently at. We'll call this the new file target position. This position can be readily read in the unique unit map.
      3. We can either advance the old file tracker by one unit, adding a diff entry indicating the unit is removed; or we can advance the new file tracker, adding a diff entry indicating the unit is added.
      4. A reasonable heuristic is to advance over the unit that has a greater number of remaining occurrences. If the units are the same in this regard, then advance the file that has the closer target. Note this is the part where we pay for simplicity with often suboptimal results.
    6. Continue the scan loop at 4.a.
Edit – November 5, 2019: Thanks to feedback, greatly simplified step 4.e, removing what was basically a no-op.

That's it. It works, it has consistent (and good) performance, and implementing it fully with testing took me 5 hours.

This has space complexity O((N+M)*log(N+M)) – accounting for overhead of map storage – and time complexity O((N+M)*log(N+M)): two linear passes, the first that does map insertions and the second that does map lookups.

The algorithm is not ideal. For example: in "abcabba => cbabac", it finds the common sequence "abc" instead of "cbba" or "baba". In "xaxcxabc => abcy", it finds the common sequence "ac" instead of "abc". That's due to simplicity in step 4.e.

After implementing this, I found Robert Elder's excellent description of the Myers diff algorithm. If you want ideal diff output, you want to do something like that. But this is where the NP-hard nature of the problem becomes apparent. The complexity of an optimized version of that is O(min(len(a),len(b)) * D), where D is the number of differences. If the diff inputs are directly or indirectly under an attacker's control, they can force a situation where len(a) ~= len(b) ~= D, reducing performance to O(N2) with a potentially large N. This can allow for denial of service. To combat this, you must compromise the ideal output for performance.

Subsequent work

In testing, we found this algorithm to be tolerable, but not perfect. I then devised a better, sliding-matrix algorithm, which worked excellently for small inputs, but I couldn't find a good heuristic to slide the matrix with complex large inputs.

In the end, I learned enough to implement Myers, which turns out to be effective, efficient, and deceptively simple. :)

Comments

Unknown said…
In any way different that well-known algorithm of https://en.wikipedia.org/wiki/Diff which is based on https://en.wikipedia.org/wiki/Edit_distance ???
denis bider said…
Yes, diff actually attempts to solve the longest common subsequence problem. This doesn't. This is a much simpler algorithm, meant for situations where the inputs are potentially under an attacker's control. You might use this algorithm if you care about simplicity and worst-case performance more than finding the ideal output.

Popular posts from this blog

When monospace fonts aren't: The Unicode character width nightmare

Circumcision as an adult, part 1

Circumcision as an adult, part 2