Reinventing wheels: A better, sliding-matrix diff algorithm

A week ago, I posted about a simple diff algorithm which works, produces tolerable results, but is ultimately a toy. I wanted an algorithm that produces excellent results while keeping consistent worst-case performance.

I didn't want to implement the Myers algorithm because:
  • I find it hard to understand. I'd need a better understanding to implement it.
  • It uses recursion. This is unfit for production. I would need to understand the algorithm even better to fix this.
  • It has worst-case performance O(NM). This is unfit for production. I would need to understand it completely, inside-out, to figure out how to implement a compromise between optimality and performance.
  • My capable colleague already attempted all of the above, suggesting a different approach is worthwhile.
I was inspired by this excellent visualization from Robert Elder's page about the Myers diff:


I built on the core idea from my simple diff, the unique unit map, to devise a sliding matrix algorithm:
  • Instead of a full matrix like above, it calculates a partial matrix along a diagonal, where an optimal solution is likely. The maximum width of the matrix is easily tunable at run-time as a quality/performance tradeoff.
  • The most desirable path through the matrix is calculated from the bottom-right corner, working left and up toward the start. Once the top-left corner is reached, the algorithm simply follows the calculated best path.
  • Since the sliding matrix has limited width, performance is linear at O(max(N,M)) for large inputs.
  • Any units (lines) that appear only in one of the inputs are trivially excluded and are treated as attachments to axis units. The matrix focuses on the real problem, which is how to match up the units that are present in both inputs.
  • Calculating the matrix makes it easy to take preferences into account, for example a preference to group additions and removals together, rather than having them alternate. These preferences can be easily tweaked at run-time.

The C++ implementation takes 0.2 seconds to compare two 8 MB inputs, each with 100,000 lines. The corresponding memory requirement is 40 MB. This performance is representative and cannot be substantially worsened by finding special, contrived inputs. Such contrived inputs would instead worsen the quality of the solution found.

I published the implementation under a MIT license as part of Atomic. The following are the most relevant files:

AtDiff.h – header
AtDiff.cpp – implementation
AutDiff.cpp – test harness

The implementation does need the rest of Atomic to build as-is, but anyone who wants to can adapt it.

Edit: As of November 17, 2019, this now implements the Myers algorithm, which is more effective, more efficient, and more straightforward. The new implementation has no need of the nifty HTML debug output, as well as most of the other toggles and switches. I continue to discuss the previous implementation below.

Example of the algorithm at work

Command line - using toy settings for demonstration:
AtUnitTest.exe diff -s xabcdefghijklmnopqrstuvwxyzb
    yabcdopqrstghijklmnefuvwxwvutsrqponyza
    -fmc 400 -smw 20 -pns -dbg diff.html
This does the following:
  • -s ("simple") diffs the strings provided on the command line, pretending each letter is a line. Otherwise, the main input parameters would be paths to input files.
  • -fmc 400 ("full matrix cells") limits maximum full matrix size to 400 cells (a toy number). This forces use of a sliding matrix despite the input being very small.
  • -smw 20 ("sliding matrix width") limits the width of the sliding matrix to 20 cells (a toy number). This forces parts of the matrix to remain uncalculated.
  • -pns ("prefer no split") disables preprocessing that would otherwise split the input on the longest common substring to calculate multiple smaller matrices instead of one large one.
  • -dbg diff.html produces HTML debug output and saves it to diff.html.
Screenshot of HTML debug output - click to enlarge:


To see how and why the output deteriorates if the sizes of inputs and the number of changes are too large for the sliding matrix, set a smaller sliding matrix width, for example -smw 10. But keep the parcel splitter disabled using -pns - if it's enabled, it will cut the input in two parts before calculating two smaller matrices, again resulting in good output.

Epilogue

In the end, this hodge-podge worked fine for complex small inputs and simple large inputs, but it generated decidedly suboptimal output for complex large inputs (e.g. 100,000 lines). The reason is that the longest common substring is simply not a good heuristic to split on; and I couldn't find a good heuristic to guide a sliding matrix instead, either.

Having implemented all of this, I found I had developed an understanding of the problem such that I was no longer discouraged from implementing the Myers algorithm. So I implemented that, and it works great for all inputs. :)

Comments

Boris Kolar said…
Clever. The flaws of this algoritm is when you have two files with long differences at the beginning and the end, which will prevent finding potentially nice diff in the middle.

How about finding the longest common substring(s) (which is O(N*log(N)) with some clever tricks) and then recursively applying the algorithm to remaining segments?
denis bider said…
> The flaws of this algoritm is when you have two files with long differences at the beginning and the end

It actually works surprisingly well in many such circumstances: if the long differences are additions or removals such that the same lines appear only in the source, or only in the destination, they are trivial to ignore (in the post: "Any units (lines) that appear only in one of the inputs are trivially excluded and are treated as attachments to axis units.").

The real problem arises if N lines are duplicated, or M lines are deduplicated (but not completely removed), such that |N-M| > (sliding matrix width) / 2. In that case the diff output is suboptimal because it doesn't find a common run that would require going outside the calculated part of the matrix.

> How about finding the longest common substring(s) (which is O(N*log(N)) with some clever tricks)

What is the algorithm for doing that in O(N*log(N))? As far as I know, there aren't better solutions than O(N*D), where D is the number of differences. But the number of differences can be equal to N, which makes the worst case performance O(N^2).

> and then recursively applying the algorithm to remaining segments?

Avoiding recursion, or controlling it, is a big reason for designing this algorithm. :) Our previous algorithm was based on Myers and used recursion, but then it either had O(N^2) performance on certain common inputs, or it could overflow the stack on certain other inputs. Performance needs to be consistent and predictable.
denis bider said…
> How about finding the longest common substring

Oh, substring! Not subsequence. Yeah, I'm doing that. See below:


> The real problem arises if N lines are duplicated, or M lines are deduplicated

Also and more importantly, when groups of lines are transposed (e.g. moving a function from near the bottom to near the top), such that the cumulative length of transposed sequences exceeds (maximum width ) / 2.

As a result, the sliding matrix width has to be fairly large - e.g. 1000 to accommodate up to 500 transposed lines. The performance is acceptable (e.g. 1 second to diff two files of 100,000 lines).

Nevertheless, it's a waste to calculate the full matrix for a large input that contains a handful of small changes; especially if this is frequent and you have to do the calculation every time. For this reason I'm working on some additional tricks. One is to detect a long common substring on which the input can be split (I believe that's your suggestion). Another is to move the fixed-width spotlight in the matrix in a smarter way.
denis bider said…
With regard to longest common substring:

Wow, I found this great explanation of the Ukkonen algorithm to build a suffix tree. This looks like it would solve the longest common substring problem more generally than I'm doing right now, but it does seem to be complex.

Right now I'm finding the longest common substring with a simple trick: I already have the unique unit map, so I can just look for what I call "keystone units", i.e. lines that appear exactly once in both input and output. It's trivial to find common substrings that contain such lines. Unfortunately this doesn't generalize: if you simply duplicate every line in the old and new inputs, there are no longer any "keystone units" and the algorithm doesn't work. But that's not a typical scenario I expect.
denis bider said…
Epilogue:

In the end, this hodge-podge worked fine for complex small inputs and simple large inputs, but it generated decidedly suboptimal output for complex large inputs (e.g. 100,000 lines). The reason is that the longest common substring is simply not a good heuristic to split on; and I couldn't find a good heuristic to guide a sliding matrix instead, either.

Having implemented all of this, I found I had developed an understanding of the problem such that I was no longer discouraged from implementing the Myers algorithm. So I implemented that, and now it works for all inputs. :)

Popular posts from this blog

When monospace fonts aren't: The Unicode character width nightmare

Circumcision as an adult, part 1

Circumcision as an adult, part 2