Reinventing wheels: A better, sliding-matrix diff algorithm
I didn't want to implement the Myers algorithm because:
- I find it hard to understand. I'd need a better understanding to implement it.
- It uses recursion. This is unfit for production. I would need to understand the algorithm even better to fix this.
- It has worst-case performance
O(NM). This is unfit for production. I would need to understand it completely, inside-out, to figure out how to implement a compromise between optimality and performance.
- My capable colleague already attempted all of the above, suggesting a different approach is worthwhile.
I built on the core idea from my simple diff, the unique unit map, to devise a sliding matrix algorithm:
- Instead of a full matrix like above, it calculates a partial matrix along a diagonal, where an optimal solution is likely. The maximum width of the matrix is easily tunable at run-time as a quality/performance tradeoff.
- The most desirable path through the matrix is calculated from the bottom-right corner, working left and up toward the start. Once the top-left corner is reached, the algorithm simply follows the calculated best path.
- Since the sliding matrix has limited width, performance is linear at
O(max(N,M))for large inputs.
- Any units (lines) that appear only in one of the inputs are trivially excluded and are treated as attachments to axis units. The matrix focuses on the real problem, which is how to match up the units that are present in both inputs.
- Calculating the matrix makes it easy to take preferences into account, for example a preference to group additions and removals together, rather than having them alternate. These preferences can be easily tweaked at run-time.
The C++ implementation takes 0.2 seconds to compare two 8 MB inputs, each with 100,000 lines. The corresponding memory requirement is 40 MB. This performance is representative and cannot be substantially worsened by finding special, contrived inputs. Such contrived inputs would instead worsen the quality of the solution found.
I published the implementation under a MIT license as part of Atomic. The following are the most relevant files:AtDiff.h – header
AtDiff.cpp – implementation
AutDiff.cpp – test harness
The implementation does need the rest of Atomic to build as-is, but anyone who wants to can adapt it.Edit: As of November 17, 2019, this now implements the Myers algorithm, which is more effective, more efficient, and more straightforward. The new implementation has no need of the nifty HTML debug output, as well as most of the other toggles and switches. I continue to discuss the previous implementation below.
Example of the algorithm at workCommand line - using toy settings for demonstration:
AtUnitTest.exe diff -s xabcdefghijklmnopqrstuvwxyzb yabcdopqrstghijklmnefuvwxwvutsrqponyza -fmc 400 -smw 20 -pns -dbg diff.htmlThis does the following:
-s("simple") diffs the strings provided on the command line, pretending each letter is a line. Otherwise, the main input parameters would be paths to input files.
-fmc 400("full matrix cells") limits maximum full matrix size to 400 cells (a toy number). This forces use of a sliding matrix despite the input being very small.
-smw 20("sliding matrix width") limits the width of the sliding matrix to 20 cells (a toy number). This forces parts of the matrix to remain uncalculated.
-pns("prefer no split") disables preprocessing that would otherwise split the input on the longest common substring to calculate multiple smaller matrices instead of one large one.
-dbg diff.htmlproduces HTML debug output and saves it to diff.html.
To see how and why the output deteriorates if the sizes of inputs and the number of changes are too large for the sliding matrix, set a smaller sliding matrix width, for example
-smw 10. But keep the parcel splitter disabled using
-pns- if it's enabled, it will cut the input in two parts before calculating two smaller matrices, again resulting in good output.
EpilogueIn the end, this hodge-podge worked fine for complex small inputs and simple large inputs, but it generated decidedly suboptimal output for complex large inputs (e.g. 100,000 lines). The reason is that the longest common substring is simply not a good heuristic to split on; and I couldn't find a good heuristic to guide a sliding matrix instead, either.
Having implemented all of this, I found I had developed an understanding of the problem such that I was no longer discouraged from implementing the Myers algorithm. So I implemented that, and it works great for all inputs. :)