What's wrong with computing?

What's wrong is that we are:
  • Using a bad universal data format.
  • Depending on a universe of tools that make this bad format seem like the best choice.
The bad universal format is the text file. HTML, XML, JSON, and most programming languages are built on it. The universe of tools is the whole range of utilities that create, search, process, edit, compile, compare, and store versions of text files.

We need that universe of tools. But we need them for a better data format.

What's wrong with plain text, then?

It is fundamentally incongruous with the data we store. Almost all data is structured: HTML, XML, JSON, and TOML are all ways to store structured data in text files. Programming languages are structured with complex grammars. Where we use binary formats, almost all of them store structured data. ZIP files, DOC files, PNG files: everything is structured.

The incongruity is in the use of in-band signaling to delineate data. We can signal start and end of data in two ways:
  • Length-prefixed encoding. The data is prefixed with a length field, then the exact specified number of bytes follows. Data content does not need to be escaped and finding the end of the data is trivial.
  • In-band signals. The length of the data is not indicated in advance; instead, the data is terminated by a specific byte or sequence of bytes. If you want to include the terminator sequence as part of the data, it has to be escaped.
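To make the difference concrete, here is a minimal Python sketch of both approaches. The 4-byte big-endian length field, the newline terminator, and the backslash escape are assumptions picked for illustration, as are the function names; they are not taken from any particular format.

```python
import struct
from typing import Tuple

def encode_length_prefixed(data: bytes) -> bytes:
    # 4-byte big-endian length, then the raw bytes. Nothing is escaped.
    return struct.pack(">I", len(data)) + data

def decode_length_prefixed(buf: bytes, offset: int = 0) -> Tuple[bytes, int]:
    # Read the length, then slice out exactly that many bytes.
    (length,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    end = start + length
    if end > len(buf):
        raise ValueError("truncated field")
    return buf[start:end], end

TERMINATOR = b"\n"   # assumed in-band terminator
ESCAPE = b"\\"       # assumed escape byte

def encode_terminated(data: bytes) -> bytes:
    # Escape the escape byte first, then the terminator, then append the terminator.
    escaped = data.replace(ESCAPE, ESCAPE + ESCAPE)
    escaped = escaped.replace(TERMINATOR, ESCAPE + b"n")
    return escaped + TERMINATOR

def decode_terminated(buf: bytes, offset: int = 0) -> Tuple[bytes, int]:
    # Every byte has to be inspected to find the end and undo the escaping.
    out = bytearray()
    i = offset
    while i < len(buf):
        byte = buf[i:i + 1]
        if byte == TERMINATOR:
            return bytes(out), i + 1
        if byte == ESCAPE:
            i += 1
            if i >= len(buf):
                raise ValueError("dangling escape")
            following = buf[i:i + 1]
            if following == b"n":
                out += TERMINATOR
            elif following == ESCAPE:
                out += ESCAPE
            else:
                raise ValueError("unknown escape sequence")
        else:
            out += byte
        i += 1
    raise ValueError("missing terminator")
```

The length-prefixed decoder is a length read and a slice. The terminated decoder is a small state machine, and any encoder that forgets to escape produces data that silently ends early.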
Text files and their associated formats use in-band signals both to terminate lines and to terminate strings. This makes them extremely bad for including data of one format in another format – or even the same format. Examples:
  • The security and usability problems related to including JavaScript within HTML in a <script> tag.
  • Above, I could not write "<script>" directly in the HTML source - I had to write &lt;script&gt;. To show that here, in turn, I had to write &amp;lt;script&amp;gt; – and so on.
  • How do you include binary data in JSON? You base64-encode it, blowing up the size by a factor of 4/3.
  • Security problems related to strings and line termination in HTML, JS and JSON.
  • Ever tried including C++ code within C++ code - as in, a code generator? Or JavaScript within C++ code? Ha ha.
  • In SMTP, email content is terminated by a line containing only a single dot. Any line in the email that begins with a dot must be escaped with an extra leading dot when sent, and unescaped when received.
  • In email stored in the mbox format, any line in the content that begins with "From " must be escaped. This escaping is often not undone, so ">From" is visible to the recipient.
  • ...
I could go on and on with these security and usability problems, all with the same cause: the use of in-band signaling.
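Two of the examples above are easy to reproduce in a few lines of Python. This is only a sketch: the sample data is made up, and `dot_stuff` and `dot_unstuff` are hypothetical helper names that follow the SMTP rule of prepending a dot to any line that already begins with one.

```python
import base64
import json
from typing import List

# Binary data in JSON: base64 turns every 3 bytes into 4 ASCII characters.
blob = bytes(range(256)) * 3                      # 768 arbitrary bytes
encoded = base64.b64encode(blob).decode("ascii")
document = json.dumps({"data": encoded})
restored = base64.b64decode(json.loads(document)["data"])

print(len(blob), len(encoded))                    # 768 1024

# SMTP dot-stuffing: a line holding only "." ends the message, so any
# content line that begins with a dot gets one more dot prepended.
def dot_stuff(lines: List[bytes]) -> List[bytes]:
    return [b"." + line if line.startswith(b".") else line for line in lines]

def dot_unstuff(lines: List[bytes]) -> List[bytes]:
    return [line[1:] if line.startswith(b".") else line for line in lines]

body = [b"Hello,", b".", b"..and goodbye"]
wire = dot_stuff(body)

print(wire)                                       # [b'Hello,', b'..', b'...and goodbye']
```

In both cases the content has to be rewritten just to be carried, and the receiver has to know to rewrite it back.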

A better universal data format would be much like XML or JSON or TOML. These formats are designed for general-purpose structured data, which is what we almost always want to store.

Except: it needs to be binary and use length-prefixed encoding.

Then, we need a universe of tools, as powerful as the tools we have for text files right now, to search, create, process, edit, compile, compare, and store versions of files in this universal data format.

The reason plain text seems "friendly" right now is simply the presence of all those tools. If we can settle on a universal binary format with length-prefixed encoding, and develop the associated tools, the new format and its toolset will be obviously superior and preferable to most everyone. The only problem we have right now is... no tools.

A candidate format could be ASN.1, but ASN.1 over-emphasizes saving every bit possible. This complicates the format so much that decoders are rife with security problems, and the complexity is an obstacle to the development of tools. In comparison, the SSH protocol does not emphasize saving every bit possible, and as a result is very straightforward to decode. For example, a string is a big-endian 32-bit length field followed by the bytes - encoded exactly the way you'd expect.

Perhaps we need something like JSON, encoded like SSH does it.
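As a rough idea of what that could look like, here is a minimal Python sketch. Only the string layout (a big-endian 32-bit length followed by the bytes) comes from SSH. The one-byte type tags, the 8-byte integer and float encodings, and the function names are all assumptions invented for this illustration, not an existing format.

```python
import struct
from typing import Any, Tuple

def _lp(payload: bytes) -> bytes:
    # SSH-style string: uint32 big-endian length, then the raw bytes.
    return struct.pack(">I", len(payload)) + payload

def encode(value: Any) -> bytes:
    # Hypothetical tags: N null, T/F bool, I int, D float, S string,
    # B bytes, L list, M map.
    if value is None:
        return b"N"
    if isinstance(value, bool):      # must precede int: bool is a subclass of int
        return b"T" if value else b"F"
    if isinstance(value, int):
        return b"I" + struct.pack(">q", value)
    if isinstance(value, float):
        return b"D" + struct.pack(">d", value)
    if isinstance(value, str):
        return b"S" + _lp(value.encode("utf-8"))
    if isinstance(value, (bytes, bytearray)):
        return b"B" + _lp(bytes(value))
    if isinstance(value, list):
        return b"L" + _lp(b"".join(encode(v) for v in value))
    if isinstance(value, dict):
        return b"M" + _lp(b"".join(encode(k) + encode(v) for k, v in value.items()))
    raise TypeError(f"cannot encode {type(value).__name__}")

def _read_lp(buf: bytes, pos: int) -> Tuple[bytes, int]:
    if pos + 4 > len(buf):
        raise ValueError("truncated length field")
    (n,) = struct.unpack_from(">I", buf, pos)
    end = pos + 4 + n
    if end > len(buf):
        raise ValueError("truncated payload")
    return buf[pos + 4:end], end

def decode(buf: bytes, pos: int = 0) -> Tuple[Any, int]:
    # Returns the decoded value and the position just past it.
    if pos >= len(buf):
        raise ValueError("truncated value")
    tag = buf[pos:pos + 1]
    pos += 1
    if tag == b"N":
        return None, pos
    if tag in (b"T", b"F"):
        return tag == b"T", pos
    if tag == b"I":
        return struct.unpack_from(">q", buf, pos)[0], pos + 8
    if tag == b"D":
        return struct.unpack_from(">d", buf, pos)[0], pos + 8
    if tag not in (b"S", b"B", b"L", b"M"):
        raise ValueError(f"unknown tag {tag!r}")
    body, end = _read_lp(buf, pos)
    if tag == b"S":
        return body.decode("utf-8"), end
    if tag == b"B":
        return body, end
    items = []
    inner = 0
    while inner < len(body):         # containers hold back-to-back encoded values
        item, inner = decode(body, inner)
        items.append(item)
    if tag == b"L":
        return items, end
    if len(items) % 2:
        raise ValueError("map with a key but no value")
    return dict(zip(items[0::2], items[1::2])), end
```

A string containing "</script>" or a byte string containing newlines and dots goes in unchanged, and a tool that does not care about a nested list or map can skip it by reading one length field.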
