2016-04-20

On the desirable properties of a programming language

For the past 15 years, I've wanted to create a programming language. It is easy enough to do so, if one just wants to create something. But I've wanted to create something that improves in some substantial way on what already exists, and stands the test of time.

10 years ago, I created Flow. Flow is not widely known, but it's the asynchronous architecture that underpins Bitvise SSH Server and SSH Client. It's a message-passing architecture based on small, Erlang-like processes, implemented using fibers in C++. One of the main purposes of Flow was to make it easier to reason about a complex, asynchronous application, and I do believe it succeeded to an extent. Its grander motivation was to make it trivial to write complex applications that scale effortlessly to any number of parallel processors.

I believe that Flow would meet this second goal, but its ability to do so is yet untested. We do not actually need the type of scaling Flow would provide in our software. We use Flow inside traditional thread-based parallelization which works better with the architecture of our SSH Server. I have since written software such as a web server with an integrated database, and that doesn't need scaling on a component granularity either. While Flow has free scaling, it requires a shift in the programmer's perspective. Traditional thread-based approaches require no shift, and tend to work well enough.

I originally wanted to turn Flow into a language. I realize, now, that I could; but this language would have problems inherent in all languages that try to impose some way of thought on the programmer. By imposing this way of thought, the language alienates people to whom it is foreign. Secondly, the intentional restrictions of the language make it difficult to implement designs that people actually want to use. The language then has a choice: either condemn and not help with these patterns, so the language becomes obscure; or smuggle violations of its principal design through a backdoor.

And now, as prime example of backdoor smuggling of design violations, I present:

Haskell


Haskell is known as a pure functional programming language which is actually so well designed that it can run nearly as fast as C++. Or so we're told.

To achieve C-like speeds in Haskell, you have to brutalize the beautiful language by treating it, well, like C.

Exhibit one: Jan Stolarek's article where he presents the following algorithm that takes a list of machine-wide integers, and computes the sum of squares of odd numbers in it:
 
  sumSqrL :: [Int] -> Int
  sumSqrL = sum . map (^2) . filter odd

What a piece of beautiful code. But according to Jan's measurements, it performs 8 times slower than C. He fixes this by reimplementing the algorithm as follows:
 
  sumSqrPOp :: U.Vector Int -> Int
  sumSqrPOp vec = runST $ do
    let add a x = do
          let !(I# v#) = x
              odd#     = v# `andI#` 1#
          return $ a + I# (odd# *# v# *# v#)
    foldM` add 0 vec -- replace ` with ' here

This is not only practically unreadable; it is practically imperative. It looks like it would look in C++, except that C++ is actually designed for this, and doesn't make it look so ugly.

If this is not sufficiently convincing, consider the problem of efficient string concatenation in Haskell. Of course, Haskell can append to a string:
 
  xs `append` x = xs ++ [x]

The only problem is that, when repeated, this has quadratic performance. A string, you see, is represented as a list. So each time you append a single character, Haskell creates a new list; or if not that, at least traverses the whole existing list; and then appends the additional character.

I suggest exploring the Haskell solutions proposed in the above Stack Overflow link. Let me know if you think any of them are clear. In Haskell, string concatenation is an advanced topic.

Managed languages


The main shortcoming of managed languages; with prime examples .NET and Java; is that they require management infrastructure, which can't be written in the managed language. You can't write a garbage collector in a language that assumes there is a garbage collector, and in which you can't access memory.

The way I see it, managed languages are the wrong approach to solving the problem of reliability in software. Imagine there is a surgeon. As part of their job, the surgeon needs to cut people. The surgeon's hand is steady, but when cutting freehand with a scalpel, the surgeon sometimes makes mistakes.

To reduce the number of scalpel mistakes, we invent cutting templates. In the types of surgeries that are most common, the surgeon can now use a cutting template instead of using the scalpel freehand. This improves the safety of most surgeries, and is generally seen as a good thing.

C is like a surgeon without a cutting template. C++ is like a surgeon who has the choice of a cutting template, or a scalpel. Managed languages are like surgeons being treated like they're inept. You only get cutting templates; the scalpel is taken away. If the problem doesn't fit the templates – well – you can try to mangle it until it does.

Managed languages solve memory safety issues that were a problem in the 1990s and 00s, but are mostly gone in modern C++. What managed languages do not do is protect against incompetent programmers, who remain free to implement basic vulnerabilities to SQL injection and cross-site forgeries. Managed languages do not allow you to rely on programmers whom you couldn't trust with C++.

C++


So what's wrong with C++?

Not really that much. This:
 

  // Signed -> Signed

  template <class OutType, class InType,
    std::enable_if_t<(
      std::numeric_limits<InType >::is_signed &&
      std::numeric_limits<OutType>::is_signed &&
      sizeof(InType) > sizeof(OutType)
    ), int> = 0>
  inline OutType NumCast(InType value)
  {
    // Input type is larger
    EnsureThrow(value >= std::numeric_limits<OutType>::min());
    EnsureThrow(value <= std::numeric_limits<OutType>::max());
    return (OutType) value;
  }

  template <class OutType, class InType,
    std::enable_if_t<(
      std::numeric_limits<InType >::is_signed &&
      std::numeric_limits<OutType>::is_signed &&
      sizeof(InType) <= sizeof(OutType)
    ), int> = 0>
  inline OutType NumCast(InType value)
  {
    // Input type is NOT larger
    return (OutType) value;
  }

The above are two out of eight template functions that use SFINAE to implement NumCast - a mechanism we use to ensure at run-time that the value of one type of integer can be safely converted to another.

I show this to emphasize the clunkiness of C++ metaprogramming. The entire language is not so much designed, as it is undesigned. Each individual step in the language's evolution was carefully chosen. But from a bird's eye view, C++ ends up looking like a hack, built upon a hack, built upon a hack. These hacks mostly fit well together; there is actual beauty in their outcome, and satisfaction in using the language. But the above is not an example of intuitive metaprogramming.

Go? Rust?


I like Rust. Rust is smart. It has the obvious feature of rigid single-reference memory safety. I suspect this may be an "overarching idea" problem, and may require smuggling of design violations by the back door.

"Go" seems like a language you design if you want everything simple, and have high developer turnover. It's a systems language for when your developers are from JavaScript, and don't want to learn C++.

Where to go?


The kind of language I think is worthy of creating:
  • Can express any type of program. Not just a garbage collected program, or just a user-mode program; but also a kernel-mode program, a boot loader, or an embedded real-time program.
  • Does not impose limiting paradigms which then need to be broken through a backdoor. No functional Fata Morgana, no object-oriented Fata Morgana, no message-passing Fata Morgana. The language should allow these concepts to be expressed, but not foist them.
  • Has regular structure and strong built-in metaprogramming. This is a lesson that can be learned from LISP (hint hint hint). The main problem with LISP is all the parentheses.
It might not surprise that I have in mind a design for just such a language. :) In a future post, I would like to introduce Seed – a list-based, self-building front-end language, with a simple but powerful basic structure, from which a versatile platform can be built for any purpose. It will first be intended to target the LLVM intermediate representation; but there's no reason it could not also emit .NET or Java bytecode, platform-specific machine code, HTML, PDF, or text files.

Seed is also intended to not have subtly incompatible implementations, which continue to be an issue with C++. The idea is for most of the language to be defined in the language itself. Implementations are then much more likely to be compatible, as they use the same language definition.

And if you don't like the language? Change it. Most of the language will be defined in the language itself, so its definition is under your control.

2016-04-16

"Hello world" in LLVM IR language without CRT

The LLVM Language Reference Manual has a "Hello world" example that uses the C run-time library.

I wanted a bare-bones example, using only the Win32 API, without a CRT dependency. Here it is:
  @.str = private unnamed_addr constant [13 x i8] c"Hello world\0A\00"

  declare i64 @GetStdHandle(i32) nounwind
  declare i32 @WriteFile(i64, i8* nocapture, i32, i32* nocapture, i8*) nounwind

  define i32 @Entry() {
    %hwStr = getelementptr [13 x i8]* @.str, i64 0, i64 0
    %handle = call i64 @GetStdHandle(i32 -11)
    %bytes = alloca i32
    call i32 @WriteFile(i64 %handle, i8* %hwStr, i32 12, i32* %bytes, i8* null)
    ret i32 2
  }
Compile and link steps:
  llc -filetype=obj hw.ll
  link hw.obj kernel32.lib /entry:Entry /subsystem:console
The resulting executable has 2,560 bytes, prints "Hello world", and returns exit code 2.

My main observation is that the language is clumsy. It's meant to be read by humans, but in most cases, generated automatically:
  1. No syntax for hexadecimal integers. If you want to enter a value like 0xC0000005 – good luck! :-)
  2. The types of all values have to be explicitly qualified each time they're used. This is irritating when writing, but may serve as useful documentation when reading.
  3. Interestingly, there's no "address-of" operator. Instead, alloca creates space for a value on the stack, and gives you the address. Neat!

2016-04-12

Over-education and free college

I agree with most of Sanders' policies, much more than any other candidate. However, I cringe somewhat about his views on education, which seem only kinda half right. Sanders wants to give everyone free access to college, which would be better than what we have now. But...

What he doesn't say (or realize?) is – over-education should be discouraged, because it's competition for a positional good. Hanson and Caplan have had a good conversation about this over the years.

Compared to early employment, college is ineffective at conveying useful skills. It's a competition between individuals about who's going to be "better educated". In this competition, the nation can't do better as a whole by throwing more resources into it. You just get a lot of college graduates who then make for reluctant baristas, many having to settle for something "below" their degree.

The current college loans are unambiguously bad. They drive up the cost of participating in this competition that is pointless to begin with, and yet cannot be avoided because it's seen as the herald of everyone's economic success.

Making college free or low-cost can certainly be part of a solution. There will always be competition for this positional good, and it's important that this competition does not favor rich people. So good so far, I think.

But most importantly, people just shouldn't be going to college as much. For most folks, there's nothing to learn there.

And of course, this is unpopular because it hurts people's fetish for being educated. sigh There's so much identity invested into this.