2015-11-15

The advantages of Seq, and the demerits of std::string const&

A few weeks ago, I was reviewing some code, and found something similar to this:
 
  std::string host = ...;

  int port = 23;
  size_t pos = host.find(":");
  if (pos != host.npos)
  {
      port = atoi(host.substr(pos + 1).c_str());
      host = host.substr(0, pos);
  }

Looks reasonable enough. This code parses a parameter in “host:port” format. It's perfectly decent where it appears. It will run rarely, once per process invocation, so performance is irrelevant. It’s an acceptable way to achieve what’s intended.

But suppose this code was in a tight inner loop. Suppose it was part of a parser that needs to digest hundreds of megabytes of data, and performance is relevant.

In that case, this code does two suboptimal things:
  • A heap allocation to copy the “port” portion of the string into. Why not just read the number from the original string?
  • Another heap allocation to copy the “host” portion of the string. Again, why not just read from the original string?

Seq as an improvement over std::string const&

There are two purposes for which C++ programmers commonly use strings:
  1. To store and own character content. This is what string and wstring do.
  2. To pass character content without passing ownership. This is what string const& does.
I argue that passing string const& is almost always a mistake. It chains the string provider in ways that aren't necessary for the consumer to read the string. All you really need is to pass a pointer and a length. You need a lightweight Seq object.

A Seq object, essentially, is this:
 
  struct Seq
  {
      byte const* p { nullptr };
      size_t      n { 0 };
  };

In practice, a useful Seq implementation will also contain numerous methods that a user can use to read from the Seq. My implementation has these, among others:
 
  struct Seq
  {
      ...
      uint   ReadByte             (...)
      uint   ReadHexEncodedByte   (...) 
      uint   ReadUtf8Char         (...) 
      Seq    ReadBytes            (...) 
      Seq    ReadUtf8_MaxBytes    (...) 
      Seq    ReadUtf8_MaxChars    (...) 
      Seq    ReadToByte           (...) 
      Seq    ReadToFirstOf        (...) 
      Seq    ReadToFirstOfType    (...) 
      Seq    ReadToFirstNotOf     (...) 
      Seq    ReadToFirstNotOfType (...)
      Seq    ReadToString         (...) 
      Seq    ReadLeadingNewLine   (...) 
      uint64 ReadNrUInt64         (...) 
      int64  ReadNrSInt64         (...) 
      uint32 ReadNrUInt32         (...) 
      uint16 ReadNrUInt16         (...) 
      byte   ReadNrByte           (...) 
      uint64 ReadNrUInt64Dec      (...) 
      int64  ReadNrSInt64Dec      (...) 
      uint32 ReadNrUInt32Dec      (...) 
      uint16 ReadNrUInt16Dec      (...) 
      byte   ReadNrByteDec        (...) 
      double ReadDouble           (...) 
      Time   ReadIsoStyleTimeStr  (...)
      ...
  };

You get the idea. All the basic primitives you'd need to read character content belong in Seq.

The basic benefit of Seq is that it's lightweight, containing only a pointer and a length, and can point not just to a whole string, but also a substring. It does not require unnecessary functionality, like a whole string object, just to pass a sequence of characters without ownership.

A secondary, but even more central benefit is that it serves as a focal point for a powerful set of string reading methods that leverage each other, allowing for both elegant and efficient string reading.

Using Seq, the earlier "host:port" example can be rewritten like this:
 
  Seq hostPort = ...;

  Seq host = hostPort.ReadToByte(:);
  uint32 port = 23;
  if (hostPort.n)
      port = hostPort.Drop(1).ReadNrUInt32Dec();

This is not more complex than the string version. Yet this version does its task without unnecessary heap allocations, and would be much more efficient if implemented where performance matters.

So, it's like std::string_view?

The proposed C++ extension std::string_view implements a similar concept. Main differences:
  • std::string_view is mainly the lightweight reference. It lacks a powerful library of string reading methods. Seq, as in the example above, shows emphasis on stream-like reading. Read methods consume part of Seq and return the part that was read as another Seq. A fully useful Seq implementation covers the basic primitives of string reading in an elegant way.
  • std::string_view is an std::long_inconvenient_name. However, this is understandable given a standard library designed by dark warlocks whose mystical powers derive from conjuring, and causing the world to use, long inconvenient names. :)
I emphasize the use of Seq as a default for string passing and reading, not special case. This is encouraged by giving it a practical name, and building a library of string reading methods around it.

std::string_view could do the same, but it needs more power than just remove_prefix and remove_suffix.

8 comments:

Paul Beckingham said...

I like your article, I'm working with something very similar. I'm curious how, for example, your Seq::ReadNrUInt32 method behaves if there is not a UInt32 to read?

denis bider said...

My implementation of the Read integer functions returns the maximum value (e.g. UINT32_MAX) if the sequence does not start with a number. This is useful for casual string reading where strong validation is not required, or where the string has been pre-validated by a parser (so you know there is a number there).

If strong input validation is required, I use full-fledged parser machinery. This also uses Seq internally - but I don't use e.g. ReadNrUInt32Dec() to validate.

Marco said...

Hi there, nice article! Just a note: array_view and string_view have been renamed to span and string_span.

John Melas said...

Nice article! I am a huge fan of boost::string_ref and the upcoming std::string_view.
Another improvement on the initial code is
replace
host = host.substr(0, pos);
with
host.resize(pos);

denis bider said...

Marco - "span" sounds like a good name. It would still be nice to see just span and wspan for commonly used character spans.

The name string_span seems misleading. It suggests either:

- a span that requires an underlying string object (which it does not, it can be a span of characters from a string literal, or from a vector, or from manually managed storage;

- or maybe, even, a span of string objects.

In HTML, the tag "span" is used for inline character content. It seems to me it would be quite intuitive to use the word for this purpose in C++.

The array span is also a useful concept that I can easily imagine using. It seems that also needs a practical name. Maybe that could be a seq or a sequence.

Or maybe the array span can stay span, and string_span / wstring_span could be changed to cspan and wcspan for wchar_t.

Arne Mertz said...

The additional functionality could well be implemented as free functions couldn't it? And for the shorter names we can always resort to aliases/typedefs ;-)

denis bider said...

My main problem with free functions, as supported by the language currently, is the trouble with chaining. If you want to chain them, you basically need the object to have something like this:

struct Seq
{
template <class F, typename... Args>
Seq& Fun(F f, Args&&... args)
{ f(*this, std::forward(args)...); return *this; }
};

Now you can do:

Seq(str).Fun(ReadThis, arg1).Fun(ReadThat, arg2);

but that is less readable than:

Seq(str).ReadThis(arg1).ReadThat(arg2);

There is a proposal by Bjarne Stroustrup to unify f(x,a) and x.f(a), so that the two are equivalent and interchangeable. If that were implemented, the free function chaining problem would be solved.

Unfortunately, at the last standards meeting (search for the keyword "Unified"), only part of the proposal was adopted: f(x,a) finds x.f(a), but not the other way around. This avoids solving the free function chaining problem... For now. :-|

denis bider said...

Gah, I hate this commenting interface. :) I meant to include this link to Bjarne's report from the last meeting.