When monospace fonts aren't: The Unicode character width nightmare

Some things haven't changed since the 1970s. Programming is still done in text files; and though we have syntax highlighting and code completion, source code is still best displayed in monospace.

Other aspects of computing also work best with monospace: the Unix shells, PowerShell, the Windows Command Prompt. Email is still sent with a copy in plaintext, which has to be wrapped on a monospace boundary. This persists not least because HTML email is excessively difficult to render securely, and there are user agents that still work better with plaintext.

In all of these situations, the originator has to anticipate in advance how the text will be rendered. You cannot just send text and expect the recipient to flow it. You have to predict the effects of Tab characters correctly, and word wrap the text in advance, often without knowing what software will be used for display. In terminal emulation, e.g. xterm via SSH, when the server sends the client a character to render, the server and the client need to agree on how many positions to advance the cursor. If they disagree, the whole screen can become corrupted.

As long as you stick to precomposed Unicode characters, and Western scripts, things are relatively straightforward. Whether it's A or Å, S or Š – so long as there are no combining marks, you can count a single Unicode code point as one character width. So the following works:
Nice and neat, right?
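
As a small illustration of the precomposed case, here is a minimal Python sketch. The helper name naive_width is my own, and the approach (normalize to NFC, then count code points) is only an assumption that holds for Western text where every combining sequence has a precomposed form.

```python
import unicodedata

def naive_width(text: str) -> int:
    """Count one column per code point after NFC normalization.

    Only valid for scripts where every character is one column wide
    and every combining sequence has a precomposed equivalent.
    """
    return len(unicodedata.normalize("NFC", text))

precomposed = "\u00c5"   # Å as a single code point
decomposed = "A\u030a"   # A followed by COMBINING RING ABOVE

# The raw code point counts differ, even though both display as one character.
print(len(precomposed), len(decomposed))
# After normalization, both count as one column.
print(naive_width(precomposed), naive_width(decomposed))
```

Without the normalization step, the decomposed form would be counted as two columns, which is exactly the kind of disagreement that corrupts a terminal screen.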

Unfortunately, problems appear with Asian characters. When displayed in monospace, many Asian characters occupy two character widths. How do we know which ones?

Our problems would be solved if the Unicode standard included this information. Unfortunately – as far as I can tell – the Unicode Consortium takes the stance that display issues are completely the renderer's problem, and makes no effort to include information about monospace character widths. (Edit – incorrect: see update below.)

If you're on Unix, you may have access to wcwidth. However: "This function was removed from the final ISO/IEC 9899:1990/Amendment 1:1995 (E), and the return value for a non-printable wide character is not specified." What this means is that the results of wcwidth are system-specific.

In 2007, Markus Kuhn implemented a generic version of wcwidth, which we now use in the graphical SSH terminal console in Bitvise SSH Client. However, this is more than 8 years old at this point, and is based on Unicode 5.0, whereas the current latest version is 8.0.

So I had the idea that maybe we could "just" extract up-to-date information from Windows. It's 2015, the following should render well, right?
	台北1234		(leading characters should be 2 spaces each)
	ＱＲＳ12		(fullwidth latin; should be 2 spaces each)
	ｱｲｳ1234		(halfwidth kana; should be 1 space each)
It turns out – no. Perhaps you have an operating system with proper monospace fonts, which displays all of the above lined up. On my Windows 8.1, the problem looks like this:

(Screenshots: IE, Chrome, Firefox, Notepad, VS 2015)

Note how nothing lines up: not in Internet Explorer; not in Chrome; not in Firefox; not in Notepad; not in the latest version of Visual Studio – the environment in which Windows is developed (Edit: apparently not - see comments). Half-width kana are displayed kinda correctly by the Consolas font used in Notepad and Visual Studio; but that's it.

It turns out, when locale is set to English (United States), Windows just doesn't seem to use monospace fonts for Asian characters. Indeed, setting the Windows locale to Chinese (Simplified) produces this:

This is better; but now, the half-width kana are borked. sigh

Note that the above isn't a Windows problem only. This is how the same text displays on Android:

It boggles my mind that it's 2015, and we still don't have a single, authoritative answer to this question: how many character positions should each Unicode character occupy in a monospace font?


Because I'm providing examples of incorrect character rendering, this may offer the misleading impression that this is just a font problem.

This isn't just a font problem. It's that there's no standard monospace character width information, independent of font used.

The above incorrect renderings involve systems using non-monospace fallback fonts. However:
  • Even if you only have a fallback font that's not mono, you can coerce it into the right character positions if you know the character widths. The above examples could work correctly – although the renderings might be less than perfect – if software knew the intended character widths.
  • Even if you do not have a fallback font, and are just displaying placeholder boxes – you still need to know character widths to render the rest of the text properly, and for Tab characters to work.
Operating systems could work around this problem by providing better font support. We now have terabyte hard drives, so there's no reason all cultures shouldn't be simultaneously supported. However, that still leaves the underlying issue – that we need standardized monospace character widths.
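
To make the point about Tab characters concrete, here is a hedged Python sketch of tab expansion that tracks display columns instead of code points. Both expand_tabs and toy_width are hypothetical names of my own; toy_width in particular is a stand-in width table covering one CJK block, not real width data.

```python
from typing import Callable

def expand_tabs(line: str, width_of: Callable[[str], int], tab_size: int = 8) -> str:
    """Replace tabs with spaces, tracking display columns rather than code points."""
    out = []
    column = 0
    for ch in line:
        if ch == "\t":
            # Pad up to the next tab stop, based on the display column.
            pad = tab_size - (column % tab_size)
            out.append(" " * pad)
            column += pad
        else:
            out.append(ch)
            column += width_of(ch)
    return "".join(out)

def toy_width(ch: str) -> int:
    """Hypothetical width table: CJK Unified Ideographs are 2 columns, all else 1."""
    return 2 if "\u4e00" <= ch <= "\u9fff" else 1

# Width-aware expansion: the two ideographs fill 4 columns, so 4 spaces of padding.
print(repr(expand_tabs("\u53f0\u5317\tx", toy_width)))
# Python's built-in expandtabs counts code points, so it pads with 6 spaces.
print(repr("\u53f0\u5317\tx".expandtabs(8)))
```

The two results differ by two columns, which is the misalignment a reader would see if the sender and the renderer disagreed about the width of those two characters.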

Update and additional information

It turns out that Unicode does in fact provide character width information for East-Asian characters. It's just not as neat as one number. When is it ever? :)

The information is in EastAsianWidth.txt, which is part of the Unicode character database. The data provides an East_Asian_Width property, which is explained in this technical report.
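
As a rough illustration, Python's standard unicodedata module exposes this property through east_asian_width(). The sample characters below are my own picks, and the values come from whatever Unicode version the running Python ships with, which may not be the latest.

```python
import unicodedata

# Sample characters and the East_Asian_Width value each one carries.
samples = [
    "A",        # Na (narrow)
    "\u53f0",   # W  (wide): CJK ideograph
    "\uff31",   # F  (fullwidth): FULLWIDTH LATIN CAPITAL LETTER Q
    "\uff71",   # H  (halfwidth): HALFWIDTH KATAKANA LETTER A
    "\u00f7",   # A  (ambiguous): DIVISION SIGN
    "\u20a9",   # H  (halfwidth): WON SIGN
]

for ch in samples:
    prop = unicodedata.east_asian_width(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {prop}")
```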

This is basically what is needed... with some unfortunate limitations:
  • Hundreds of characters are categorized as ambiguous width (property value A). These characters include anything from U+00A1 (inverted exclamation mark, ¡) to U+2010 (hyphen, ‐) to U+FFFD (replacement character, �). Many of these characters (but not all!) have different widths depending on system locale. For example, U+00F7 (division character, ÷) has a width of 1 on Windows under English (United States), but a width of 2 under Chinese (Simplified, China).
  • In some cases, width can differ even between different fonts under the same locale. For example, on Windows under Chinese (Simplified, China), U+FFFD (replacement character) renders as narrow (1 position) with a raster font, and wide (2 positions) as TrueType.
  • Some characters categorized as one width are still displayed as another width by certain systems. For example, U+20A9 (Won sign, ₩) has width property value H (half-width), but is displayed as wide (two positions) by Windows under locale Chinese (Simplified, China). It is displayed as narrow under locale English (United States).
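
One way to turn the property into column counts, while leaving the ambiguous category up to the caller, might look like the Python sketch below. The names char_width and text_width and the ambiguous_wide flag are my own assumptions. Treating combining marks as zero width is a simplification, and the sketch ignores control characters and other zero-width cases.

```python
import unicodedata

def char_width(ch: str, ambiguous_wide: bool = False) -> int:
    """Estimate the monospace width of one code point from East_Asian_Width."""
    if unicodedata.combining(ch):
        return 0                      # combining marks do not advance the cursor
    prop = unicodedata.east_asian_width(ch)
    if prop in ("W", "F"):
        return 2                      # wide and fullwidth
    if prop == "A":
        return 2 if ambiguous_wide else 1   # caller picks the locale convention
    return 1                          # Na, H, N

def text_width(text: str, ambiguous_wide: bool = False) -> int:
    """Sum the estimated widths of all code points in the text."""
    return sum(char_width(ch, ambiguous_wide) for ch in text)

# The three test lines from earlier in the post, written with escapes.
print(text_width("\u53f0\u53171234"))          # two ideographs + 1234
print(text_width("\uff31\uff32\uff3312"))      # fullwidth QRS + 12
print(text_width("\uff71\uff72\uff731234"))    # halfwidth kana + 1234
# The division sign under the two conventions for ambiguous width.
print(text_width("\u00f7"), text_width("\u00f7", ambiguous_wide=True))
```

Note that this only reproduces what the data says. It cannot account for the cases above where a system displays a character at a different width than its property value suggests.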
There are also scripts like Devanagari that just don't seem to have a monospace representation. I was unable to get Windows to display Devanagari characters in the console. They do display in Notepad, but they don't obey any kind of monospace font rules at all.

There are other efforts to provide information on character widths, including the utf8proc library that's part of Julia. Interestingly, this library derives its information by extracting it from Unifont. Unifont, in turn, is an impressive open source Unicode font with a huge coverage of characters.


Pádraig Brady said…
FWIW with any of the 10s of fonts selectable within gnome-terminal on Fedora 22, the alignment is perfect. BTW, do you need 12345 along with the 3 halfwidth kana?
denis bider said…
Yes, there should be 12345 with the 3 halfwidth kana, but I had already taken the screenshots. :)
Conley said…
How to handle this kind of stuff in python:
AcidFlask said…
Thanks for the utf8proc shoutout!

A correction about UAX 11: East Asian Widths - it is not definitive with regard to character widths. Section 2 states that "Instead, the guidelines on use of this property should be considered recommendations based on a particular legacy practice that may be overridden by implementations as necessary."

Furthermore, a great many characters have East_Asian_Width property 'N' (Not an East Asian Character) - and so Unicode does not provide any guidance as to how to deal with these cases.

Jiahao Chen, MIT Research Scientist
Dominik Dalek said…
Just for the record - Windows isn't developed in Visual Studio. Same-ish compiler is used but people use whatever editor/IDE they fancy (emacs, vim, many others - rarely VS).
D. Ongs said…
Liberation Sans Mono lines them up except for the halfwidth kana.
microcolonel said…
This seems more like a problem with GDI and/or DirectWrite (as well as how each browser is making use of them), and less to do with chrome vs. firefox vs. IE vs. VS.

Chrome on Chrome OS (using FreeType 2) properly aligns the text in that <pre>, as does firefox on GNU/Linux (also using FreeType 2, in addition to graphite). On linux, both of them also render the full-width latin characters with the correct weight and face. Something that Chrome and IE on Windows seem to get wrong.
denis bider said…
Yeah, the renderings are definitely a platform problem. Windows just doesn't seem to have a multicultural, broad-coverage monospace font. Instead, as far as the console is concerned, there is a jumble of fonts with partial coverage, which are and aren't available depending on how the system locale is configured. This is unfortunate, especially given that open source systems appear to have full coverage fonts.
Unknown said…
I have better luck with Consolas in Windows 8.
Chinese characters align in notepad, pfe and html.
Rajiv said…
A rational justification for monospace fonts would be appreciated.
denis bider said…
Rajiv: a number of existing technologies were created under the assumption that characters encoded in sequence can be displayed in sequence, and that when encoded they will occupy either 0, 1, or 2 character widths. These technologies have the advantage that they're simple and they work. They are in widespread use everywhere in the world. They are in use in a variety of legacy systems, but they are essential in computer administration, and in that role I don't expect they are going away. They are certainly not going to go away just to accommodate scripts without monospace representations.

Scripts that are incompatible with monospace representations are fundamentally incompatible with the above described technologies. This is usually not a problem because there is little intersection between the use cases for those scripts, and the use cases for monospace technologies.

However, for those scripts that can be represented in monospace, it helps if fonts are available that display them that way, so that they can be used with monospace technologies.
denis bider said…
The above is a rational argument from a tolerant perspective. I can also make an argument from an intolerant perspective.

The intolerant argument is that the diversity of languages and scripts that exist in the world is shit. It increases the oppressiveness of geopolitical borders and is the main obstacle that prevents ideas crossing them. It prevents communication, protects harmful idiosyncrasies and local fiefdoms, and enables the most harmful cognitive bias in the world – the in-group/out-group dynamic – because our cultures are separated by language and script, and make us foreign to each other in the world.

Ideally, there would be one language, one script, and all the rest should go into a museum and never be touched again. It is variety for the sake of variety, and its effects are economically, socially, and politically evil.
amn said…
I don't see why Unicode consortium should bother with defining how many English character widths a Japanese glyph should occupy. That makes no sense to me.
denis bider said…
For the basic reason that standards exist, which is to coordinate. Coordination solves problems in a way radically more effective than solutions that do not involve coordination. If you were unaware of the problems regarding display width, I outlined some of them in the post.

You appear to be an uninsightful commenter who does not bother reading, so I do not welcome further discussion.
amn said…
Appearances can be deceiving. You appear to know a lot about Unicode, but who am I to judge -- I certainly don't know as much as you do. But coordination does not equal becoming an authority on how many trees there are in a cloud. Only a subset of Unicode makes sense in source code. For one, how will you even display top-to-bottom text in a source code file that's otherwise comprised of pretty much ASCII? Where do you render the comments? Does it make sense, or is it even useful, that a certain foreign glyph is as wide as every letter rendered with a particular monospace font? No, it doesn't. You are trying to draw a comparison between apples and oranges and blaming Unicode for not telling you how many apples equal an orange and vice-versa, which is indeed the underlying reason for you being unable to implement proper multilingual text rendering in your terminal emulator.

You either have to concede to the notion that *character*-based terminal emulators may be unfit for displaying text in multiple languages simultaneously, or you need to understand that it's not the job of Unicode to mandate widths and heights of the characters, the notion loses their meaning, if you ask me. It makes sense for a terminal displaying source code, but that's not Unicode's job, although I agree that their purpose is to coordinate. I am not saying multilingual text rendering layout cannot be standardized, but that's not Unicode's job!

Good day. I don't mean to offend you, but I don't know you and I don't agree with you. If you are looking for emotional support, feel free to withhold my seemingly aggressive comment from your blog, what can I tell you. I don't have a habit of sprinkling niceties into these kinds of discussions. I don't know why I would. I state my opinions, you could have stopped with yours without asking me to leave. This is the Internet, not a nightclub.

Also, I have read your post, how do you think I came here? To insult you?
denis bider said…
Unicode has taken upon itself to standardize smileys like this one: 🤦🏻‍♂️

That is a Unicode character. If Unicode has taken upon itself to standardize this, then it could certainly standardize character widths for scripts where it makes sense. It doesn't make sense for Devanagari, but it certainly makes sense for Asian scripts which have been used in monospace terminals for decades.

Again, I do not appreciate further discussion, not because I'm offended but because it's stupid.
Unknown said…
Experimental monospaced font for Devanagari script: https://github.com/monotty/fonts
