When monospace fonts aren't: The Unicode character width nightmare

Some things haven't changed since the 1970s. Programming is still done in text files; and though we have syntax highlighting and code completion, source code is still best displayed in monospace.

Other aspects of computing also work best with monospace: the Unix shells, PowerShell, the Windows Command Prompt. Email is still sent with a copy in plaintext, which has to be wrapped on a monospace boundary. This persists not least because HTML email is excessively difficult to render securely, and there are user agents that still work better with plaintext.

In all of these situations, the originator has to anticipate in advance how the text will be rendered. You cannot just send text and expect the recipient to flow it. You have to predict the effects of Tab characters correctly, and word wrap the text in advance, often not knowing the software that will be used for display. In terminal emulation, e.g. xterm via SSH, when the server sends the client a character to render, the server and the client need to agree on how many positions to advance the cursor. If they disagree, the whole screen can become corrupted.
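To make the "predict it in advance" step concrete, here is a minimal Python sketch of what a sender might do before emitting plaintext. The function name prepare_plaintext, the 72-column limit and the 8-column tab stops are assumptions chosen for illustration. The code counts one column per code point, which is exactly the assumption the rest of this post calls into question.

```python
import textwrap


def prepare_plaintext(body: str, columns: int = 72, tab_size: int = 8) -> str:
    """Expand tabs and hard-wrap each paragraph before sending.

    Assumes every character occupies exactly one column, which only
    holds for simple Western text.
    """
    paragraphs = body.split("\n\n")
    wrapped = [
        # expandtabs() replaces each tab with spaces up to the next tab stop
        textwrap.fill(p.expandtabs(tab_size), width=columns)
        for p in paragraphs
    ]
    return "\n\n".join(wrapped)


if __name__ == "__main__":
    print(prepare_plaintext("word " * 40 + "\n\n" + "x\ty"))
```

The recipient sees whatever line breaks and padding the sender chose, so any disagreement about character widths shows up as misaligned text.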

As long as you stick to precomposed Unicode characters, and Western scripts, things are relatively straightforward. Whether it's A or Å, S or Š – so long as there are no combining marks, you can count a single Unicode code point as one character width. So the following works:
	aeioucsz
	áéíóúčšž
Nice and neat, right?
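The simple case can be sketched in a few lines of Python. The helper name naive_width is made up for this example, and the approach is only valid under the conditions just described: Western scripts where every base letter plus accent has a precomposed form. Normalizing to NFC first folds a letter followed by a combining mark into a single code point where such a form exists.

```python
import unicodedata


def naive_width(text: str) -> int:
    """Count one column per code point after NFC normalization.

    Only meaningful for Western text whose accented letters all have
    precomposed forms; it knows nothing about wide characters.
    """
    return len(unicodedata.normalize("NFC", text))


if __name__ == "__main__":
    decomposed = "a\u0301e\u0301"  # 'a' + combining acute, 'e' + combining acute
    print(len(decomposed))          # 4 code points before normalization
    print(naive_width(decomposed))  # 2 columns after composing to á and é
    print(naive_width("áéíóúčšž"))  # 8
```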

Unfortunately, problems appear with Asian characters. When displayed in monospace, many Asian characters occupy two character widths. How do we know which ones?

Our problems would be solved if the Unicode standard included this information. Unfortunately – as far as I can tell – the Unicode Consortium takes the stance that display issues are completely the renderer's problem, and makes no effort to include information about monospace character widths. (Edit – incorrect: see update below.)

If you're on Unix, you may have access to wcwidth. However: "This function was removed from the final ISO/IEC 9899:1990/Amendment 1:1995 (E), and the return value for a non-printable wide character is not specified." What this means is that the results of wcwidth are system-specific.

In 2007, Markus Kuhn implemented a generic version of wcwidth, which we now use in the graphical SSH terminal console in Bitvise SSH Client. However, this is more than 8 years old at this point, and is based on Unicode 5.0, whereas the latest version is now 8.0.

So I had the idea that maybe we could "just" extract up-to-date information from Windows. It's 2015, the following should render well, right?
	aeioucsz
	áéíóúčšž
	台北1234		(leading characters should be 2 spaces each)
	abcdefgh
	ＱＲＳ１２		(fullwidth latin; should be 2 spaces each)
	abcdefgh
	ｱｲｳ1234		(halfwidth kana; should be 1 space each)
	abcdefgh
It turns out – no. Perhaps you have an operating system with proper monospace fonts, which displays all of the above lined up. On my Windows 8.1, the problem looks like this:

[Screenshots: IE, Chrome, Firefox, Notepad, VS 2015]

Note how nothing lines up: not in Internet Explorer; not in Chrome; not in Firefox; not in Notepad; not in the latest version of Visual Studio – the environment in which Windows is developed (Edit: apparently not - see comments). Half-width kana are displayed kinda correctly by the Consolas font used in Notepad and Visual Studio; but that's it.

It turns out, when locale is set to English (United States), Windows just doesn't seem to use monospace fonts for Asian characters. Indeed, setting the Windows locale to Chinese (Simplified) produces this:


This is better; but now, the half-width kana are borked. *sigh*

Note that the above isn't a Windows problem only. This is how the same text displays on Android:


It boggles my mind that it's 2015, and we still don't have a single, authoritative answer to this question: how many character positions should each Unicode character occupy in a monospace font?

Discussion

Because I'm providing examples of incorrect character rendering, this may offer the misleading impression that this is just a font problem.

This isn't just a font problem. It's that there's no standard monospace character width information, independent of font used.

The above incorrect renderings involve systems using non-monospace fallback fonts. However:
  • Even if you only have a fallback font that's not mono, you can coerce it into the right character positions if you know the character widths. The above examples could work correctly – although the renderings might be less than perfect – if software knew the intended character widths.
  • Even if you do not have a fallback font, and are just displaying placeholder boxes – you still need to know character widths to render the rest of the text properly, and for Tab characters to work.
Operating systems could work around this problem by providing better font support. We now have terabyte hard drives, so there's no reason all cultures shouldn't be simultaneously supported. However, that still leaves the underlying issue – that we need standardized monospace character widths.
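As a rough illustration of the two points above, here is a Python sketch of how a renderer could place each character on a cell grid once it knows the widths, regardless of which font ends up drawing the glyph. Everything named here (layout_columns, demo_width, the DEMO_WIDE table) is hypothetical; the tiny width table stands in for the standardized data this post is asking for.

```python
from typing import Callable, List, Tuple


def layout_columns(
    line: str,
    width_of: Callable[[str], int],
    tab_size: int = 8,
) -> List[Tuple[str, int]]:
    """Return (character, starting column) pairs for one line of text."""
    cells = []
    col = 0
    for ch in line:
        cells.append((ch, col))
        if ch == "\t":
            # A tab advances to the next multiple of tab_size
            col += tab_size - (col % tab_size)
        else:
            col += width_of(ch)
    return cells


# Hypothetical width table, for illustration only
DEMO_WIDE = {"台", "北"}


def demo_width(ch: str) -> int:
    return 2 if ch in DEMO_WIDE else 1


if __name__ == "__main__":
    for ch, col in layout_columns("台北\tx", demo_width):
        print(repr(ch), col)
```

With these widths, the x after the tab lands in column 8 whether the line starts with two wide characters or with four narrow ones. A renderer that draws each glyph (or a placeholder box) at its computed column keeps the grid intact even with a proportional fallback font.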

Update and additional information

It turns out that Unicode does in fact provide character width information for East-Asian characters. It's just not as neat as one number. When is it ever? :)

The information is in EastAsianWidth.txt, which is part of the Unicode character database. The data provides an East_Asian_Width property, which is explained in Unicode Standard Annex #11 (East Asian Width).
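For a quick look at the property without parsing EastAsianWidth.txt by hand, Python's standard unicodedata module exposes it as east_asian_width(). The sketch below is only a lookup; the helper name east_asian_classes is made up for this example, and the values returned depend on the Unicode version bundled with the Python build in use.

```python
import unicodedata
from typing import List, Tuple


def east_asian_classes(text: str) -> List[Tuple[str, str]]:
    """Pair each code point with its East_Asian_Width property value."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.east_asian_width(ch))
        for ch in text
    ]


if __name__ == "__main__":
    # Latin a, CJK ideograph, fullwidth Q, halfwidth katakana A, division sign
    for code_point, eaw in east_asian_classes("a台\uFF31\uFF71\u00F7"):
        print(code_point, eaw)
    # U+0061 Na
    # U+53F0 W
    # U+FF31 F
    # U+FF71 H
    # U+00F7 A
```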

This is basically what is needed... with some unfortunate limitations:
  • Hundreds of characters are categorized as ambiguous width (property value A). These characters include anything from U+00A1 (inverted exclamation mark, ¡) to U+2010 (hyphen, ‐) to U+FFFD (replacement character, �). Many of these characters (but not all!) have different widths depending on system locale. For example, U+00F7 (division character, ÷) has a width of 1 on Windows under English (United States), but a width of 2 under Chinese (Simplified, China).
  • In some cases, width can differ even between different fonts under the same locale. For example, on Windows under Chinese (Simplified, China), U+FFFD (replacement character) renders as narrow (1 position) with a raster font, and wide (2 positions) as TrueType.
  • Some characters categorized as one width are still displayed as another width by certain systems. For example, U+20A9 (Won sign, ₩) has width property value H (half-width), but is displayed as wide (two positions) by Windows under locale Chinese (Simplified, China). It is displayed as narrow under locale English (United States).
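One way to live with these limitations is to turn the property into a column count and leave the ambiguous class as a caller's choice. The Python sketch below does that under several assumptions of mine: W and F count as two columns, nonspacing and enclosing marks (categories Mn and Me) as zero, everything else as one, and control characters are not handled at all. The names cell_width and text_width are hypothetical. It cannot reproduce the font-specific and system-specific deviations listed above.

```python
import unicodedata


def cell_width(ch: str, ambiguous_wide: bool = False) -> int:
    """Estimate columns for one character from East_Asian_Width.

    ambiguous_wide=True mimics an East Asian locale, where class A
    characters are often rendered two columns wide.
    """
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0  # combining marks add no width of their own
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        return 2
    if eaw == "A" and ambiguous_wide:
        return 2
    return 1  # Na, H, N, and A in a non-East-Asian context


def text_width(text: str, ambiguous_wide: bool = False) -> int:
    return sum(cell_width(ch, ambiguous_wide) for ch in text)


if __name__ == "__main__":
    print(text_width("台北1234"))                        # 8
    print(cell_width("\u00F7"))                          # 1
    print(cell_width("\u00F7", ambiguous_wide=True))     # 2
    print(cell_width("\u20A9", ambiguous_wide=True))     # 1, property value is H
```

The last line shows the third limitation: the Won sign is classified H, so a table-driven width stays at 1 even where a particular system chooses to draw it wide.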
There are also scripts like Devanagari that just don't seem to have a monospace representation. I was unable to get Windows to display Devanagari characters in the console. They do display in Notepad, but they don't obey any kind of monospace font rules at all.

There are other efforts to provide information on character widths, including the utf8proc library that's part of Julia. Interestingly, this library derives its information by extracting it from Unifont. Unifont, in turn, is an impressive open source Unicode font with a huge coverage of characters.

Comments

Pádraig Brady said…
FWIW with any of the 10s of fonts selectable within gnome-terminal on Fedora 22, the alignment is perfect. BTW, do you need 12345 along with the 3 halfwidth kana?
denis bider said…
Yes, there should be 12345 with the 3 halfwidth kana, but I had already taken the screenshots. :)
Conley said…
How to handle this kind of stuff in python:
http://stackoverflow.com/questions/30881811/how-do-you-get-the-display-width-of-combined-unicode-characters-in-python-3
AcidFlask said…
Thanks for the utf8proc shoutout!

A correction about UAX 11: East Asian Widths - it is not definitive with regard to character widths. Section 2 states that "Instead, the guidelines on use of this property should be considered recommendations based on a particular legacy practice that may be overridden by implementations as necessary."

Furthermore, a great many characters have East_Asian_Width property 'N' (Not an East Asian Character) - and so Unicode does not provide any guidance as to how to deal with these cases.

Jiahao Chen, MIT Research Scientist
Dominik Dalek said…
Just for the record - Windows isn't developed in Visual Studio. Same-ish compiler is used but people use whatever editor/IDE they fancy (emacs, vim, many others - rarely VS).
D. Ongs said…
Liberation Sans Mono lines them up except for the halfwidth kana.
microcolonel said…
This seems more like a problem with GDI and/or DirectWrite (as well as how each browser is making use of them), less to do with chrome vs. firefox vs. IE vs. VS.

Chrome on Chrome OS (using FreeType 2) properly aligns the text in that <pre>, as does firefox on GNU/Linux (also using FreeType 2, in addition to graphite). On linux, both of them also render the full-width latin characters with the correct weight and face. Something that Chrome and IE on Windows seem to get wrong.
denis bider said…
Yeah, the renderings are definitely a platform problem. Windows just doesn't seem to have a multicultural, broad-coverage monospace font. Instead, as far as the console is concerned, there is a jumble of fonts with partial coverage, which are and aren't available depending on how the system locale is configured. This is unfortunate, especially given that open source systems appear to have full coverage fonts.
Paul Dal Bianco said…
I have better luck with Consolas in Windows 8.
Chinese characters align in notepad, pfe and html.
