Other aspects of computing work best with monospace, also. The Unix shells; PowerShell; the Windows Command Prompt. Email is still sent with a copy in plaintext, which has to be wrapped on a monospace boundary. Not least, this persists because HTML email is excessively difficult to render securely, and there are user agents that still work better with plaintext.
In all of these situations, the problem presents itself that the originator has to anticipate how text will be rendered in advance. You cannot just send text and expect the recipient to flow it. You have to predict the effects of Tab characters correctly, and word wrap the text in advance, often not knowing the software that will be used for display. In terminal emulation, e.g. xterm via SSH, when the server sends the client a character to render, the server and the client need to agree by how many positions to advance the cursor. If they disagree, the whole screen can become corrupted.
As long as you stick to precomposed Unicode characters, and Western scripts, things are relatively straightforward. Whether it's A or Å, S or Š – so long as there are no combining marks, you can count a single Unicode code point as one character width. So the following works:
aeioucsz áéíóúčšžNice and neat, right?
Unfortunately, problems appear with Asian characters. When displayed in monospace, many Asian characters occupy two character widths. How do we know which ones?
Our problems would be solved if the Unicode standard included this information.
If you're on Unix, you may have access to
wcwidth. However: "This function was removed from the final ISO/IEC 9899:1990/Amendment 1:1995 (E), and the return value for a non-printable wide character is not specified." What this means is that the results of
In 2007, Markus Kuhn implemented a generic version of
wcwidth, which we now use in the graphical SSH terminal console in Bitvise SSH Client. However, this is more than 8 years old at this point, and is based on Unicode 5.0, whereas the current latest version is 8.0.
So I had the idea that maybe we could "just" extract up-to-date information from Windows. It's 2015, the following should render well, right?
aeioucsz áéíóúčšž 台北1234 (leading characters should be 2 spaces each) abcdefgh ＱＲＳ12 (fullwidth latin; should be 2 spaces each) abcdefgh ｱｲｳ1234 (halfwidth kana; should be 1 space each) abcdefghIt turns out – no. Perhaps you have an operating system with proper monospace fonts, which displays all of the above lined up. On my Windows 8.1, the problem looks like this:
Note how nothing lines up: not in Internet Explorer; not in Chrome; not in Firefox; not in Notepad; not in the latest version of Visual Studio –
It turns out, when locale is set to English (United States), Windows just doesn't seem to use monospace fonts for Asian characters. Indeed, setting the Windows locale to Chinese (Simplified) produces this:
This is better; but now, the half-width kana are borked. sigh
Note that the above isn't a Windows problem only. This is how the same text displays on Android:
It boggles my mind that it's 2015, and we still don't have a single, authoritative answer to this question: how many character positions should each Unicode character occupy in a monospace font?
DiscussionBecause I'm providing examples of incorrect character rendering, this may offer the misleading impression that this is just a font problem.
This isn't just a font problem. It's that there's no standard monospace character width information, independent of font used.
The above incorrect renderings involve systems using non-monospace fallback fonts. However:
- Even if you only have a fallback font that's not mono, you can coerce it into the right character positions if you know the character widths. The above examples could work correctly – although the renderings might be less than perfect – if software knew the intended character widths.
- Even if you do not have a fallback font, and are just displaying placeholder boxes – you still need to know character widths to render the rest of the text properly, and for Tab characters to work.
Update and additional informationIt turns out that Unicode does in fact provide character width information for East-Asian characters. It's just not as neat as one number. When is it ever? :)
The information is in
EastAsianWidth.txt, which is part of the Unicode character database. The data provides an
East_Asian_Widthproperty, which is explained in this technical report.
This is basically what is needed... with some unfortunate limitations:
- Hundreds of characters are categorized as ambiguous width (property value
A). These characters include anything from U+00A1 (inverted exclamation mark, ¡) to U+2010 (hyphen, ‐) to U+FFFD (replacement character, �). Many of these characters (but not all!) have different widths depending on system locale. For example, U+00F7 (division character, ÷) has a width of 1 on Windows under English (United States), but a width of 2 under Chinese (Simplified, China).
- In some cases, width can differ even between different fonts under the same locale. For example, on Windows under Chinese (Simplified, China), U+FFFD (replacement character) renders as narrow (1 position) with a raster font, and wide (2 positions) as TrueType.
- Some characters categorized as one width are still displayed as another width by certain systems. For example, U+20A9 (Won sign, ₩) has width property value H (half-width), but is displayed as wide (two positions) by Windows under locale Chinese (Simplified, China). It is displayed as narrow under locale English (United States).
There are other efforts to provide information on character widths, including the
utf8proclibrary that's part of Julia. Interestingly, this library derives its information by extracting it from Unifont. Unifont, in turn, is an impressive open source Unicode font with a huge coverage of characters.