Characters in SI notations

This document discusses the character-level issues of presenting values of physical quantities according to the SI, the international system of units (Système international). For general information on the SI, please refer to the Metric System FAQ. Note especially its item 1.12, What is the correct way of writing metric units?, which also mentions some practical typing methods not discussed here.

Conceptual levels of SI notations

The use of the SI can be considered at different levels, which are defined by different standards, conventions, and other norms:

This document discusses the last but one level, characters, or abstract characters to be more exact. For a presentation of the character concept in the information technology context, please refer to A tutorial on character code issues.

Notes on individual characters

Most characters used in SI notations can easily be identified as abstract characters, or more specifically as Unicode characters. For example, the symbol of the meter, “m”, is apparently the character named Latin small letter m in Unicode, with the code position 6D in hexadecimal, therefore often denoted by U+006D in Unicode contexts. But the following characters need to be considered:

Letterlike symbols

People interested in unit symbols and Unicode are often surprised to find that, for example, the unit “degrees Celsius” has a symbol of its own, U+2103, which presents °C as a single character. Similarly, the Letterlike Symbols block contains U+2109 for degrees Fahrenheit (a completely non-SI unit, of course), U+2127 for siemens, and U+212A for kelvin, among others. Educated people may well think that it is better to use such specific characters, with limited semantics, especially when dealing with documents that might later be read by a text-to-speech converter or otherwise processed by software that could use semantic information about characters. Such characters might also be seen as typographically suitable, since they allow detailed formatting that corresponds to their specific meanings.

But in addition to being poorly supported in most fonts, such characters are inadequate in principle, by Unicode rules. For example, degree Celsius U+2103 is a compatibility equivalent of U+00B0 U+0043 (i.e., the degree sign followed by the letter C). Its existence has little to do with typographic correctness. Rather, it is a matter of compatibility: data containing that character in some non-Unicode encoding can be encoded in Unicode without losing the distinction between that character and the U+00B0 U+0043 pair, should someone wish to retain that distinction. This means that the data can also be converted back to the original encoding, yielding the original data exactly. The character is not recommended for use in new, originally Unicode data.
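The compatibility mapping described above can be inspected directly. The following is a minimal sketch using Python's standard unicodedata module; the constant name DEGREE_CELSIUS is chosen here for illustration only.

```python
import unicodedata

DEGREE_CELSIUS = "\u2103"  # the single-character form discussed above

# The decomposition field of the Unicode Character Database. The
# "<compat>" tag marks a compatibility mapping, not a canonical one.
decomposition = unicodedata.decomposition(DEGREE_CELSIUS)
print(decomposition)  # <compat> 00B0 0043

# NFC leaves a compatibility character alone, so the distinction
# needed for round-trip conversion is preserved.
nfc = unicodedata.normalize("NFC", DEGREE_CELSIUS)

# NFKC applies the compatibility mapping and yields the
# two-character sequence DEGREE SIGN + LATIN CAPITAL LETTER C.
nfkc = unicodedata.normalize("NFKC", DEGREE_CELSIUS)

print([f"U+{ord(c):04X}" for c in nfc])   # ['U+2103']
print([f"U+{ord(c):04X}" for c in nfkc])  # ['U+00B0', 'U+0043']
```

Whether to apply NFKC to stored data is a separate decision, since it discards the distinction that the compatibility character was introduced to keep.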

The Unicode standard says, in chapter Symbols:

Unit Symbols. Several letterlike symbols are used to indicate units. In most cases, however, such as for SI units (Système International), the use of regular letters or other symbols is preferred. U+2113 SCRIPT SMALL L is commonly used as a non-SI symbol for the liter. Official SI usage prefers the regular lowercase letter l.

Three letterlike symbols have been given canonical equivalence to regular letters: U+2126 OHM SIGN, U+212A KELVIN SIGN, and U+212B ANGSTROM SIGN. In all three instances the regular letter should be used. […]

In normal use, it is better to represent degrees Celsius “°C” with a sequence of U+00B0 DEGREE SIGN + U+0043 LATIN CAPITAL LETTER C, rather than U+2103 DEGREE CELSIUS. For searching, treat these two sequences as identical.
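The two recommendations in the quoted passage can be illustrated with a short sketch, again assuming Python's unicodedata module. The helper names search_key and contains are hypothetical, and using NFKC as the search key is one possible way to treat the sequences as identical, not something the standard prescribes in this passage.

```python
import unicodedata

# The three letterlike symbols with canonical equivalents, mapped to
# the regular letters that NFC normalization produces for them.
CANONICAL = {
    "\u2126": "\u03a9",  # OHM SIGN -> GREEK CAPITAL LETTER OMEGA
    "\u212a": "K",       # KELVIN SIGN -> LATIN CAPITAL LETTER K
    "\u212b": "\u00c5",  # ANGSTROM SIGN -> LATIN CAPITAL LETTER A WITH RING ABOVE
}

for sign, letter in CANONICAL.items():
    # Canonical equivalence means that NFC already replaces the sign.
    print(unicodedata.normalize("NFC", sign) == letter)  # True

def search_key(text: str) -> str:
    """Fold compatibility characters such as U+2103 for comparison."""
    return unicodedata.normalize("NFKC", text)

def contains(haystack: str, needle: str) -> bool:
    """Substring search that treats U+2103 and U+00B0 U+0043 alike."""
    return search_key(needle) in search_key(haystack)

print(contains("Water boils at 100 \u2103.", "\u00b0C"))  # True
print(contains("Water boils at 100 \u00b0C.", "\u2103"))  # True
```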

Unfortunately the Unicode standard has wrong information about the symbol for the litre. The official position in the SI system is that both “l” and “L” are allowed, with no expressed preference. In the US, “L” is preferred by national authorities. The ISO 80000-2 standard says that ISO uses lowercase l only.

As regards the question of why the special letterlike characters exist in the first place, a Usenet posting by Markus Kuhn explains:

Old ideographic character sets from East Asia, for example JIS X 0212, contain lots of characters for individual SI units. Design goal of Unicode was to be round-trip compatible with all these characters. This means, it must be possible to convert JIS X 0212 to Unicode and back to JIS X 0212, without any loss of information. As a result, Unicode now contains a lot of nonsense characters that really nobody should be using. The characters that you should use are those in Unicode Normalization Form C. Unfortunately, not too many people have actually read the Unicode standard, which is available from Addison Wesley and is thicker than many telephone books. People know Unicode only from simple-minded selection tables and often pick the completely wrong characters, as these tables do not show the descriptive comments that the standard provides for each character.

To conclude, it is both acceptable and recommended to use normal Latin letters as SI unit symbols, such as “K” for kelvin.
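As a practical aid, a text can be checked for characters from the Letterlike Symbols block (U+2100 to U+214F). The following is only a sketch: the function name find_letterlike is hypothetical, and the NFKC result is offered as a suggestion to review, since not every letterlike symbol is a unit symbol and some have no mapping at all.

```python
import unicodedata

# Code point range of the Letterlike Symbols block.
LETTERLIKE = range(0x2100, 0x2150)

def find_letterlike(text: str) -> list[tuple[int, str, str]]:
    """Return (index, character, suggested replacement) for each
    Letterlike Symbols character found in text.

    The suggestion is the NFKC form of the character. It equals the
    character itself when Unicode defines no mapping for it.
    """
    hits = []
    for index, ch in enumerate(text):
        if ord(ch) in LETTERLIKE:
            hits.append((index, ch, unicodedata.normalize("NFKC", ch)))
    return hits

# A temperature written with KELVIN SIGN instead of the letter K.
for index, ch, suggestion in find_letterlike("300 \u212a"):
    print(index, f"U+{ord(ch):04X}", unicodedata.name(ch), "->", suggestion)
# 4 U+212A KELVIN SIGN -> K
```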