Characters in SI notations
This document discusses the character-level issues of
presenting values of physical quantities according to the SI,
the International System of Units.
For general information on the SI, please refer to the
Metric System FAQ. Note especially its item
1.12, What is the correct way of writing metric units?,
which also mentions some practical typing methods not discussed here.
Conceptual levels of SI notations
The use of the SI can be considered at different levels, which
are defined by different standards, conventions, and other norms:
- physical definitions of units, by the
BIPM, established by an international convention;
the definitions are often complicated
in order to be exact; and they need to name the units somehow,
but the different language-dependent names are not defined in this context; example: “The meter is the length of the path travelled by light in vacuum during a time interval of
1/299 792 458 of a second.”
- names of units, such as
“metre” (British English),
“meter” (US English),
etc.; these are defined by various language authorities,
or just by common usage in a language community
- symbols of units, such as “m” for the meter;
these symbols, too, are defined by the BIPM,
and intended for international use as such;
however, in some cultures, otherwise applying the SI,
language-dependent abbreviations are used instead,
such as кг for kilogram in Russian
- use of prefixes
for multiples and submultiples of units, such as
“km”, written as “kilometre” in British English,
for 1 000 m;
these too are defined by BIPM, but other norms, such as national
standards, have added further recommendations, such as the recommendation
to avoid the prefix “h” (“hecto-” in English),
except perhaps for special use; similarly to units, the prefixes are supposed
to have an internationally standardized, language-independent symbol and
language-dependent names (generally sharing a common origin)
- expressions of quantities using a numeric value and a unit,
perhaps with a prefix,
such as “1,5 km” or “1.5 km”,
depending on language, or maybe e.g.
“1.5 × 10³ m”;
this too is defined by the BIPM, with
additional recommendations from other sources
- the exact identification of characters used
to write the expressions; since the BIPM and other
definitions generally do not identify characters except by showing them,
this is a somewhat grey area
- typography, such as the width of a space used to separate
a number from a unit, or the use of a particular font to render a
character like “m”, such as
Times New Roman “m”
or Arial “m”;
this is generally not standardized but left to typographers,
except that there is a strong recommendation
to use “upright” letters and not an italic font.
This document discusses the last but one level, characters, or
abstract characters to be more exact. For a presentation
of the character concept in the information technology context,
please refer to A tutorial
on character code issues.
Notes on individual characters
Most characters used in SI notations can easily be
identified as abstract characters, or more specifically as
Unicode characters. For example, the symbol of the meter,
“m”, is apparently the character named
Latin small letter m in Unicode,
with the code position 6D in hexadecimal, therefore often denoted
by U+006D in Unicode contexts. But the following characters need
to be considered:
- The multiplication symbols, which are
used in numeric expressions like the alternative notations
“1.5 · 10³” and “1.5 × 10³”.
They might be identified with the Unicode characters
middle dot (U+00B7) and
multiplication sign (U+00D7).
The former is also used in symbols for compound units such as
“N · m” (newton metre;
alternatively written as
“N m”).
However, it can be argued that
middle dot is
a punctuation character and that the dot used for multiplication
(called “half-high dot” in the ISO 31-0 standard)
should be identified with
dot operator (U+22C5), which is classified
as a mathematical operator.
This would mean a notation like N ⋅ m.
A practical argument in favor of this is that
the representative glyph for
dot operator in the Unicode
code chart is a larger dot than that of the
middle dot, hence more
noticeable and more suitable for use as an operator.
And in the Arial
Unicode MS font –
one of the few fonts that has a fairly good repertoire
of mathematical symbols – the situation is the same and
dot operator
is at a somewhat higher position. It is positioned in
a way that corresponds better to the notion of a multiplication
operator. You might see this, if your system has Arial Unicode MS
installed, by viewing the following in that font in large size:
letter x, middle dot, letter x, dot operator, and letter x again (x·x⋅x).
The ISO 80000-2 standard now unambiguously identifies
the dot used in multiplication as
dot operator,
even though it calls it by other names as well.
- The division symbol used for constructing
derived units like “m/s” (metres per second)
is most logically identified with the division slash (U+2215).
However, this character is not present in most fonts, so it is normal to
use the Ascii
solidus (U+002F), or slash, character as surrogate.
In theory, division slash
would be preferable, since it has a more exact meaning.
- The minus sign used before a number
(in an exponent, too),
is logically to be identified with the
minus sign (U+2212).
However, this character does not belong to ISO Latin 1 or
even Windows Latin 1, so
it might be
a reasonable compromise to replace it by the
en dash (U+2013),
which is more widely supported, or with the
hyphen-minus (U+002D), which has effectively universal
support.
A problem with these is that
Unicode line breaking rules permit a line break after
these characters. This creates the risk of having the sign
appear at the end of a line and the number at the start of the next line.
(This should not happen for the real
minus sign.)
There are various ways to try to avoid this problem, e.g. using
nobr markup in
HTML authoring. It has been suggested that the
nonbreaking hyphen character
(U+2011)
could be used too, but e.g. on Internet Explorer it also prevents line
breaks before it (even after a space), which is usually
not desirable. Using the hyphen-minus has the additional problem that
it is typically rather short and does not really look like
a good old minus sign.
- The space between a numeric value
and a unit (or between unit symbols when multiplication of units
is indicated in this less satisfactory way). It is difficult to say
how the space is to be interpreted in Unicode, considering the
multitude of space characters in Unicode.
Presumably any space character, excluding those with zero width, is
acceptable. Using the
no-break space (U+00A0) character would help
in preventing undesired line breaks between the number and the unit.
Using the thin space
(U+2009) character would help in making the space narrower than
a normal space between words. The problem is that these two
cannot be combined in a single Unicode character, in the present
repertoire of Unicode. There are different possible approaches:
- Use thin space with
a word joiner (WJ) character
(U+2060) before and after it to prevent line breaks.
This is both clumsy and unreliable, since the
word joiner character
is poorly supported by existing software.
- Use no-break space with
formatting suggestions that try to reduce the width of that
character. For example, in HTML or XML authoring you could
set the word-spacing property in CSS to a negative value.
- Use either no-break space or
thin space, and ignore the rest
of the problem, or deal with it manually.
- The exponents used in some numeric values
(such as “1.5×10³”) as well as
in many compound unit symbols (such as
“s⁻¹”). The numbers 2 and 3 as exponents can
be easily represented using the characters for them,
superscript two (U+00B2) and
superscript three (U+00B3).
Unicode also contains other digits and the minus sign as superscripts,
but these characters have very limited support in programs and
fonts. Hence, it is better to use the tools of text processing systems
or other methods (such as
sup markup in HTML) for
superscripting them. For typographic reasons, it is best to
represent all superscripts that way if you need anything other
than just 2 or 3. Otherwise the visual difference in superscripting of
e.g. 2 and −1 is too disturbing.
- The symbol of the micro prefix,
corresponding to multiplication by 10⁻⁶.
An obvious candidate is the
micro sign (U+00B5), µ, which is widely
available in fonts.
However, Unicode defines
micro sign as a
compatibility character, with
Greek small letter mu (U+03BC)
as its compatibility decomposition.
This means that the two are distinct characters but the
micro sign has been included
for legacy reasons only, and the two are equivalent except perhaps for
formatting information. In practice, the characters are very often
similar in appearance. Since the micro sign
is more widely available, it is probably to be preferred. It might also
be argued that it has unambiguous semantics, whereas
Greek small letter mu
is primarily a letter and has varying other uses as well.
- The symbol for ohm can be identified with
ohm sign (Ω), U+2126.
It is a character with a specific meaning (in the Symbols Area),
but it is defined as being canonically equivalent to
Greek capital letter omega (Ω),
U+03A9. Although Unicode recommends the use of capital omega rather than
the ohm sign, the latter has been reported to have better coverage in fonts.
- The degree symbol is naturally the
degree sign (U+00B0). The
Metric System FAQ
explains (in clause 1.12) the common
confusion between this symbol and the
masculine ordinal indicator.
These characters look very similar or even identical in many fonts,
but in other fonts, they are rather different.
In such fonts,
1º (one followed by masculine ordinal
indicator, hence meaning primero)
looks different from
1° (one degree).
- The symbols for minutes and seconds
in expressions for angles should be identified with
the prime (U+2032) and
the double prime (U+2033).
However, these characters are rarely available, so it is common to
use the Ascii
apostrophe (U+0027)
and the Ascii
quotation mark (U+0022) as surrogates.
In visual appearance,
prime and double prime
are clearly slanted, whereas
apostrophe and
quotation mark should have straight
(vertical) glyphs according to Unicode, and they often do.
- Several letterlike symbols in Unicode
denote characters used in the SI context, in a sense.
But this is mostly an illusion, and a misleading one.
For example,
the script small l, ℓ (U+2113),
is often used as a symbol for litre.
However, the
NIST Guide to SI units explicitly says that
“The script letter ℓ
is not an approved symbol for the liter.”
Such confusions will be separately discussed in the sequel.
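
The properties that the notes above rely on (character name, general category, decomposition) can be read from the Unicode character database. The following is a minimal sketch using Python's standard unicodedata module. The selection of code points and the helper name describe are illustrative assumptions, not part of any standard.

```python
import unicodedata

# Code points discussed in the notes above; the selection is illustrative.
CANDIDATES = [
    "\u00B7",  # middle dot
    "\u22C5",  # dot operator
    "\u00D7",  # multiplication sign
    "\u2215",  # division slash
    "\u002F",  # solidus
    "\u2212",  # minus sign
    "\u002D",  # hyphen-minus
    "\u00B5",  # micro sign
    "\u03BC",  # Greek small letter mu
    "\u2126",  # ohm sign
    "\u00B0",  # degree sign
    "\u2032",  # prime
    "\u2033",  # double prime
    "\u2113",  # script small l
]


def describe(ch: str) -> tuple[str, str, str, str]:
    """Return (code point, name, general category, decomposition) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.decomposition(ch),  # empty string if there is none
    )


if __name__ == "__main__":
    for ch in CANDIDATES:
        print(*describe(ch), sep="\t")
```

In the output, middle dot is reported with category Po (punctuation) and dot operator with Sm (math symbol), which is the distinction the argument about the multiplication dot rests on. The decomposition column shows a compatibility mapping for micro sign and a canonical one for ohm sign.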
People interested in unit symbols and Unicode have been
surprised to find that e.g.
the unit “degrees Celsius” has a symbol of
its own, U+2103, presenting °C as a single character. Similarly for
degrees Fahrenheit (a completely non-SI unit of
course) there is U+2109, for siemens U+2127, and for kelvin
U+212A, for example, in the Letterlike Symbols
block. Educated people may well think that
it is better to use such specific characters, with specific meanings,
especially if dealing with documents which might be read by a
text-to-speech converter later on, or otherwise
processed by software that might utilize semantic information
about characters. They might also be seen as typographically
suitable, since they allow detailed formatting that corresponds
to the specific meanings.
But in addition to being poorly supported in most fonts,
such characters are inadequate in principle, by Unicode rules.
For example, degree Celsius (U+2103) is defined as compatibility
equivalent to U+00B0 U+0043 (i.e., degree sign followed by letter C).
It has little to do with typographic correctness. Rather, it is a matter
of compatibility, so that data containing that character in some
non-Unicode encoding can be encoded in Unicode without losing the distinction
between that character and the U+00B0 U+0043 pair, should someone wish to
retain that distinction. This means that the data can also be converted
back to the original encoding and get the original data exactly.
It is not recommended for use in new, originally Unicode data.
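
The compatibility mapping described above can be observed directly. This is a minimal sketch, assuming Python's standard unicodedata module. The helper name fold_for_search is hypothetical and only illustrates one way of treating the single character and the two-character sequence as the same for matching purposes.

```python
import unicodedata

DEGREE_CELSIUS = "\u2103"  # single compatibility character
DEGREE_C_PAIR = "\u00B0C"  # degree sign followed by Latin capital letter C


def fold_for_search(text: str) -> str:
    """Hypothetical helper: apply NFKC so that compatibility characters
    compare equal to their plain equivalents."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # The character database records the mapping as a compatibility decomposition.
    print(unicodedata.decomposition(DEGREE_CELSIUS))
    # NFC keeps the compatibility character; NFKC replaces it by the pair.
    print(unicodedata.normalize("NFC", DEGREE_CELSIUS) == DEGREE_CELSIUS)
    print(fold_for_search("25 \u2103") == "25 " + DEGREE_C_PAIR)
```

NFKC is a lossy folding: it also turns superscript two into a plain digit 2 and no-break space into an ordinary space, so it is better suited to building search keys than to rewriting stored text.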
The Unicode standard
says, in chapter Symbols:
Several letterlike symbols are used to indicate units. In
most cases, however, such as for SI units (Système International), the
use of regular letters or other symbols is preferred. U+2113 SCRIPT
SMALL L is commonly used as a non-SI symbol for the liter.
SI usage prefers the regular lowercase letter l.
Three letterlike symbols have been given canonical equivalence to regular
letters: U+2126 OHM SIGN, U+212A KELVIN SIGN, and U+212B ANGSTROM SIGN.
In all three instances the regular letter should be used. […]
In normal use,
it is better to represent degrees Celsius “°C”
with a sequence of U+00B0
DEGREE SIGN + U+0043 LATIN CAPITAL LETTER C, rather than
U+2103 DEGREE CELSIUS. For searching, treat these two sequences as
identical.
Unfortunately the Unicode
standard has wrong information about the symbol for the
litre. The official position in the SI system is that both “l” and “L”
are allowed, with no expressed preference. In the US,
“L” is preferred by national authorities.
The ISO 80000-2 standard says that ISO uses lowercase l
As regards the question why the special letterlike characters
exist in the first place,
a Usenet posting by Markus Kuhn explains:
Old ideographic character sets from East Asia, for example
JIS X 0212, contain lots of characters for individual SI units.
Design goal of Unicode was to be round-trip compatible with
all these characters. This means, it must be possible to
convert JIS X 0212 to Unicode and back to JIS X 0212, without
any loss of information. As a result, Unicode now contains a lot
of nonsense characters that really nobody should be using.
The characters that you should use are those in Unicode Normalization
Form C. Unfortunately, not too many people have actually read
the Unicode standard, which is available from Addison Wesley and
is thicker than many telephone books. People know Unicode only from
simple-minded selection tables and often pick the completely wrong
characters, as these tables to not show the descriptive comments that
the standard provides for each character.
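
The advice about Normalization Form C can be checked for the letterlike unit symbols mentioned in this document. The sketch below assumes Python's standard unicodedata module. The list of characters and the function name nfc_report are illustrative choices only.

```python
import unicodedata

# Letterlike unit symbols mentioned in this document (illustrative selection).
LETTERLIKE = ["\u2126", "\u212A", "\u212B", "\u00B5", "\u2103", "\u2113"]


def nfc_report(chars):
    """Map each character name to (NFC form, whether NFC changed the character)."""
    report = {}
    for ch in chars:
        nfc = unicodedata.normalize("NFC", ch)
        report[unicodedata.name(ch)] = (nfc, nfc != ch)
    return report


if __name__ == "__main__":
    for name, (nfc, changed) in nfc_report(LETTERLIKE).items():
        status = "replaced by" if changed else "kept as"
        print(f"{name}: {status} U+{ord(nfc):04X}")
```

Only the three canonically equivalent signs (ohm, kelvin, angstrom) are replaced by NFC. Micro sign, degree Celsius and script small l have compatibility mappings only, so NFC leaves them in place and avoiding them remains a matter of authoring practice.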
To conclude, it is acceptable and recommended
to use normal Latin letters as SI unit symbols, such as
“K” for kelvin.
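
Several of the character choices discussed in this document can be combined in a small formatting helper. The following is only a sketch under stated assumptions: the function name format_quantity and its parameters are hypothetical, the number is formatted with a fixed count of decimals, and the no-break space is used between number and unit (one of the approaches listed earlier, with the width of the space left unaddressed).

```python
NBSP = "\u00A0"   # no-break space between the number and the unit
MINUS = "\u2212"  # minus sign, used instead of hyphen-minus


def format_quantity(value: float, unit: str, decimals: int = 1,
                    decimal_sign: str = ".") -> str:
    """Hypothetical helper: format a value with a unit symbol, using the
    minus sign and a no-break space as discussed in this document."""
    number = f"{value:.{decimals}f}"
    # Replace the Ascii hyphen-minus produced by Python with the real minus sign,
    # then apply the language-dependent decimal sign.
    number = number.replace("-", MINUS).replace(".", decimal_sign)
    return f"{number}{NBSP}{unit}"


if __name__ == "__main__":
    print(format_quantity(1.5, "km"))
    print(format_quantity(-4.25, "\u00B0C", decimals=2, decimal_sign=","))
```

The unit is passed as ordinary characters, for example degree sign followed by the letter C, in line with the recommendation to avoid the single-character letterlike symbols.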