The so-called MS Windows character set, or Windows Latin 1, contains, in addition to ISO Latin 1 (ISO 8859-1) characters, some special characters like em dash, trademark symbol, and asymmetric quote characters. A Web author who works in a Windows environment may not realize that by using such characters he creates problems for some users. Typically, if an author naively types a trademark symbol, a browser running on Unix or some other non-Windows system may display a blank instead of the trademark symbol, or something worse. This document explains this problem in some detail and outlines various solutions.
The following characters
are still somewhat risky in HTML documents:

The same applies to euro sign, as well as to Z and z with caron, with the additional note that since they are additions to the original MS Windows character set, they have caused even more problems than the others.
There is nothing wrong with the characters discussed here. They have their legitimate uses, and they are, as characters, part of many other character repertoires too, such as Unicode and ISO 10646. The problem is that they cannot be used completely reliably in HTML.
There is a very large number of useful characters in Unicode, but the great majority of them cannot be used in HTML documents with reasonably good success yet. The situation is improving, as the so-called internationalization of the Web proceeds and gets implemented. But often we still must live with rather limited character repertoires, unless we are willing to restrict accessibility radically. Consider, for example, the difficulties in using mathematical symbols, or phonetic (IPA) characters, which are essential in certain areas. Compared with that, the need to use a hyphen instead of en dash, for example, is a relatively small detail. The use of typographically correct punctuation is a matter of esthetics, as opposed to delivering a message.
This document is not intended to make any judgment on the MS Windows systems themselves. And the characters discussed here can be used within and between Windows systems using the MS Windows character encoding.
The point is that on the World Wide Web, one should not expect that vendor-specific, system-dependent encodings work universally. There are standardized methods of presenting the characters under discussion, but they may fail to work too, for different reasons.
The main reason why the characters discussed here cause problems is that various attempts to present them create an illusion of working. When you create an HTML document and either consciously or unconsciously use, for example, the trademark symbol, you will probably see it right on your browser, and so will many others. But a large number of other people will see just a blank, or even have their display messed up by some control function.
Although the trademark symbol, for example,
probably looks somewhat better than
the result of using a replacement (like HTML markup
<SUP>(TM)</SUP>,
which looks like the following on your current browser:
(TM)),
the gain is rather small as compared with the damage caused
when the vendor-specific method of presenting the symbol does not work
at all, i.e. information is lost.
Of course, in some cases this might not matter so much
while in others it can be quite serious (see the examples).
But note that the effect varies; it need not be simply a space,
though this is a common situation. (Bob Baumel's
document on special characters contains some
examples of different behavior.)
Naturally, the warnings equally apply to any cross-platform transfer of data. However, when data is transferred to a known system - instead of being made accessible from any platform - one can often use a suitable character code conversion program. For example, when transferring text data from Windows to Macintosh, one can handle some of the characters discussed here, if one correctly converts from the Windows encoding to the Mac encoding.
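A conversion of the kind mentioned above can be sketched as follows. This is only an illustration, assuming Python's standard `cp1252` and `mac_roman` codecs as stand-ins for "the Windows encoding" and "the Mac encoding"; the function name `windows_to_mac` is hypothetical.

```python
def windows_to_mac(data: bytes) -> bytes:
    """Re-encode windows-1252 octets as Mac Roman octets (illustrative only)."""
    # Interpret the octets according to the Windows encoding.
    # Octets that the codec leaves undefined become U+FFFD.
    text = data.decode("cp1252", errors="replace")
    # Encode for the Mac; characters missing from Mac Roman become "?".
    return text.encode("mac_roman", errors="replace")


if __name__ == "__main__":
    print(windows_to_mac(b"\x97"))  # em dash: has a Mac Roman code
    print(windows_to_mac(b"\x8a"))  # S with caron: no Mac Roman code
```

Only some of the characters survive such a conversion: the em dash has a code in Mac Roman, whereas S with caron does not and is replaced by a question mark here.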
The above-mentioned problems are much less serious than they used to be. Generally, browsers on platforms other than Windows try to handle the characters discussed here, since they are actually used quite a lot on the Web. But new problems have emerged, such as the following:
The following table lists the characters we are discussing, i.e. the Windows characters which are not ISO Latin 1 characters. The Windows and ISO 10646 names as well as code numbers are given, Windows code in decimal and ISO 10646 code in hexadecimal. The column "# ref." contains the numeric character references (containing the Unicode code number in decimal) that can be used in HTML, but see warnings below.
| Windows name | ISO 10646 name of character | Win | Unicode | # ref. |
|---|---|---|---|---|
| baseline single quote | single low-9 quotation mark | 130 | U+201A | &#8218; |
| florin | Latin small letter f with hook | 131 | U+0192 | &#402; |
| baseline double quote | double low-9 quotation mark | 132 | U+201E | &#8222; |
| ellipsis | horizontal ellipsis | 133 | U+2026 | &#8230; |
| dagger | dagger | 134 | U+2020 | &#8224; |
| double dagger | double dagger | 135 | U+2021 | &#8225; |
| circumflex accent | modifier letter circumflex accent | 136 | U+02C6 | &#710; |
| permile | per mille sign | 137 | U+2030 | &#8240; |
| S Hacek | Latin capital letter S with caron | 138 | U+0160 | &#352; |
| left single guillemet | single left-pointing angle quot. m. | 139 | U+2039 | &#8249; |
| OE ligature | Latin capital ligature OE | 140 | U+0152 | &#338; |
| left single quote | left single quotation mark | 145 | U+2018 | &#8216; |
| right single quote | right single quotation mark | 146 | U+2019 | &#8217; |
| left double quote | left double quotation mark | 147 | U+201C | &#8220; |
| right double quote | right double quotation mark | 148 | U+201D | &#8221; |
| bullet | bullet | 149 | U+2022 | &#8226; |
| endash | en dash | 150 | U+2013 | &#8211; |
| emdash | em dash | 151 | U+2014 | &#8212; |
| tilde accent | small tilde | 152 | U+02DC | &#732; |
| trademark ligature | trade mark sign | 153 | U+2122 | &#8482; |
| s Hacek | Latin small letter s with caron | 154 | U+0161 | &#353; |
| right single guillemet | single right-pointing angle quot. m. | 155 | U+203A | &#8250; |
| oe ligature | Latin small ligature oe | 156 | U+0153 | &#339; |
| Y Dieresis | Latin capital letter Y with diaeresis | 159 | U+0178 | &#376; |
Notes:
windows-1252. Unofficial
synonyms include cp-1252 and WinLatin1.
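The correspondence between Windows codes, Unicode code numbers and numeric character references can be checked mechanically. The following sketch assumes that Python's `cp1252` codec is an adequate model of windows-1252; the function name `windows_extras` is made up for this illustration.

```python
import unicodedata


def windows_extras():
    """List (Windows code, Unicode number, numeric reference, Unicode name)
    for octets 128..159 as interpreted by Python's cp1252 codec."""
    rows = []
    for code in range(128, 160):
        try:
            ch = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            continue  # the codec assigns no character to this position
        rows.append((code,
                     "U+%04X" % ord(ch),
                     "&#%d;" % ord(ch),
                     unicodedata.name(ch)))
    return rows


if __name__ == "__main__":
    for row in windows_extras():
        print(*row, sep="\t")
```

The codec also covers the later additions (euro sign, Z and z with caron), so the output contains a few more rows than the table above.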
There are of course some reasons why the characters we are discussing were included in the "Windows character set" (as well as in some other character repertoires). People who need a character tend to use it if they can. And many people are accustomed to using programs like MS Word where a large character repertoire is available. They usually just use any way of inserting special characters they need. (On MS Windows systems, a rather universal way of inserting the characters under discussion is the so-called Alt-nnnn method.) Normally they are satisfied when they see the characters presented on paper. So far so good.
The problem is that the internal encoding of the characters can
be interpreted in different ways if the data is transferred to or
processed in different programs and systems.
For instance, if you use on Windows Alt-0151 to insert an em dash
into a file and that file is transferred, without conversion,
to a
Unix system, anything may happen. Unix systems typically use
some ISO 8859 encoding nowadays, and that means that the octet (byte)
with value 151 in decimal is in the range reserved for control
characters. Problems may occur even if you don't transfer the file
to a different computer. If you use e.g. the type command
on the file
at the DOS level, you will see something like
ù (letter u with grave accent) instead of em dash!
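The different interpretations of one and the same octet can be illustrated with a short sketch. It assumes Python's `cp1252`, `latin-1` and `cp437` codecs as models of the Windows encoding, ISO 8859-1 and a common DOS code page, respectively; the function name `interpretations` is hypothetical.

```python
def interpretations(octet_value: int) -> dict:
    """Decode one octet under three encodings and return the results."""
    octet = bytes([octet_value])
    return {name: octet.decode(name)
            for name in ("cp1252", "latin-1", "cp437")}


if __name__ == "__main__":
    for name, ch in interpretations(151).items():
        # ascii() keeps the output readable even for control characters.
        print(name, "U+%04X" % ord(ch), ascii(ch))
```

For octet 151, the Windows interpretation is U+2014 (em dash), the ISO 8859-1 interpretation is the control character U+0097, and the DOS code page 437 interpretation is U+00F9.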
On the Web, people use different browsers on different systems. Therefore, anything you put onto the Web is thereby "virtually" transferred to a huge variety of systems. Consequently, an HTML document for the Web should not contain anything that works on some operating systems only, no matter how common they are.
The problematic characters are often produced by different programs, such as HTML editors or converters. Naturally, they shouldn't behave that way, but many of them actually do. It's often a good idea to check that output from such tools does not contain any octets (bytes) in the range 128 - 159 decimal (200 - 237 octal). (A very simple C program could do that, for example.)
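The text mentions a C program for this check; the following is an equivalent minimal sketch in Python instead. It assumes the file is meant to be in a single-octet encoding such as ISO 8859-1 (in a utf-8 encoded file, octets in this range occur legitimately as parts of multi-octet sequences). The function name `find_suspect_octets` is invented for the example.

```python
import sys


def find_suspect_octets(data: bytes):
    """Return (offset, value) pairs for octets in the range 128..159."""
    return [(offset, value)
            for offset, value in enumerate(data)
            if 128 <= value <= 159]


if __name__ == "__main__":
    # Usage: python check.py file.html
    with open(sys.argv[1], "rb") as f:
        for offset, value in find_suspect_octets(f.read()):
            print("offset %d: octet %d (octal %o)" % (offset, value, value))
```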
The following table summarizes the most common attempts to present in HTML the characters we discuss here. For concreteness, the table shows examples of presenting a particular character, the em dash.
| method | example | problems |
|---|---|---|
| "raw data" in windows-1252 | (octet with value 151 in decimal) | works quite often - when the data is interpreted as windows-1252 encoded |
| "raw data" in utf-8 | (octets that encode U+2014 in utf-8) | works often, but the entire document must be utf-8 encoded |
| character reference using Windows code | &#151; | undefined by specifications, but works rather often |
| entity reference | &mdash; | works rather often (though not on Netscape 4) |
| correct character reference | &#8212; | works very widely |
| an alternative correct character reference | &#x2014; | works somewhat less often than the decimal form |
| an image | <IMG SRC="mdash.gif" ALT="--"> | does not match the size of normal characters (except by accident); cf. the notes on using an image in The euro sign in HTML |
Presenting a character as "raw data" simply means that the character is presented as an octet (byte) or a sequence of octets according to the encoding used for the document. This is how most characters are actually presented in HTML documents. There is nothing mystical about it. (If you type characters from a keyboard using an editor, what normally happens is that you actually enter characters as "raw data" in some encoding; in some cases, you use some special methods for entering characters when they cannot be directly typed.)
The problem with the "raw data" method for the characters discussed here is that it works only for those browsers (and other user agents) that can handle data in the specific encoding used. There is a very large number of registered character encodings (and many unregistered encodings, too). One can hardly expect Web browsers generally to handle whatever encoding an author has decided to use. In fact, the ISO 8859-1 encoding is the only encoding which can reasonably be expected to be known to any browser. Although the Windows encoding is very widely used, it might not be understood by browsers running on systems other than Windows. On the other hand, browsers running in a Windows environment usually treat documents according to the Windows encoding, if the server does not specify the encoding or if the encoding is specified to be ISO 8859-1.
In principle, if the "raw data" method is used, the server should
send an HTTP header which specifies the encoding used.
When octets are to be interpreted according to the Windows encoding
(e.g. octet 151 means em dash), the server should send
Content-Type: text/html;charset=windows-1252
However, for reasons explained above, such headers usually don't
make browsers process the data any better than they would by default.
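How a program on the receiving side might honor such a header can be sketched as follows. This is an illustration only, not a description of what any browser does; it uses the standard library's `email.message.Message` to parse the header value, and the function name `decode_body` and the ISO 8859-1 default are assumptions made for the example.

```python
from email.message import Message


def decode_body(body: bytes, content_type: str,
                default: str = "iso-8859-1") -> str:
    """Decode octets according to the charset parameter of a
    Content-Type header value, falling back to a default encoding."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset(default)
    # An unknown charset name would raise LookupError here.
    return body.decode(charset, errors="replace")


if __name__ == "__main__":
    octets = b"\x97"
    print(ascii(decode_body(octets, "text/html;charset=windows-1252")))
    print(ascii(decode_body(octets, "text/html")))
```

With the charset parameter present, octet 151 is decoded as an em dash; without it, the sketch falls back to ISO 8859-1 and yields a control character.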
The problem with notations like &#151;
is that their meaning is undefined, i.e. anything may happen.
In practice, users mostly see an em dash, but they might
alternatively see perhaps a space,
perhaps nothing - or perhaps the screen gets messed up.
After all, code positions 128 - 159
have been reserved
for eventual use as control codes ("control characters"),
and they might actually be used that way in some environments.
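If a document already contains references that use Windows codes, they can be rewritten into references that use Unicode code numbers. The following is a rough sketch, assuming that Python's `cp1252` codec gives the intended meaning of codes 128 - 159; the function name `fix_windows_refs` is hypothetical, and only decimal references are handled.

```python
import re


def fix_windows_refs(html_text: str) -> str:
    """Rewrite &#128; .. &#159; into references with Unicode code numbers."""
    def replace(match):
        number = int(match.group(1))
        if 128 <= number <= 159:
            try:
                ch = bytes([number]).decode("cp1252")
            except UnicodeDecodeError:
                return match.group(0)  # no character assigned; leave as is
            return "&#%d;" % ord(ch)
        return match.group(0)  # outside the problematic range

    return re.sub(r"&#(\d+);", replace, html_text)


if __name__ == "__main__":
    print(fix_windows_refs("Widget&#153; &#151; now better"))
```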
At present, almost all browsers support the following two methods, which are defined in the HTML 4.01 Specification:
&#n; where n is the code number, in decimal, of the character in Unicode and ISO 10646. (To use this method, you often need to convert numbers from hexadecimal notation to decimal, since the code numbers are given in hexadecimal in most references. HTML 4.01 also allows a character reference using hexadecimal numbers, but it is less widely supported.)
&name; which is defined for some characters, including those we are discussing here. There is a handy reference HTML 4.0 Entities by WDG; see its section on "Special Entities" for most of the characters discussed here.
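The conversion from a hexadecimal code number to the different reference notations is simple arithmetic. The sketch below shows it in Python, using the standard `html.entities.codepoint2name` table for entity names; the function name `reference_forms` is made up for the example.

```python
from html.entities import codepoint2name


def reference_forms(ch: str) -> dict:
    """Return the decimal, hexadecimal and (if defined) entity
    reference for a single character."""
    number = ord(ch)
    name = codepoint2name.get(number)  # None if no entity is listed
    return {
        "decimal": "&#%d;" % number,
        "hex": "&#x%X;" % number,
        "entity": ("&%s;" % name) if name else None,
    }


if __name__ == "__main__":
    # U+2014 is given in hexadecimal in most references; int() converts it.
    print(int("2014", 16))
    print(reference_forms("\u2014"))
```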
So are the characters safe? I would estimate that in more than 99 % of browsing situations both the character references and entity references work well. The question really is: is the gain important enough to justify the potential problems caused to a small minority?
There are some cases where the use of typographically correct
characters is certainly justified.
If you need to include e.g. Greek and Cyrillic letters on one page, then
any method for using such a large character repertoire
(one of which is described in my document
Using national and special characters in HTML)
considerably limits accessibility at present and in the near future.
If you have good reasons to do so, then you may as well use
"smart quotes", em and en dashes, and other characters discussed here,
naturally using the method you have selected to solve the fundamental
problem. (When using "the most universal way" described in that
document of mine, you would use
&#8212; for em dash etc., using the Unicode
positions mentioned in the
list of characters above.)
For more detailed explanations of some of the problems, see ISO-8859 briefing and resources by Alan J. Flavell.
If you decide to use characters like em dashes, en dashes, and "smart" quotes, make sure you use them properly, according to the rules of the natural language you write. It's easy to go wrong here, since there have been breaks in typographic traditions, when those characters have been (and still largely are) avoided when producing texts on computers. For dashes in particular, see some usage notes in Dashes and hyphens.
For the em dash character in particular, different tricks have been suggested and used. This character looks so simple that people have thought that there must be a way to fool browsers into displaying something like that even if the character itself is not available.
As regards the em dash in particular,
Andreas Prilop has mentioned an
interesting possibility:
<TT>-</TT>
(He also mentions
<FONT FACE="Symbol">-</FONT>; although
that might give an even wider glyph, it relies on the user's system having
a font with a particular name, whereas the
TT element
is universally supported.)
This particular method essentially consists of using a
hyphen (-) as a surrogate for em dash but with a presentation
suggestion to display it using a font where the glyph for hyphen
is expected to be wider than a normal hyphen. Although it often
creates a good presentation, it has been said that
the hyphen character of some monospace fonts looks bad, especially
in the midst of normal text.
Yet another approach is to use two consecutive hyphens, with
a style sheet suggestion to reduce the spacing between them, hoping
that they will look like a dash.
This would apply to situations where "--" is an acceptable surrogate
for a dash. For some odd reason, Internet Explorer seems to be
immune to the style rule in this particular case, unless you
use the nobr markup. Here is
what your browser presents when the
construct
<nobr class="dash">--</nobr> is used together
with the style sheet
.dash { letter-spacing: -0.1em; }
Various other hacks have also been suggested, such as using a few
no-break spaces within a STRIKE element
to "construct" an em dash!
I have prepared a small
test file containing examples of
and annotations on
such attempts
as well as the above-mentioned methods.
Whenever you need a character and can't use it, you need to consider substitutes. For the characters discussed here, relatively good substitutes can be found:
| Windows name | substitute | comments |
|---|---|---|
| baseline single quote | ' | apostrophe used as single quote |
| florin | <i>f</i> or NLG or gulden(s) | letter f in italics or the currency code or name |
| baseline double quote | " | quotation mark (double quote) |
| ellipsis | ... | three dots, possibly styled |
| dagger | ¹ | superscript 1: ¹ (assuming use as footnote reference) |
| double dagger | ² | superscript 2: ² (assuming use as footnote reference) |
| circumflex accent | ^ | circumflex |
| permile | o/oo | usual, but somewhat illogical |
| S Hacek | Sh or SH | language-dependent |
| left single guillemet | < or ' | "<" used as "left angle bracket", or an apostrophe used as single quote |
| OE ligature | Oe or OE | optionally styled; natural due to what "ligature" means |
| left single quote | ' | apostrophe used as single quote |
| right single quote | ' | apostrophe used as single quote |
| left double quote | " | quotation mark (double quote) |
| right double quote | " | quotation mark (double quote) |
| bullet | * or - or list markup | consider using <ul> and <li> markup instead |
| endash | - | hyphen |
| emdash | -- | two hyphens |
| tilde accent | ~ or <sup>~</sup> | tilde ~, possibly in superscript style: ~ |
| trademark ligature | <sup>(TM)</sup> | (TM) in superscript style: (TM) |
| s Hacek | sh | language-dependent |
| right single guillemet | > or ' | ">" used as "right angle bracket", or an apostrophe used as single quote |
| oe ligature | oe | natural due to what "ligature" means |
| Y Dieresis | IJ or Y | depending on intended meaning |
Notes:
<style type="text/css"><!--
.ellip { letter-spacing: 0.08em; }
--></style>
and
<span class="ellip">...</span>
<SMALL><SUP>0</SUP></SMALL>/<SMALL><SUB>00</SUB></SMALL>
UL element
In typewritten material, the em dash is represented by two hyphens with no space around them, and an en dash is represented by a hyphen, and it seems natural to apply this to material where em and en dash cannot be used. (For example, for the Finnish language, there is an official recommendation which deviates from English practice: a single hyphen is used as a replacement, in many cases surrounded by spaces.) You might consider using font-level markup to suggest that a hyphen used as a surrogate for dash be displayed in a particular font (to make it look more like a dash).
H1
and
STRONG)
instead.
On the other hand, it
has been reported that y dieresis is used as a ligature for ij in Dutch.
This probably means that ÿ is used as a surrogate for a real
ij ligature
(which exists in
Unicode).
Thus, anyone intending to write a Dutch word using a ligature for IJ should
really type just IJ. (In situations where support for Unicode could be relied on,
the real IJ ligature, U+0132, could be used.)
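Part of the substitution table above can be applied mechanically to plain text. The following sketch covers only the language-independent, markup-free substitutes (so the trademark sign becomes plain "(TM)" without superscript markup, and the letters with caron and the ligatures are left alone); the names `SUBSTITUTES` and `to_plain_substitutes` are invented for this illustration.

```python
# A subset of the substitution table, keyed by Unicode character.
SUBSTITUTES = {
    "\u201a": "'",    # baseline single quote
    "\u201e": '"',    # baseline double quote
    "\u2026": "...",  # ellipsis
    "\u02c6": "^",    # circumflex accent
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u02dc": "~",    # tilde accent
    "\u2122": "(TM)", # trademark sign, without superscript markup
}

_TABLE = str.maketrans(SUBSTITUTES)


def to_plain_substitutes(text: str) -> str:
    """Replace the characters listed above by their substitutes."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(to_plain_substitutes("\u201cSmart\u201d quotes \u2014 and Widget\u2122"))
```

Substitutes that depend on the language or on the intended meaning (such as those for S with caron or Y with dieresis) are deliberately omitted, since they need human judgment.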
The article Window[s] Characters and HTML, based on an early version of this document, was published in Boardwatch in June 2000. The tone of the current document is different, since support for the use of these characters has become substantially wider.
If you found this document useful, you might wish to check other documents on character problems in Web authoring by the same author.
Note to Finnish readers: This document is an extended version of my Finnish-language document Mikrojen merkistöjen aiheuttamista ongelmista Webissä (on problems caused by microcomputer character sets on the Web).