Dashes and hyphens

Preface

The use of computers for text processing has caused a decline in typography and orthography in many areas, in addition to undeniable benefits and advances. In particular, the character repertoire available for writing texts is still very often very limited. Quite often it’s just the “good old Ascii” set, or the somewhat larger ISO Latin 1 set. In that case, you cannot use even em or en dashes or “smart quotes”; instead, you use the Ascii characters hyphen (-), quotation mark ("), and apostrophe ('), at least if you wish to stay on the safe side. (See e.g. On the use of some MS Windows characters in HTML.) However, the situation slowly improves, so that contexts where you can use a richer character repertoire without risking too much become more common. Since the old rules have largely been forgotten or they were never learned by many people, and since there are some new (good) rules to note, it’s time to discuss how to use extended character repertoires properly. This document discusses specifically various dashes and hyphens—loosely speaking, those characters for which we have used Ascii hyphens as surrogates, in lack of anything better.

For obvious reasons, this document itself needs to take the risk of getting improperly displayed by browsers that do not support a rich enough character repertoire. After all, we need to give examples of the usage of the characters discussed. I decided to demonstrate also the use of proper quotation marks and apostrophes. This means that I have used, in HTML, the character reference &#8220; for a left double quotation mark (“), &#8221; for a right double quotation mark (”), &#8216; for a left single quotation mark (‘), and &#8217; for a right single quotation mark, which is also used as an apostrophe (’).

The Unicode dashes

In Unicode, there is a rather large collection of hyphen- or dash-like characters. Specifically, there is an official list [Unicode, chapter 6, Table 6-3], which is presented here as amended with some additional reference information. This table also contains the soft hyphen, which belonged to the corresponding table in Unicode 3 but is just mentioned after the table in the current version.

Unicode Dash Characters
glyph | code | name | notes on the meaning and usage
- | U+002D | hyphen-minus | the Ascii hyphen, with multiple usage, or “ambiguous semantic value”; the width should be “average”
~ | U+007E | tilde | the Ascii tilde, with multiple usage; “swung dash”
­ | U+00AD | soft hyphen | “discretionary hyphen”
֊ | U+058A | armenian hyphen | as soft hyphen, but different in shape
᠆ | U+1806 | mongolian todo soft hyphen | as soft hyphen, but displayed at the beginning of the second line
‐ | U+2010 | hyphen | unambiguously a hyphen character, as in “left-to-right”; narrow width
‑ | U+2011 | non-breaking hyphen | as hyphen (U+2010), but not an allowed line break point
‒ | U+2012 | figure dash | as hyphen-minus, but has the same width as digits
– | U+2013 | en dash | used e.g. to indicate a range of values
— | U+2014 | em dash | used e.g. to make a break in the flow of a sentence
― | U+2015 | horizontal bar | used to introduce quoted text in some typographic styles; “quotation dash”; often (e.g., in the representative glyph in the Unicode standard) longer than em dash
⁓ | U+2053 | swung dash | like a large tilde; often missing in fonts
⁻ | U+207B | superscript minus | a compatibility character which is equivalent to minus sign U+2212 in superscript style
₋ | U+208B | subscript minus | a compatibility character which is equivalent to minus sign U+2212 in subscript style
− | U+2212 | minus sign | an arithmetic operator; the glyph may look the same as the glyph for a hyphen-minus, or may be longer
〜 | U+301C | wave dash | a Chinese/Japanese/Korean character
〰 | U+3030 | wavy dash | a Chinese/Japanese/Korean character

The first column above may not actually display the glyph correctly, depending on your browser and on the fonts available on your system.
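
Independently of fonts, the identity of these characters can be checked against a Unicode character database. The following is a minimal sketch in Python using the standard unicodedata module. The names DASH_CODE_POINTS and describe are made up for this illustration, and the data reported comes from the Unicode version bundled with the Python interpreter, which is not necessarily the version discussed in this document.

```python
import unicodedata

# Code points from the table above (an illustrative list, not an official API).
DASH_CODE_POINTS = [
    0x002D, 0x007E, 0x00AD, 0x058A, 0x1806, 0x2010, 0x2011, 0x2012, 0x2013,
    0x2014, 0x2015, 0x2053, 0x207B, 0x208B, 0x2212, 0x301C, 0x3030,
]

def describe(code_point):
    """Return 'U+XXXX NAME (general category)' for one code point."""
    ch = chr(code_point)
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{code_point:04X} {name} ({unicodedata.category(ch)})"

if __name__ == "__main__":
    for cp in DASH_CODE_POINTS:
        print(describe(cp))
```

Most of the characters are reported with the general category Pd (dash punctuation), but not all of them: the tilde and the minus signs, for example, are classified as mathematical symbols (Sm).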

The notes in the table above are not from Table 6-3 of the standard but reflect the statements in the standard elsewhere. It is questionable why all those, and exactly those, characters are listed as “Dash Characters” there. There is nothing particularly hyphen-like or dash-like in the tilde character, for example. There is also a character that is characterized as a nonbreaking hyphen (despite not being very hyphen-like in appearance) but not listed in Table 6-3:

There are two nonbreaking hyphen characters in the Unicode standard: U+2011 non-breaking hyphen and U+0F0C tibetan mark delimiter tsheg bstar.
[Unicode, chapter 15]

The swung dash character was added in Unicode 4, and most fonts do not contain it. The phrase “swung dash” normally means a character used, for brevity, in dictionaries to stand for a word or part of a word that occurred previously. In appearance, it is like a large version of the tilde ~, and the tilde has often been used in the role of a swung dash, as the alternate name of the tilde suggests.

The character hyphen bullet U+2043 is not listed among the dash characters, and there is no cross reference in the description of the hyphen bullet in the code chart. It seems that the hyphen bullet is really meant to be a bullet character that looks like a hyphen (of a kind), rather than comparable to hyphens and dashes.

Some typographic recommendations mention the use of a three-em dash, especially in bibliographies to indicate that a cited work is written by the same authors as the preceding entry. There is no single character for 3 em dash in Unicode, but you can use three consecutive em dash characters. Beware, however, that not all fonts implement the em dash in a manner that makes it join continuously with an adjacent em dash, as needed for a good use of the typographic convention. For example, Times New Roman and Arial have “joining” em dash but Georgia does not.

Hyphens and dashes in line breaking rules

For a general discussion, please refer to Unicode line breaking rules: explanations and criticism.

In the Unicode standard, there are some special notes on line breaking behavior with respect to dashes and hyphens, in chapter 6 and chapter 15.

In particular, there is a rather confusing explanation of the soft hyphen:

Hyphenation. U+00AD soft hyphen (SHY) indicates an intraword break point, where a line break is preferred if a word must be hyphenated or otherwise broken across lines. Such break points are generally determined by an automatic hyphenator. The use of SHY is generally limited to situations where users need to override the behavior of such a hyphenator. The visible rendering of a line break at an intraword break point, whether automatically determined or indicated by a SHY, depends on the surrounding characters, the language, and, at times, the meaning of the word. The precise rules are outside the scope of this standard, but see Unicode Standard Annex #14, “Line Breaking Properties,” for additional information. A common default rendering is to insert a hyphen before the line break, but this is incorrect in many situations.

The Unicode Standard Annex #14, Line Breaking Properties, contains most of the information on line breaking in the standard. Note that the annex is a part of the standard. It’s really technical, and the properties assigned to individual characters are in a large data file, so I have composed a summary table.

Line breaking properties for Unicode dash characters
glyph | code | name | line breaking property class
- | U+002D | hyphen-minus | HY, Hyphen: provide a line break opportunity after the character, except in numeric context
~ | U+007E | tilde | AL, Ordinary Alphabetic and Symbol Characters
­ | U+00AD | soft hyphen | BA, Break Opportunity After: generally provide a line break opportunity after the character
֊ | U+058A | armenian hyphen | BA, Break Opportunity After
᠆ | U+1806 | mongolian todo soft hyphen | BB, Break Opportunity Before: generally provide a line break opportunity before the character
‐ | U+2010 | hyphen | BA, Break Opportunity After
‑ | U+2011 | non-breaking hyphen | GL, Non-breaking (“Glue”): prohibit line breaks before or after
‒ | U+2012 | figure dash | BA, Break Opportunity After
– | U+2013 | en dash | BA, Break Opportunity After
— | U+2014 | em dash | B2, Break Opportunity Before and After
― | U+2015 | horizontal bar | AL, Ordinary Alphabetic and Symbol Characters
⁓ | U+2053 | swung dash | AL, Ordinary Alphabetic and Symbol Characters
⁻ | U+207B | superscript minus | AL, Ordinary Alphabetic and Symbol Characters
₋ | U+208B | subscript minus | AL, Ordinary Alphabetic and Symbol Characters
− | U+2212 | minus sign | PR, Prefix (Numeric): don’t break in front of a numeric expression
〜 | U+301C | wave dash | NS, Non Starter: allow only indirect line break before
〰 | U+3030 | wavy dash | ID, Ideographic: break before or after
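
For experimentation, the table can be turned into a small lookup. The sketch below is in Python. The dictionary simply transcribes the classes listed above, and the helper may_break_after is a hypothetical and much simplified reading of them: the actual rules in the annex operate on pairs of adjacent characters and have exceptions, such as the numeric context for HY, that are ignored here.

```python
# Line breaking classes transcribed from the table above.
LINE_BREAK_CLASS = {
    "\u002D": "HY", "\u007E": "AL", "\u00AD": "BA", "\u058A": "BA",
    "\u1806": "BB", "\u2010": "BA", "\u2011": "GL", "\u2012": "BA",
    "\u2013": "BA", "\u2014": "B2", "\u2015": "AL", "\u2053": "AL",
    "\u207B": "AL", "\u208B": "AL", "\u2212": "PR", "\u301C": "NS",
    "\u3030": "ID",
}

# Classes whose description offers a break opportunity after the character.
# Simplification: pair rules and the numeric exception for HY are ignored.
BREAK_AFTER_CLASSES = {"HY", "BA", "B2", "ID"}

def may_break_after(ch):
    """Rough guess: does the class of ch offer a break opportunity after it?"""
    return LINE_BREAK_CLASS.get(ch) in BREAK_AFTER_CLASSES
```

For example, may_break_after("\u2010") is true for the hyphen, whereas the non-breaking hyphen "\u2011" and any character missing from the table give false.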

The descriptions of the line breaking property classes listed above are from Table 1 of the report. The exact meanings are specified by the more exact rules (partly formalized, partly in English) there. Specifically, the recommended rules include the following:

Note however that the report also says: “Higher level protocols may further restrict, override, or extend the line breaking properties of certain characters in some contexts”. On the other hand, the quality of programs that do line division varies greatly, and the guidelines in the report should be regarded as proposed principles for future software rather than descriptions of current practice. A good formatting algorithm will not, for example, blindly split a word after a hyphen even if this results in a single character from the word appearing at the start of a line, as Internet Explorer does. Note that such behavior, which occurs in MS Word too, may also affect expressions like “-s” (as in “the normal plural suffix in English is -s”), so it would be safest to use nonbreaking hyphens in such cases, if a sufficiently rich character repertoire can be used reliably. In MS Word, you can produce a nonbreaking hyphen by Ctrl Shift - (or, to describe it differently, Ctrl _).
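
The advice about suffix-like expressions can be applied mechanically. The following Python sketch replaces a hyphen-minus with U+2011 non-breaking hyphen when it begins a short fragment of letters such as -s or -en. The function name, the regular expression, and the limit of four letters are assumptions made for this example, and the pattern would need tuning for real text.

```python
import re

NON_BREAKING_HYPHEN = "\u2011"

# A hyphen-minus at the start of a fragment (at the start of the text, or
# after whitespace, an opening quotation mark, or an opening parenthesis)
# that is followed by one to four Ascii letters and then a word boundary.
_SUFFIX_HYPHEN = re.compile(r'(?<![^\s\u201C"(])-(?=[A-Za-z]{1,4}\b)')

def protect_suffix_hyphens(text):
    """Replace the leading hyphen of fragments like -s with U+2011."""
    return _SUFFIX_HYPHEN.sub(NON_BREAKING_HYPHEN, text)
```

Hyphens inside words, as in “left-to-right”, are left alone, since the pattern only matches a hyphen that is not preceded by a letter or other ordinary character.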

MS Word specialities

Although all reasonably new versions of MS Word support Unicode, there are many peculiarities and oddities in the way it handles Unicode characters. In particular, Word has an Insert/Symbol function where you can insert a character either by picking it up from a table (pane “Symbols”) or by using a quick menu for some commonly used characters (pane “Special Characters”). Some entries in the latter are rather misleading.

In the “Special Characters” menu,

However, when saving data in HTML format, Word 2002 generates &#8209; from its internal “Nonbreaking Hyphen” and the U+00AD soft hyphen from its internal “Optional Hyphen”.

It is possible to insert U+2011 or U+00AD e.g. using the “Symbols” pane or, in sufficiently new systems, by typing 2011 Alt-x or ad Alt-x, respectively. The non-breaking hyphen U+2011 then works properly, assuming the font in use contains a glyph for it. The soft hyphen U+00AD however is displayed as a visible hyphen.

Typographic usage

When a sufficient character repertoire is available, the following usage rules are suitable, since they comply with old typographic and orthographic principles and the defined Unicode meanings of characters:

Especially the en dash and em dash have language-dependent uses. The uses mentioned above (as taken from the Unicode standard) should primarily be taken as typical uses in American English. The detailed rules of their usage are obviously orthographical (and stylistic or typographic) rather than character code standard issues.

The Unicode standard mentions:

U+2013 en dash is used to indicate a range of values, such as 1973–1984. It should be distinguished from U+2212 minus, which is an arithmetic operator; however, typographers have typically used U+2013 en dash in typesetting to represent the minus sign. … In older mathematical typography, U+2014 em dash is also used to indicate a binary minus sign.
[Unicode, chapter 6]

One might conclude from this that if the minus sign cannot be used but the en dash is available (e.g., when the character repertoire is limited to the so-called Windows character set), the en dash is a better surrogate for the minus sign than the hyphen-minus or the em dash.
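
This order of preference can be tested against a concrete character encoding. The sketch below, in Python, tries the minus sign first, then the en dash, then the hyphen-minus, and returns the first one that the encoding can represent. The function name best_minus is hypothetical, and Python's codec name cp1252 is used here for what is called the Windows character set above.

```python
MINUS_SIGN = "\u2212"
EN_DASH = "\u2013"
HYPHEN_MINUS = "-"

def best_minus(encoding):
    """Pick the best available stand-in for a minus sign in an encoding."""
    for candidate in (MINUS_SIGN, EN_DASH, HYPHEN_MINUS):
        try:
            candidate.encode(encoding)
        except UnicodeEncodeError:
            continue  # not representable, try the next surrogate
        return candidate
    raise ValueError(f"no minus-like character in {encoding!r}")
```

With this sketch, best_minus("utf-8") gives the real minus sign, best_minus("cp1252") gives the en dash, and best_minus("latin-1") falls back to the hyphen-minus, since ISO Latin 1 contains neither of the other two.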

Hyphens and dashes: a closer look at English usage

Punctuation style varies according to language, style, and even authors’ personal preferences. The use of hyphens and dashes in literary American English seems to be relatively uniform, however. The following description is mostly based on a detailed explicit presentation in a style manual [Webster], which seems to reflect the actual practices in high-quality printed publications rather well. However, note that some stylistic usages do not make a distinction between an em dash and an en dash. Simple punctuation rules [Oxford] might just refer to a dash in general. In fact, the situation has been described [Typographical] so that “both European and Anglo saxon typesetters do in fact separate words by close to a full em length in this situation, but the European style is to leave a bit of white space around the (shorter) dash while the Anglo saxon style is to cover the full em length with a correspondingly longer dash instead.”

Hyphen

Hyphens are basically used inside words to separate their parts from each other. This includes using a hyphen between the components of a compound word, often with variation so that the word might also be spelled without the hyphen or as two distinct words. A somewhat similar usage is the use of a hyphen to combine an abbreviation with a suffix, as in D.H.-ing or AA-er, though the apostrophe is more commonly used for such purposes.

An important and well-known use is at the end of a line, to mark that a word has been split. The basic spelling might have a hyphen in the same position; the difference between such a case and a hyphen introduced by a formatting algorithm is not indicated (visually or by the choice of the hyphen character), although the ISO 8859 standards can be interpreted as defining the difference between a normal hyphen-minus and a soft hyphen for such a purpose.

Other uses are more casual. A hyphen is used, especially in linguistics, to indicate that a sequence of letters is a prefix or suffix or otherwise part of a word rather than a word of its own, as in “the plural suffix -en is very rare in English”. Hyphens might also be used to indicate possible hyphenation points or syllable structure of a word, though there are many other notations for this in dictionaries. Very casually, hyphens can be used to indicate stuttering, sobbing, or halting speech, as in y-y-es, or to indicate a word spelled out letter by letter, e.g. p-h-l-e-g-m.

Finally, hyphens are often used as surrogates for other characters, such as en dash. However, this is caused by the insufficiency of the available character repertoire and does not belong to optimal orthography.
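
When text written with the richer repertoire has to be delivered in plain Ascii, the surrogates can be produced by a simple character mapping. The Python sketch below is one possible convention, not a standard: most dash-like characters become a hyphen-minus, the em dash becomes two of them, and the soft hyphen is dropped. The names ASCII_SURROGATES and to_ascii_dashes are invented for this example.

```python
# Map dash-like characters to Ascii surrogates (an illustrative convention).
ASCII_SURROGATES = str.maketrans({
    "\u2010": "-",   # hyphen
    "\u2011": "-",   # non-breaking hyphen
    "\u2012": "-",   # figure dash
    "\u2013": "-",   # en dash
    "\u2014": "--",  # em dash
    "\u2212": "-",   # minus sign
    "\u00AD": "",    # soft hyphen: drop the invisible break hint
})

def to_ascii_dashes(text):
    """Replace hyphens and dashes by their Ascii surrogates."""
    return text.translate(ASCII_SURROGATES)
```

The mapping loses information, of course: after the conversion, a range, a minus sign, and a compound word can no longer be told apart by the character used.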

En dash

The en dash is used to indicate an interval of some sort. This might mean a range, as in “ages 10–15”, or a route, as in “Chicago–Memphis train”. Generally, the en dash thus means “to”.

Note that even if the real en dash character is used, confusion with the minus sign is possible, at least in principle, since the two characters look similar and there is no standard width for either of them. For such reasons, it has been recommended [SI Guide, section 7.7] that the word “to” be used instead when denoting numeric ranges, as in “0 V to 5 V” (instead of “0 V–5 V”). Obviously, in other languages other expressions need to be used. Note that not all languages have handy prepositions like “to”. One might consider using two or three dots, though then there might be confusion with the decimal point.
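
For English text, one possible policy that follows from the above is to write plain ranges with an en dash and to spell out “to” whenever a unit is attached. The Python sketch below illustrates this; the function name format_range and its parameters are hypothetical, and other languages would need other words in place of “to”.

```python
EN_DASH = "\u2013"

def format_range(low, high, unit=None):
    """Format a numeric range; spell out 'to' when a unit is attached."""
    if unit is None:
        return f"{low}{EN_DASH}{high}"
    # With a unit, avoid any dash that could be mistaken for a minus sign.
    return f"{low} {unit} to {high} {unit}"
```

For example, format_range(0, 5, "V") yields “0 V to 5 V”, whereas format_range(10, 15) joins the two numbers with an en dash.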

Various other usages are also known, such as using the en dash in expressions like “pre–Civil War” or “shock-wave–boundary-layer interaction”. In such usage, a hyphen would normally be used, but since a part of a compound phrase is itself hyphenated or consists of several words, some authors use an en dash instead, in the role of a hyphen with a “scope” different from normal. Some authors use the en dash for any compound attribute where the parts have equal weight; thus, they would use a hyphen for “big-boned woman” or “high-altitude test”, but they would use an en dash for “true–false test” or “question–answer format”. Such usages could be seen as conflicting with the normal distinction between a hyphen and an en dash, however. The same applies to the use of an en dash in place of a hyphen in all capital text [Caskill].

Em dash

The em dash is a multiple-use punctuation symbol, but it basically separates major parts of a statement, as opposed to the hyphen and the en dash, which have more “local”, separative functions.

The uses of the em dash can be classified as follows:

  1. abrupt change—something unexpected follows after this punctuation character
  2. abrupt termination, to indicate that the flow of speech ends unnaturally
  3. parenthetic remark—like this—which might be seen as a special case of an abrupt change followed by a return (in a sense, an abrupt change too) to the main flow of thought
  4. in quotations, an em dash can be used before the name of an author or other citation; this too can be seen as an abrupt change, from quoted text to attributions
  5. in enumerations, as alternative to a list bullet.

For parenthetic remarks, em dashes are common in literary usage, whereas scientific usage favors parentheses. It is also possible to use a style that distinguishes between different parenthetic remarks [Caskill]:

In some usage, “long dashes” are used: “a two em dash” might indicate missing letters in a word (or sometimes a missing word); whereas “a three em dash” might indicate that a word or phrase has been left out. There are no Unicode characters for such usage. More naturally, such usages involve two or three em dash characters in succession (—— or ———); in typical fonts, the dashes appear consecutively, creating the impression of a long dash. However, browsers may divide such a construct into two lines, even though Unicode line breaking rules explicitly forbid a line break between two em dashes. Thus, you may consider using nobr markup around them, or some other method to prevent undesired line breaks.
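
As an illustration of the nobr approach, the following Python sketch generates the HTML for a long dash of a given length. The function name long_dash_html is made up for this example. Note that nobr is nonstandard markup, so a style sheet rule such as white-space: nowrap on a wrapping element is an alternative way to get the same effect.

```python
EM_DASH = "\u2014"

def long_dash_html(ems):
    """Return HTML for a dash of the given length in ems, kept on one line."""
    if ems < 1:
        raise ValueError("ems must be at least 1")
    # Consecutive em dashes, wrapped so that browsers do not break between them.
    return f"<nobr>{EM_DASH * ems}</nobr>"
```

Whether the result looks like one continuous line still depends on the font, since not all fonts make adjacent em dashes join.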


References

[Caskill] Mary K. McCaskill: Grammar, Punctuation, and Capitalization; A Handbook for Technical Writers and Editors, NASA SP-7084. http://stipo.larc.nasa.gov/sp7084/ (accessed 2000-06-05).

[Oxford] Oxford Advanced Learner’s Dictionary, Fourth Edition, p. 1518, ISBN 0 19 431110 4.

[SI Guide] Guide for the Use of the International System of Units (SI). U.S. Commerce Department’s Technology Administration, National Institute of Standards and Technology (NIST). URL: http://physics.nist.gov/Pubs/SP811/sp811sl.pdf (accessed 2000-06-06).

[Typographical] Typographical measurement systems. By the CSS Pointers Group. URL: http://css.nu/articles/typograph1-en.html (accessed 2002-06-29).

[Unicode] The Unicode Standard Version 4.0. The Unicode Consortium. URL: http://www.unicode.org/versions/Unicode4.0.0/ (accessed 2005-03-31).
Version 3.0 is also available: http://www.unicode.org/unicode/uni2book/u2.html
For information on reading the standard, see my Guide to the Unicode standard.

[Webster] Webster’s Style Manual, in p. 1323–1395 of Webster’s New Encyclopedic Dictionary, ISBN 0-9637056-0-1.