The character commonly known as hyphen originated in early typewriters. The character repertoire had to be kept small, so several characters such as hyphen, en dash, em dash, and minus sign were lumped together. In modern character code standards, the character is called “hyphen-minus” to reflect its ambiguity, but it’s really more ambiguous than the name suggests. It is best to call it “Ascii hyphen”.
This document discusses various dashes and hyphens—loosely speaking, those characters for which we have used Ascii hyphens as surrogates, for lack of anything better.
In Unicode, there is a rather large collection of hyphen- or dash-like characters. Specifically, there is an official list:
|glyph||code||name||notes on the meaning and usage|
|-||U+002D||hyphen-minus||the Ascii hyphen, with multiple usage, or “ambiguous semantic value”; the width should be “average”|
|~||U+007E||tilde||the Ascii tilde, with multiple usage; “swung dash”|
|||U+00AD||soft hyphen||“discretionary hyphen”|
|֊||U+058A||armenian hyphen||as soft hyphen, but different in shape|
|־||U+05BE||hebrew punctuation maqaf||word hyphen in Hebrew|
|᐀||U+1400||canadian syllabics hyphen||used in Canadian Aboriginal Syllabics|
|᠆||U+1806||mongolian todo soft hyphen||as soft hyphen, but displayed at the beginning of the second line|
|‐||U+2010||hyphen||unambiguously a hyphen character, as in “left-to-right”; narrow width|
|‑||U+2011||non-breaking hyphen||as hyphen (U+2010), but not an allowed line break point|
|‒||U+2012||figure dash||as hyphen-minus, but has the same width as digits|
|–||U+2013||en dash||used e.g. to indicate a range of values|
|—||U+2014||em dash||used e.g. to make a break in the flow of a sentence|
|―||U+2015||horizontal bar||used to introduce quoted text in some typographic styles; “quotation dash”; often (e.g., in the representative glyph in the Unicode standard) longer than em dash|
|⁓||U+2053||swung dash||like a large tilde|
|⁻||U+207B||superscript minus||compatibility character which is equivalent to minus sign|
|₋||U+208B||subscript minus||compatibility character which is equivalent to minus sign|
|−||U+2212||minus sign||an arithmetic operator; the glyph may look the same as the glyph for a hyphen-minus, or may be longer|
|⸗||U+2E17||double oblique hyphen||used in ancient Near-Eastern linguistics; not in Fraktur, but the glyph of Ascii hyphen or hyphen is similar to this character in Fraktur fonts|
|⸺||U+2E3A||two-em dash||omission dash, 2 em units wide|
|⸻||U+2E3B||three-em dash||used in bibliographies, 3 em units wide|
|〜||U+301C||wave dash||a Chinese/Japanese/Korean character|
|〰||U+3030||wavy dash||a Chinese/Japanese/Korean character|
|゠||U+30A0||katakana-hiragana double hyphen||in Japanese kana writing|
|︱||U+FE31||presentation form for vertical em dash||vertical variant of em dash|
|︲||U+FE32||presentation form for vertical en dash||vertical variant of en dash|
|﹘||U+FE58||small em dash||small variant of em dash|
|﹣||U+FE63||small hyphen-minus||small variant of Ascii hyphen|
|－||U+FF0D||fullwidth hyphen-minus||variant of Ascii hyphen for use with CJK characters|
The first column above may not actually display the glyph correctly, depending on your browser and on the fonts available on your system.
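Several entries in the table are described as compatibility characters or as variants of other characters. As a rough illustration, the following Python sketch applies NFKC normalization from the standard `unicodedata` module to a few of them and prints what they map to. The helper name `compatibility_base` is made up for this example, and the result depends on the Unicode data shipped with the Python version in use.

```python
import unicodedata


def compatibility_base(ch):
    """Return the NFKC form of ch (ch itself if it has no compatibility mapping)."""
    return unicodedata.normalize("NFKC", ch)


# Superscript/subscript minus, small em dash, small and fullwidth hyphen-minus,
# and the plain hyphen U+2010 for comparison.
for ch in "\u207B\u208B\uFE58\uFE63\uFF0D\u2010":
    base = compatibility_base(ch)
    mapped = " ".join(f"U+{ord(c):04X}" for c in base)
    print(f"U+{ord(ch):04X} -> {mapped}")
```

Note that such normalization also turns the non-breaking hyphen U+2011 into the plain hyphen U+2010, so it should not be applied to text where the distinction matters.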
The notes in the table above are not from
On the other hand, hyphen with diaeresis is not included in the table, although it has the General Category value of Pd (Punctuation, Dash).
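The General Category values can be inspected programmatically. The sketch below uses Python's standard `unicodedata` module; the helper `describe` is a name invented for this example, and the hyphen with diaeresis is looked up by its Unicode name instead of a hard-coded code point.

```python
import unicodedata


def describe(ch):
    """Return (code point, Unicode name, General Category) for one character."""
    code = f"U+{ord(ch):04X}"
    return (code, unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))


# Hyphen-minus, soft hyphen, hyphen, en dash, em dash, minus sign.
for ch in "-\u00AD\u2010\u2013\u2014\u2212":
    print(*describe(ch))

# The character mentioned in the text, found by name.
diaeresis_hyphen = unicodedata.lookup("HYPHEN WITH DIAERESIS")
print(*describe(diaeresis_hyphen))
```

The output illustrates that the category Pd does not coincide with the table: the minus sign is classified as a mathematical symbol (Sm) and the soft hyphen as a format character (Cf).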
There is also a character that is characterized as a nonbreaking hyphen (despite not being very hyphen-like in appearance) but not listed in Table
The Unicode standard includes two nonbreaking hyphen characters: U+2011 non-breaking hyphen and U+0F0C tibetan mark delimiter tsheg bstar.
The swung dash character was added in Unicode 4, and most fonts do not contain it. The phrase “swung dash” normally means a character used, for brevity, in dictionaries to stand for a word or part of a word that occurred previously. In appearance, it is like a large version of tilde ~, and the tilde has often been used in the role of a swung dash, as the alternate name of tilde suggests.
The character hyphen bullet U+2043 is not listed among the dash characters, and there is no cross reference in the description of the hyphen bullet in the code chart. It seems that the hyphen bullet is really meant to be a bullet character that looks like a hyphen (of a kind), rather than comparable to hyphens and dashes.
For a general discussion on line breaking, please refer to Unicode line breaking rules: explanations and criticism. The Unicode Standard Annex #14, Line Breaking Properties, contains most of the information on line breaking in the standard. Note that the annex is a part of the standard. It’s really technical, and the properties assigned to individual characters are in a large data file, so I have composed a summary table.
|glyph||code||name||line breaking property class|
|-||U+002D||hyphen-minus||HY, Hyphen: provide a line break opportunity after the character, except in numeric context|
|~||U+007E||tilde||AL, Ordinary Alphabetic and Symbol Characters|
|||U+00AD||soft hyphen||BA, Break Opportunity After: generally provide a line break opportunity after the character|
|֊||U+058A||armenian hyphen||BA, Break Opportunity After|
|᠆||U+1806||mongolian todo soft hyphen||BB, Break Opportunity Before: generally provide a line break opportunity before the character|
|‐||U+2010||hyphen||BA, Break Opportunity After|
|‑||U+2011||non-breaking hyphen||GL, Non-breaking (“Glue”): prohibit line breaks before or after|
|‒||U+2012||figure dash||BA, Break Opportunity After|
|–||U+2013||en dash||BA, Break Opportunity After|
|—||U+2014||em dash||B2, Break Opportunity Before and After|
|―||U+2015||horizontal bar||AL, Ordinary Alphabetic and Symbol Characters|
|⁓||U+2053||swung dash||AL, Ordinary Alphabetic and Symbol Characters|
|⁻||U+207B||superscript minus||AL, Ordinary Alphabetic and Symbol Characters|
|₋||U+208B||subscript minus||AL, Ordinary Alphabetic and Symbol Characters|
|−||U+2212||minus sign||PR, Prefix (Numeric): don’t break in front of a numeric expression|
|〜||U+301C||wave dash||NS, Non Starter: allow only indirect line break before|
|〰||U+3030||wavy dash||ID, Ideographic: break before or after|
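To make the summary table more concrete, here is a deliberately simplified Python sketch. It hard-codes a few classes copied from the table (they are not read from the Unicode data files) and answers only one question: does the class of a character suggest a break opportunity after it? The names `LINE_BREAK_CLASS` and `may_break_after` are invented for this example, "numeric context" is approximated as "the next character is a digit", and the real rules, which work on pairs of classes, are far more detailed.

```python
# Classes copied from the summary table; an assumption of this sketch,
# not an extract from the Unicode line breaking data file.
LINE_BREAK_CLASS = {
    "\u002D": "HY",  # hyphen-minus
    "\u00AD": "BA",  # soft hyphen
    "\u2010": "BA",  # hyphen
    "\u2011": "GL",  # non-breaking hyphen
    "\u2013": "BA",  # en dash
    "\u2014": "B2",  # em dash
    "\u2212": "PR",  # minus sign
}


def may_break_after(text, i):
    """Rough check: does the class of text[i] suggest a break after it?"""
    cls = LINE_BREAK_CLASS.get(text[i])
    if cls in ("BA", "B2"):
        return True
    if cls == "HY":
        # Crude stand-in for "except in numeric context".
        return not text[i + 1:i + 2].isdigit()
    return False  # GL, PR, and characters not in the mapping


print(may_break_after("left-to-right", 4))       # hyphen-minus inside a word
print(may_break_after("left\u2011to-right", 4))  # non-breaking hyphen
print(may_break_after("-5", 0))                  # hyphen-minus before a digit
```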
The descriptions of the line breaking property classes listed above are from Table 1 of the report. The exact meanings are specified by the more exact rules (partly formalized, partly in English) there. Specifically, the recommended rules include the following:
Note however that the report also says: “Higher level protocols may further restrict, override, or extend the line breaking properties of certain characters in some contexts”. On the other hand, the quality of programs that do line division varies greatly, and the guidelines in the report should be regarded as proposed principles for future software rather than descriptions of current practice. A good formatting algorithm will not, for example, blindly split a word after a hyphen even if that leaves a single character of the word at the start of a line, as Internet Explorer does. Note that such behavior, which occurs in MS Word too, may also affect expressions like “-s” (as in “the normal plural suffix in English is -s”), so it would be safest to use nonbreaking hyphens in such cases, if a sufficiently rich character repertoire can be used reliably.
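As an illustration of the last suggestion, the following Python sketch replaces a hyphen-minus that begins a token and is followed by a letter (as in “-s”) with U+2011 non-breaking hyphen. The function name `protect_affix_hyphens` and the regular expression are assumptions made for this example. The pattern is a heuristic: it leaves hyphens inside words, spaced hyphens, and hyphens before digits untouched, and it may need adjusting for real text, such as command-line options.

```python
import re

NON_BREAKING_HYPHEN = "\u2011"

# A hyphen-minus at the start of the text or after whitespace,
# immediately followed by a letter.
_LEADING_HYPHEN = re.compile(r"(?<!\S)-(?=[^\W\d_])")


def protect_affix_hyphens(text):
    """Replace token-initial hyphen-minus characters with U+2011."""
    return _LEADING_HYPHEN.sub(NON_BREAKING_HYPHEN, text)


sample = "the normal plural suffix in English is -s"
print(protect_affix_hyphens(sample))
```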
Although all reasonably new versions of MS Word support Unicode, there are many peculiarities and oddities in the way it handles Unicode characters. In particular, Word has an Insert/Symbol function where you can insert a character either by picking it up from a table (pane “Symbols”) or by using a quick menu for some commonly used characters (pane “Special Characters”). Some entries in the latter are rather misleading.
In the “Special Characters” menu,
However, when saving data in HTML format, Word 2002 generates the U+2011 non-breaking hyphen (‑) from its internal “Nonbreaking Hyphen” and the U+00AD soft hyphen from its internal “Optional Hyphen”.
It is possible to insert U+2011 or U+00AD e.g. using the “Symbols” pane or, in sufficiently new systems, by typing 2011 Alt-x or ad Alt-x, respectively. The non-breaking hyphen U+2011 then works properly, assuming the font in use contains a glyph for it. The soft hyphen U+00AD however is displayed as a visible hyphen.
When a sufficient character repertoire is available, the following usage rules are suitable, since they comply with old typographic and orthographic principles and the defined Unicode meanings of characters:
a = b - c;
Especially the en dash and em dash have language-dependent uses. The uses mentioned above (as taken from the Unicode standard) should primarily be taken as typical uses in American English. The detailed rules of their usage are obviously orthographical (and stylistic or typographic) rather than character code standard issues.
The Unicode standard mentions:
U+2013 en dash is used to indicate a range of values, such as 1973–1984. It should be distinguished from U+2212 minus, which is an arithmetic operator; however, typographers have typically used U+2013 en dash in typesetting to represent the minus sign. … In older mathematical typography, U+2014 em dash is also used to indicate a binary minus sign.
One might conclude from this that if the minus sign cannot be used but the en dash is available (e.g., when the character repertoire is limited to the so-called Windows character set), the en dash is a better surrogate for the minus sign than the hyphen-minus or the em dash.
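That order of preference can be expressed as a small fallback routine. The Python sketch below tries the minus sign first, then the en dash, then the hyphen-minus, and returns the first one that the target encoding can represent. The function name `minus_surrogate` is invented for this example, and `cp1252` is Python's codec name for the Windows character set mentioned above.

```python
def minus_surrogate(encoding):
    """Return the best available stand-in for the minus sign in an encoding."""
    # Preference order: minus sign, en dash, hyphen-minus.
    for candidate in ("\u2212", "\u2013", "-"):
        try:
            candidate.encode(encoding)
        except UnicodeEncodeError:
            continue
        return candidate
    raise ValueError(f"no minus-like character available in {encoding}")


for enc in ("utf-8", "cp1252", "ascii"):
    print(enc, f"U+{ord(minus_surrogate(enc)):04X}")
```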
Punctuation style varies according to language, style, and even authors’ personal preferences. The use of hyphens and dashes in literary American English seems to be relatively uniform however. The following description is mostly based on a detailed explicit presentation in a style manual [Webster], which seems to reflect the actual practices in high-quality printed publications rather well. However, note that some stylistic usages do not make a distinction between an em dash and an en dash. Simple punctuation rules [Oxford] might just refer to a dash in general. In fact, the situation has been described so that “both European and Anglo saxon typesetters do in fact separate words by close to a full em length in this situation, but the European style is to leave a bit of white space around the (shorter) dash while the Anglo saxon style is to cover the full em length with a correspondingly longer dash instead.” [Typographical]
Hyphens are basically used inside words to separate their parts from each other. This includes using a hyphen between the components of a compound word, often with variation so that the word might also be spelled without the hyphen or as two distinct words. A somewhat similar usage is the use of a hyphen to combine an abbreviation with a suffix, as in D.H.-ing or AA-er, though the apostrophe is more commonly used for such purposes.
An important and well-known use is at the end of a line, to mark that a word has been split. The basic spelling might have a hyphen in the same position; the difference between such a case and a hyphen introduced by a formatting algorithm is not indicated (visually or by the choice of the hyphen character), although the ISO 8859 standards can be interpreted as defining the difference between a normal hyphen-minus and a soft hyphen for such a purpose.
Other uses are more casual. A hyphen is used, especially in linguistics, to indicate that a sequence of letters is a prefix or suffix or otherwise part of a word rather than a word of its own, as in “the plural suffix -en is very rare in English”. Hyphens might also be used to indicate possible hyphenation points or syllable structure of a word, though there are many other notations for this in dictionaries. Very casually, hyphens can be used to indicate stuttering, sobbing, or halting speech, as in y-y-es, or to indicate a word spelled out letter by letter, e.g. p-h-l-e-g-m.
Finally, hyphens are often used as surrogates for other characters, such as en dash. However, this is caused by the insufficiency of the available character repertoire and does not belong to optimal orthography.
The en dash is used to indicate an interval of some sort. This might mean a range, as in “ages 10–15”, or a route, as in “Chicago–Memphis train”. Generally, the en dash thus means “to”.
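When text has been typed with Ascii hyphens as surrogates, simple numeric ranges can be converted mechanically. The Python sketch below replaces the hyphen-minus with an en dash only between two runs of digits that are not part of a longer digit-and-hyphen sequence, so that strings like ISO dates are left alone. The function name `en_dash_ranges` and the pattern are assumptions for this example. The pattern cannot tell a range from a subtraction written without spaces, so the result still needs human review.

```python
import re

EN_DASH = "\u2013"

# Two digit runs joined by one hyphen-minus, with no digit or hyphen
# directly before or after the whole expression.
_NUMERIC_RANGE = re.compile(r"(?<![\d-])(\d+)-(\d+)(?![\d-])")


def en_dash_ranges(text):
    """Replace the hyphen-minus in simple numeric ranges with an en dash."""
    return _NUMERIC_RANGE.sub(r"\1" + EN_DASH + r"\2", text)


print(en_dash_ranges("ages 10-15"))
print(en_dash_ranges("published 1973-1984, revised 2010-01-15"))
```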
Note that even if the real en dash character is used, confusion with the minus sign is possible, at least in principle, since the two characters look similar and there is no standard width for either of them. For such reasons, it has been recommended [SI Guide, section 7.7] that the word “to” be used instead when denoting numeric ranges, as in “0 V to 5 V” (instead of “0 V–5 V”). Obviously, in other languages other expressions need to be used. Note that not all languages have handy prepositions like “to”. One might consider using two or three dots, though then there might be confusion with the decimal point.
Various other usages are also known, such as using the en dash in expressions like “pre–Civil War” or “shock-wave–boundary-layer interaction”. In such usage, a hyphen would normally be used, but since a part of a compound phrase is itself hyphenated or consists of several words, some authors use an en dash instead, in the role of a hyphen with a “scope” different from normal. Some authors use the en dash for any compound attribute where the parts have equal weight; thus, they would use a hyphen for “big-boned woman” or “high-altitude test”, but they would use an en dash for “true–false test” or “question–answer format”. Such usages could be seen as conflicting with the normal distinction between a hyphen and an en dash, however. The same applies to the use of an en dash in place of a hyphen in all capital text [Caskill].
The em dash is a multiple-use punctuation symbol, but it basically separates major parts of a statement, as opposed to the hyphen and the en dash, which have more “local”, separative functions.
The uses of the em dash can be classified as follows:
For parenthetic remarks, em dashes are common in literary usage, whereas scientific usage favors parentheses. It is also possible to use a style that distinguishes between different parenthetic remarks [Caskill]:
- Commas (most frequently used) indicate only a slight separation in thought from the rest of the sentence.
- Dashes emphasize the element enclosed and clarify meaning when the element contains internal commas.
- Parentheses indicate that the enclosed element is only loosely connected to the rest of the sentence and therefore tend to de-emphasize it.
Sometimes “long dashes” are used: “a two em dash” might indicate missing letters in a word (or sometimes a missing word), whereas “a three em dash” might indicate that a word or phrase has been left out. The latter is often used in bibliographies to indicate that a cited work is written by the same author as the preceding one. (If such a dash is built from several consecutive dash characters, use nobr markup around them, or some other method to prevent undesired line breaks.)
In Unicode version 6.1 (2012), the three-em dash character was added, together with two-em dash. (See discussion on 2-em and 3-em dashes in the Unicode mailing list in January 2010.) It typically takes many years before a character added to Unicode becomes generally available in fonts, and these characters are probably no exception.
Mary K. McCaskill: Grammar, Punctuation, and Capitalization; A Handbook for Technical Writers and Editors.
Oxford Advanced Learner’s Dictionary, Fourth Edition, p. 1518.
Guide for the Use of the International System of Units (SI). U.S. Commerce Department’s Technology Administration, National Institute of Standards and Technology.
Typographical measurement systems. By Jan Roland Eriksson.
The Unicode Standard. The Unicode Consortium. For information on reading the standard, see my Guide to the Unicode standard.
Webster’s Style Manual, pp. 1323–1395 of Webster’s New Encyclopedic Dictionary.