Internet E-mail address format (RFC 822) explained

RFC 822 has been superseded by RFC 2822. The changes are probably small, but I haven't yet checked whether they affect the content of this document.

RFC 822, titled Standard for the format of ARPA Internet text messages, defines the Internet E-mail ("electronic mail") address. It is one of the oldest and most fundamental Internet standards (registered as STD 11). This document explains the address format defined in section 6 Addressing, as officially modified by clauses 5.2.15 and 5.2.16 of RFC 1123.

Basic definitions for address syntax

Address

     address     =  mailbox                      ; one addressee
                 /  group                        ; named list

Legend: The solidus / indicates alternative (or). In definitions like this, taken verbatim from RFC 822, the semicolon (;) opens an explanatory statement (comment) which is not part of the formal definition.

Normally an address is a mailbox, or a simple address. It could also be a group specification, though this possibility is rarely used. Note that the word mailbox is very often used (outside RFC 822) to refer to a file to which a system's E-mail software appends any incoming E-mail sent to an address (normally, a user). So there's a connection, but in RFC 822, mailbox is a syntactic and logical term which identifies a recipient rather than a store or a set of messages.

Group

     group       =  phrase ":" [#mailbox] ";"

Legend: The brackets [] indicate optionality, i.e. the parts enclosed in them can be present, or can be omitted. Quotation marks surround literals which must appear exactly as written (without the quotation marks of course). The number sign # is a prefix that indicates that the construct following it may be repeated any number of times, using commas as delimiters; thus, #mailbox means any number (zero or more) of mailboxes separated by commas.

This would allow an address like
foo:a@b.example,c@d.example,e@f.example;
But I don't think I've ever seen it used - perhaps I just didn't notice. Mailing lists are more commonly used, and a mailing list address could appear as syntactically just one address (mailbox).
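As a rough illustration, a group like the one above could be taken apart as follows. This is only a sketch under a strong assumption: the text contains no quoted strings or comments, which a real RFC 822 parser would have to handle. The function name split_group is my own, not anything defined by the RFC.

```python
def split_group(text):
    """Split 'phrase:mailbox,mailbox;' into (phrase, [mailboxes]).

    Naive sketch: assumes no quoted strings or comments that could
    themselves contain ':', ',' or ';'.
    """
    text = text.strip()
    if not text.endswith(";") or ":" not in text:
        raise ValueError("not a group: expected phrase ':' [#mailbox] ';'")
    phrase, _, rest = text[:-1].partition(":")
    phrase = phrase.strip()
    if not phrase:
        raise ValueError("a group needs a non-empty phrase")
    # The mailbox list is optional, so an empty list is acceptable.
    mailboxes = [m.strip() for m in rest.split(",") if m.strip()]
    return phrase, mailboxes


print(split_group("foo:a@b.example,c@d.example,e@f.example;"))
```

Because the mailbox list is optional, a group such as nobody:; is accepted and yields an empty list.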

Mailbox

     mailbox     =  addr-spec                    ; simple address
                 /  [phrase] route-addr            ; name & addr-spec

     route-addr  =  "<" [route] addr-spec ">"

A mailbox can be just an address specification (addr-spec), but it could also be such a specification enclosed between "<" and ">", in which case it can be preceded by a comment-like phrase, such as the user's real name. The syntax also allows a route specification in the latter case, but this is rarely used nowadays.

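Note that there are two ways to add information like a real name (say Jukka Korpela) to an address (say jkorpela@cc.hut.fi):

Jukka Korpela <jkorpela@cc.hut.fi>
jkorpela@cc.hut.fi (Jukka Korpela)

The first form puts a phrase before a route-addr; the second appends a comment in parentheses, which is not part of the address itself.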
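To see how software treats a real name attached to an address, either as a phrase before an angle-bracketed address or as a parenthesized comment after a bare address, one can try the parseaddr function from Python's standard email.utils module. This is just an illustration with one particular library, not a statement about how mail software in general behaves.

```python
from email.utils import parseaddr

# Form 1: a phrase followed by the address in angle brackets.
with_phrase = parseaddr("Jukka Korpela <jkorpela@cc.hut.fi>")

# Form 2: the bare address followed by a comment in parentheses.
with_comment = parseaddr("jkorpela@cc.hut.fi (Jukka Korpela)")

# Each result is a (real name, address) pair.
print(with_phrase)
print(with_comment)
```

In both cases the second element of the pair is the addr-spec proper; the name is only accompanying information.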

Route

     route       =  1#("@" domain) ":"           ; path-relative

Legend: When the number sign # prefix is preceded by a number, it indicates that the construct following it may be repeated any number of times, using commas as delimiters, but must occur at least once. Parentheses indicate just grouping here and must not occur in the actual data.
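The following sketch shows how a route-addr with a route could be split into its relay domains and the final addr-spec. It assumes that there are no quoted strings, comments or domain literals in the text; the function name split_route_addr and the host names in the example are made up for illustration.

```python
def split_route_addr(text):
    """Split '<@a,@b:local@domain>' into ([route domains], addr-spec).

    Naive sketch: assumes no quoted strings, comments or domain literals.
    """
    text = text.strip()
    if not (text.startswith("<") and text.endswith(">")):
        raise ValueError("route-addr must be enclosed in '<' and '>'")
    inner = text[1:-1]
    if not inner.startswith("@"):
        return [], inner  # no route, just an addr-spec
    route, colon, addr_spec = inner.partition(":")
    if not colon:
        raise ValueError("a route must end with ':'")
    hops = [h.strip() for h in route.split(",") if h.strip()]
    # 1#("@" domain): at least one element, each of the form '@domain'.
    if not hops or any(not h.startswith("@") or len(h) == 1 for h in hops):
        raise ValueError("a route needs at least one '@domain'")
    return [h[1:] for h in hops], addr_spec


print(split_route_addr("<@relay1.example,@relay2.example:user@host.example>"))
print(split_route_addr("<user@host.example>"))
```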

Address specification

     addr-spec   =  local-part "@" domain        ; global address

This is what "Internet E-mail address" normally means. If you are asked for your E-mail address, this is what people want; they may add some comment-like material to it when they use it, e.g. to send E-mail to you.

Local part

     local-part  =  word *("." word)             ; uninterpreted
                                                 ; case-preserved

Legend: The asterisk * prefix indicates that the construct following it may occur any number of times (but need not occur at all). Thus, local-part is a sequence of one or more words separated with full stops (dots, periods), such as jkoo or "Jukka Korpela" or Jukka.Korpela or just.an.example.you.know. (As explained below, the use of quotation marks turns anything into a word, in the syntactic sense that is relevant here.)

RFC 822 discusses the meaning of a local-part as follows:
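A minimal sketch of splitting a local-part into its words might look like this. The regular expressions are my own rendering of the plain English descriptions of atom and quoted string given later in this document, not something taken from RFC 822, and the function name split_local_part is hypothetical. Comments and folded white space are not handled.

```python
import re

# An atom: one or more printable ASCII characters other than space
# and the RFC 822 specials.
ATOM = r"[!#$%&'*+\-/0-9=?A-Z^_`a-z{|}~]+"
# A quoted string: '"', then ordinary characters or backslash pairs, then '"'.
QUOTED = r'"(?:[^"\\\r]|\\.)*"'
WORD_RE = re.compile("(?:" + ATOM + "|" + QUOTED + ")", re.DOTALL)


def split_local_part(local_part):
    """Return the list of words that make up a local-part."""
    words = []
    pos = 0
    while True:
        match = WORD_RE.match(local_part, pos)
        if match is None:
            raise ValueError("expected a word at position %d" % pos)
        words.append(match.group())
        pos = match.end()
        if pos == len(local_part):
            return words
        if local_part[pos] != ".":
            raise ValueError("expected '.' at position %d" % pos)
        pos += 1


for example in ["jkoo", '"Jukka Korpela"', "Jukka.Korpela"]:
    words = split_local_part(example)
    # Joining the words with dots restores the local-part as written.
    assert ".".join(words) == example
    print(example, "->", words)
```

Strings such as a..b or a. are rejected, since every full stop must be followed by another word.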

The local-part of an addr-spec in a mailbox specification (i.e., the host's name for the mailbox) is understood to be whatever the receiving mail protocol server allows. For example, some systems do not understand mailbox references of the form "P. D. Q. Bach", but others do.

This specification treats periods (".") as lexical separators. Hence, their presence in local-parts which are not quoted-strings, is detected. However, such occurrences carry no semantics. That is, if a local-part has periods within it, an address parser will divide the local-part into several tokens, but the sequence of tokens will be treated as one uninterpreted unit. The sequence will be re-assembled, when the address is passed outside of the system such as to a mail protocol service.

Within a domain, local-parts with periods are often used and processed in a uniform way, e.g. using a firstname.lastname structure. The point of the text quoted above is that all such conventions depend on local arrangements, and E-mail processing software just passes the local-part as such to the recipient system. The sender's software has no way of knowing what the recipient system will do with the local-part. An official amendment to RFC 822 (clause 5.2.16 of RFC 1123) clarifies:

A host that is forwarding the message but is not the destination host implied by the right-hand side "domain" must not interpret or modify the "local-part" of the address.

Domain

     domain      =  sub-domain *("." sub-domain)

     sub-domain  =  domain-ref / domain-literal

     domain-ref  =  atom                         ; symbolic reference

Thus, syntactically, domain is a sequence of one or more sub-domains (normally atoms) separated with full stops (dots, periods), such as foo.bar.zap.example or cc.hut.fi or hut.fi.

Domain-literals will not be discussed here. They allow a domain to be specified by its numeric (IP) address, e.g. [10.0.3.19]. Syntactically, a domain literal consists of a bracketed string of characters, with some limitations on the character repertoire. RFC 822 strongly discourages the use of domain literals.

Basically, a domain part in an E-mail address is the hierarchical Internet domain name, with the top-level domain on the right. To make things work, the top-level domain names must be registered in a centralized manner and publicly; names directly under each domain (subdomains) must be registered by the authority for that domain; etc. Note that a domain name might refer to a particular computer, and often does, but it need not. Quite often domain names reflect some administrative hierarchy; for example, cs.hut.fi is the domain of the Computer Science laboratory of the Helsinki University of Technology, Finland. See the discussion of domain semantics in RFC 822 for more information.
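The hierarchy can be made visible with a small sketch that lists the enclosing domains of a name, starting from the top-level domain on the right. It assumes a plain dotted name without domain literals; the function name domain_hierarchy is hypothetical.

```python
def domain_hierarchy(domain):
    """List the enclosing domains of a dotted name, top-level domain first."""
    labels = domain.split(".")
    return [".".join(labels[i:]) for i in range(len(labels) - 1, -1, -1)]


print(domain_hierarchy("cs.hut.fi"))
```

For cs.hut.fi this gives fi, then hut.fi, then cs.hut.fi, which is the order in which registration authority is delegated.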

Lower-level syntactic constructs

The terms "word", "atom", and "phrase" have been used above, but not syntactically defined here yet. The syntax as presented in section 3.3 Lexical tokens in RFC 822 is a bit complicated, so here we give a plain English description.

A phrase is a word or a sequence of words.

A word is either an atom or a quoted string.

An atom is a sequence of printable ASCII characters except space or any of the following:
()<>@,;:\".[]
Positively speaking, this means that the valid constituents of an atom are the following:

!#$%&'*+-/0123456789=?
ABCDEFGHIJKLMNOPQRSTUVWXYZ^_
`abcdefghijklmnopqrstuvwxyz{|}~
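The set of valid atom characters can also be derived mechanically: take the printable ASCII characters other than space (codes 33 to 126) and drop the excluded ones. The variable names below are my own.

```python
# The characters excluded from atoms, as listed above.
SPECIALS = '()<>@,;:\\".[]'

atom_chars = "".join(
    chr(code)
    for code in range(0x21, 0x7F)  # printable ASCII, space excluded
    if chr(code) not in SPECIALS
)

print(atom_chars)
print(len(atom_chars), "characters")
```

Of the 94 printable non-space characters, 13 are excluded, which leaves 81.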

A quoted string is formed by using normal ASCII quotation marks (") around a string, and it's a way of turning almost any string syntactically into a word. This means for example that a string containing a space (say, Jukka Korpela) becomes acceptable when quoted, in a context where the syntax requires a word. A quoted string may contain any ASCII character, but quotation mark (") or carriage return (CR control code) must be preceded by a reverse solidus (backslash, \), and the reverse solidus itself as a character must be written doubled (\\).

Note that RFC 822 limits the character repertoire to ASCII. In practice, other characters (such as ä or é) usually work in comments and in quoted strings used for commenting purposes, but they must not be used in addresses proper.
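The quoting rules can be sketched as a small helper that leaves a string alone if it already is an atom and otherwise turns it into a quoted string. The name as_word is hypothetical, and the atom pattern is my own rendering of the character list above.

```python
import re

ATOM_RE = re.compile(r"[!#$%&'*+\-/0-9=?A-Z^_`a-z{|}~]+")


def as_word(text):
    """Return text unchanged if it is an atom, else as a quoted string."""
    if not text.isascii():
        raise ValueError("RFC 822 limits the character repertoire to ASCII")
    if ATOM_RE.fullmatch(text):
        return text
    # Double the backslashes first, so that the backslashes added for
    # quotation marks and CR are not doubled again.
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\r", "\\\r")
    )
    return '"' + escaped + '"'


print(as_word("jkoo"))
print(as_word("Jukka Korpela"))
print(as_word('say "hi"'))
```

The order of the replacements matters: escaping the quotation marks before doubling the backslashes would double the newly added backslashes as well.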


Note to Finnish readers: I have written a Finnish-language summary of RFC 822 and its amendments.

Date of last update: 2001-05-02. A minor technical correction 2014-04-25.

Jukka Korpela