Why tilde (~) should not be used in Web addresses (URLs)

If used in Web addresses (URLs), the tilde character (~) should be encoded (as %7e or %7E). Although in most cases things work if you violate this, there is no reason to do so, since well-defined, universally working alternatives exist. This document, in addition to describing the issue in principle, also discusses the different practical problems that may arise when tilde is used in Web addresses.

What the specifications say

In the long-standing RFC on URL format, RFC 1738, there was an explicit requirement that any occurrence of the tilde (~) character in a Web page address (URL, a.k.a. URI) shall be encoded as %7e or, equivalently, as %7E. (For example, http://www.hut.fi/~jkorpela/ was thus incorrect, while http://www.hut.fi/%7ejkorpela/ was and is syntactically correct.)
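
As a small illustration of the encoding described above, here is a minimal Python sketch. The helper name encode_tilde is made up for this example. A plain string replacement is used because recent Python versions treat ~ as unreserved, so urllib.parse.quote leaves it unencoded.

```python
from urllib.parse import unquote


def encode_tilde(url: str) -> str:
    """Replace each literal tilde with its percent-encoded form (%7e)."""
    # Hypothetical helper for illustration; it touches nothing but "~".
    return url.replace("~", "%7e")


plain = "http://www.hut.fi/~jkorpela/"
encoded = encode_tilde(plain)
print(encoded)

# Both spellings of the escape, %7e and %7E, decode to the same character.
print(unquote("%7e") == unquote("%7E") == "~")
```

The sketch only shows that the two notations name the same resource; it does not try to be a general URL encoder.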

In a newer RFC, namely RFC 2396, some requirements have been relaxed. In particular, tilde and some other characters are now classified as "unreserved" (in effect, safe), and thus no longer require encoding.

However, the encoded notation is still a valid alternative and works more reliably. It's not so much a matter of old networking software; the tilde character causes problems for other software which is used to process documents - and for human readers.

For a short summary of URL format, including the encoding mechanism, see section URLs in my Learning HTML 3.2 by Examples.

The reasons

RFC 1738 explains (in clause 2.2) the reasons for the encoding requirement very briefly. It mentions tilde among those characters which are classified as "unsafe", because "gateways and other transport agents are known to sometimes modify such characters". Some people argue that such problems no longer exist in practice. And it is true that probably the great majority of programs directly related to Web browsing (such as browsers and servers) can handle tilde.

However, tilde is still problematic. When did you last see a correctly cited URL in your local newspaper? It's almost hopeless when journalists write them by hand. In my experience, they get tildes wrong more than half of the time. To describe the problems more systematically, here is a list:

Is the %7e solution really good?

Of course, the notation %7e is mystical to most people. Since it looks cryptic, it can easily be misread, misremembered, or mistyped. In a Usenet article, Warren Steel first gives some examples of how unescaped ~ is misunderstood, then explains why %7e might cause problems too:

In my site logs I have noticed an increase in errors due to the mistyping of the tilde: /-mudws /_mudws /=mudws etc. ...

... The combination /%7Emudws also proves troublesome to many--the % is often misread as a & or other symbol, and the introduction of mixed cases to the case-sensitive path segment adds another danger, and /%7EMUDWS is clearly wrong ( /%7emudws is theoretically correct). The one time I gave the "escaped" URL to a newspaper, it was garbled as badly as the tilde version.

As regards experiences with newspapers, I once sent an article to the leading Finnish newspaper and mentioned the URL http://www.hut.fi/%7ejkorpela/tekoik.html and they printed it as http://www.hut.fi@jkorpela/tekoik.html (unbelievable, but true!).

Thus, although using %7e is to be preferred over incorrectly using plain ~ in URLs, it is by no means an optimal solution. But we have to ask what causes the whole problem in the first place.

The real problem: tildes in home page URLs

The need for using tildes in URLs is caused - almost exclusively - by a strange practice of using URLs of the form
http://server/~username/filename
(e.g. http://www.hut.fi/~jkorpela/tilde.html)

This is a strange Unixism in the World Wide Web, imitating the Unix practice of referring to the home directory of a user by notations like ~ (the user's own home directory) and ~username (the home directory of user username). More exactly, this is a convention applied in many (but not all) Unix shells, or command interpreters; it does not work universally even in the Unix universe.

There is hardly any explainable reason why such a convention was ever adopted. There is definitely nothing intuitive about it. How could you guess that ~ stands for 'home directory of'? Thus, people with no Unix background most probably have difficulties in realizing what the funny symbol ~ stands for.

Further confusion is caused by the fact that the notation ~username does not even have the same meaning in URLs as in (some) Unix shells. In URLs, it typically refers to a subdirectory of the user's home directory rather than to the home directory itself. People really do get confused by this. For example, consider the URL of this document when written in the notation with an unencoded tilde in it: http://www.cs.tut.fi/~jkorpela/tilde.html. People who have direct access to the file system in which the file resides cannot use the file name ~jkorpela/tilde.html if they wish to refer to it locally and not via the Web; they need to write ~jkorpela/public_html/tilde.html in their Unix commands.
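
The difference between the URL notation and the shell notation can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name url_path_to_file is invented here, home directories are assumed to live under /home, and public_html is assumed to be the Web subdirectory, as in the example above. A real server's configuration may differ on each point.

```python
import posixpath
from urllib.parse import unquote

HOME_ROOT = "/home"         # assumption: where home directories live
WEB_SUBDIR = "public_html"  # assumption: per-user Web subdirectory


def url_path_to_file(url_path: str) -> str:
    """Map a /~username/... URL path to a file name under the user's
    Web subdirectory (not the home directory itself)."""
    decoded = unquote(url_path)  # accepts ~, %7e and %7E alike
    if not decoded.startswith("/~"):
        raise ValueError("not a ~username path: " + url_path)
    username, _, rest = decoded[2:].partition("/")
    if not username:
        raise ValueError("missing user name: " + url_path)
    # Sketch only: no protection against ".." segments in the path.
    return posixpath.join(HOME_ROOT, username, WEB_SUBDIR, rest)


print(url_path_to_file("/~jkorpela/tilde.html"))
print(url_path_to_file("/%7ejkorpela/tilde.html"))
```

Under these assumptions, both calls yield the same file name, which contains the public_html component that the URL never shows.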

It's really a matter of configuring Web servers properly. People who are responsible for such things should make them map URLs into file names in a manner which makes tildes in URLs unnecessary. Typically, references to people's pages should be something like
http://server/u/username/filename
Webmasters may wish to configure the server to recognize formats with something more explanatory than u there (say, users or home), either as the only option or as an additional option. Notice however that having several options there may cause problems, since people and programs may not realize that they are synonymous. Personally, I think u is just fine: it's short, easy to remember, and whatever you think about its mnemonic value, it's definitely better than either ~ or %7e. (On small servers, one might even consider a mapping scheme where the personal page URLs are of the form http://server/username/filename but on large servers that might cause too much maintenance trouble.)
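
One way to limit the synonym problem is to treat a single form as canonical and map the others onto it, for instance by redirecting. The following Python sketch shows only the path rewriting; the function name canonical_user_path and the list of legacy prefixes are assumptions made for this example, not the configuration of any particular server.

```python
CANONICAL_PREFIX = "/u/"

# Assumption: the alternative spellings a server might still accept.
LEGACY_PREFIXES = ("/~", "/%7e", "/%7E", "/users/", "/home/")


def canonical_user_path(url_path: str) -> str:
    """Rewrite legacy personal-page prefixes to the single /u/ form.

    Paths that match no legacy prefix are returned unchanged.
    """
    for prefix in LEGACY_PREFIXES:
        if url_path.startswith(prefix):
            return CANONICAL_PREFIX + url_path[len(prefix):]
    return url_path


for path in ("/~jkorpela/tilde.html",
             "/%7ejkorpela/tilde.html",
             "/users/jkorpela/tilde.html",
             "/u/jkorpela/tilde.html"):
    print(path, "->", canonical_user_path(path))
```

A server that answers the legacy forms with a redirect to the canonical path would let old links keep working while people and programs see only one address per page.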

Summary

To conclude, I strongly recommend


Date of last update: 1999-08-27. Technical corrections 2004-12-12.

This document is largely based on a discussion with the subject "should ~ (tilde) be escaped as %7E?" in 1997 in the c.i.w.a.h. newsgroup.

Jukka Korpela