Techniques for multilingual Web sites:
Notes

Producing the translations

When producing different language versions, automatic translation programs might be used to some extent; see my article Translation-friendly authoring. However, a competent human translator should be responsible for the translation work. Ideally the translator knows the basics of HTML and can produce the translation directly as an HTML document. The material to be translated could then be delivered as an HTML document, and the translator would replace the texts, (usually) leaving the HTML markup as it is.

Alternatively, the text could be given to the translator either as a plain text file or as displayed by a Web browser, for example as printed on paper. In the latter case the translator could deduce some relevant information from the appearance of the text. On the other hand, HTML markup could better convey the intended structure of the document, which may have some significance in selecting between alternatives in the translation. In any case, if the translator sends only the translated text, then someone else has to put it into HTML format, in practice by merging the text with HTML markup, and this cannot be done without knowing the language of the translation to some extent.

Translation or transmogrification?

The versions of a page in different languages can be "pure translations" of each other; in practice that would usually mean that one of the versions is the original and the other versions have been translated from it. A "pure translation" consists of the original document, with the content and form strictly preserved, just expressed in another language. This means, for example, that the translation also contains the same factual errors as the original, the same references to local (e.g. Finland-specific) states of affairs, etc.

Quite often a pure translation is not appropriate for the purposes of the page. On the other hand, it is not appropriate to use the language negotiation mechanism to distribute documents that have completely different content and merely share the same topic. It is difficult to draw the line.

The specification of the language negotiation mechanism does not require that the versions be exactly equivalent. On the contrary, the mechanism contains the possibility of specifying quality values, which may result in the selection of a version in a language that ranks lower in the user's preferences than another available language, due to a quality difference. For example, if the user knows German a little better than French, he could have specified this in his language preferences; if the server has a version of the requested document in German but also a considerably more up-to-date or more extensive version in French, it might respond by sending the latter. In practice such situations are probably still rare, partly because popular browsers do not let users control the quality values associated with languages, only the repertoire and ordering of languages in the user's preferences.
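
The German and French example can be illustrated with a small Perl sketch. It assumes that the server ranks each version by the product of the user's quality value and a quality value that the site maintainer has assigned to that version; real servers may combine the values differently. The numbers and the function name best_version are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical data for the example in the text: the user knows German
# slightly better than French, but the French version is more up to date.
my %user_q   = (de => 1.0, fr => 0.8);   # from the user's preferences
my %server_q = (de => 0.5, fr => 1.0);   # assigned by the site maintainer

# Return the language whose combined quality is highest.
# Multiplying the two values is an assumption made for this sketch.
sub best_version {
    my ($user, $server) = @_;
    my ($best, $best_q) = (undef, 0);
    foreach my $lang (sort keys %$server) {
        my $q = ($user->{$lang} || 0) * $server->{$lang};
        ($best, $best_q) = ($lang, $q) if $q > $best_q;
    }
    return $best;
}

print best_version(\%user_q, \%server_q), "\n";   # fr, since 0.8 beats 0.5
```

With equal server-side values for both versions, the same function would return de, following the user's own ordering.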

As regards localization, care must be taken. For example, when writing an English version of a Finnish document one should not blindly assume that all references to Finland-specific information are to be removed or replaced by something more international. It is quite possible that the reader lives in Finland or is otherwise interested in the situation in Finland.

Indicating what is available in each language

When you have a multilingual site, it is crucial to tell people what is really available in different languages. For example, if your site is predominantly in German but has a few pages in English as well, make it very clear in the English version that it presents only a small part of the information available in German. Otherwise a visitor who knows both languages but prefers English might never make real use of the site.

It is mostly sufficient to include such information on the main page in each language. But, for example, if the site contains a news page where some but not all of the articles are available in German too, then it would be misleading to make the German version contain those articles only. Instead, the news page should at least say that more news is available in English (naturally with a link to the English page). It could also contain links to English news articles that have not been translated, merged with the news in German. Preferably, the headlines of such news should appear in translation, along with a clear indication that the link points to text in English.

Naming the versions

When selecting Web addresses (URLs) for versions of documents in different languages, a systematic approach is often desirable, for practical reasons like creating and maintaining the pages. This can be implemented in different ways; the method could, for example, be either of the following:

Both methods have the problem that the "proper name" of the document (in our example, foo) had better be reasonably understandable internationally. This can be achieved rather naturally, if you can use some widely known abbreviation or a part of an international word - but these are in principle exceptional situations. In other cases the addresses may look rather odd; something like katsaus-en is strange to people who don't know Finnish. Although users normally should not have to read and type URLs, such things are often necessary in practice. In principle the addresses of the different versions could be totally independent of each other, e.g. with file name parts like katsaus.html, review.html etc., but this would complicate the implementation and maintenance, especially as regards settings for language negotiation.

In practice the URLs of the different versions are mostly determined by the file names; usually the last part of a URL maps to a file name. In principle a URL and a file name need not have anything to do with each other.

Character issues

This documentation does not discuss problems with characters, although such problems inevitably exist, at least when some of the languages involved are not written in a Latin script. Such problems have been excluded here due to the complexity of the topic; it is best to divide the problem area into manageable pieces. There's also a practical reason: in quite a few situations, the languages involved can be written using the ISO Latin 1 character repertoire, so that multilingualism does not create any new character problems (in addition to character problems with, say, German or French).

For a general discussion of character code problems in HTML authoring, see my document Using national and special characters in HTML.

What about the lang attribute?

We have not discussed the lang attribute, which can be used in HTML markup to specify the natural language used. For a description of the ideas behind it as well as the technicalities, see section Language information and text direction of the HTML specification. Unfortunately the support for this attribute in browsers and other programs is almost nonexistent, though the situation is improving.

It needs to be emphasized, however, that the lang attribute in an HTML document, even if set globally there (<html lang="...">), is not expected to affect and does not affect language negotiation in HTTP. The negotiation takes place without considering the content of a document (textual content or markup) at all.

Country and language

Surprisingly often Web authors try to find out which country the user comes from and redirect him to a version of a document in a particular language. In addition to the fact that there is no reliable way of determining the country using any automatic mechanism (partly due to .com, .org etc. domains), the whole idea is flawed. You cannot deduce language from country (or vice versa). Making a guess about a language according to country is useless, even harmful, in contexts like this. If a Swedish-speaking Finnish citizen gets directed to a page in Finnish, despite the availability of a Swedish version, then it is irrelevant that the probability of guessing right was, say, over 90%.

Language preferences and JavaScript

In the JavaScript language, it is under some conditions possible to determine the browser language. This however is almost always useless, and it has nothing to do with the user's language preferences. The browser language is just the language of the browser's user interface, i.e. the language used in menus, error messages, etc.

It is very common to use English versions of browsers, either because there are no alternatives or because versions in other languages have confusing translations of terms. The basic use of a browser does not require much understanding of the browser language, since most of the basic functions can be activated using icon buttons or other simple tools, so that it suffices to know a very small repertoire of words.

There is no fundamental reason why a language like JavaScript could not be used for reading the language preference settings in a browser. But it seems that currently such a possibility does not exist. (In Netscape, a so-called signed script can read language preferences; but then the browser asks the user for permission, so you might just as well set up a browser-independent dialog for language selection.) Such a feature would be of limited usefulness, but it would make it possible to make JavaScript-generated texts appear in different languages according to language preferences.

About tools

It is rather awkward to create and maintain a multilingual site without suitable tools, even if the creation and maintenance of the different language versions themselves can be handled. The important question of tools will not be discussed here, except by sketching one solution for a small site.

This documentation was created, after testing various approaches, as follows:

Making use of language preferences in CGI scripts

In CGI scripts, it is possible to utilize language preferences as sent by browsers. The value of the Accept-Language header as defined in the protocol is available to a CGI script as the environment variable HTTP_ACCEPT_LANGUAGE (which needs to be written this way, using uppercase letters).
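
As a minimal sketch, a CGI script could simply echo the value back to the user. The function name language_report and the wording of the page are made up for this illustration; only the environment variable name comes from the CGI interface.

```perl
use strict;
use warnings;

# Build a small HTML fragment that shows the Accept-Language value.
# The function name and the wording are hypothetical.
sub language_report {
    my ($accept) = @_;
    return "<p>Your browser sent no language preferences.</p>\n"
        unless defined $accept && $accept ne '';
    # Escape characters that are special in HTML before echoing the value.
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $accept =~ s/([&<>"])/$entity{$1}/g;
    return "<p>Accept-Language: <code>$accept</code></p>\n";
}

# A CGI response starts with headers, followed by an empty line.
print "Content-Type: text/html; charset=utf-8\n\n";
print language_report($ENV{'HTTP_ACCEPT_LANGUAGE'});
```

The value is escaped before output because it comes from the request and must not be trusted to be harmless text.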

According to the protocol, the value of this variable contains a comma-separated set of parts, each of which consists of a language code that is optionally followed by the specification of a q value. It is relatively easy to parse this e.g. in a CGI script written in Perl, using the split function for division into parts. The following code sample performs this and sets the variable $preferred to the language code that corresponds to the language that is primary according to the preferences. Here we set English as the default language, to be implied, if the browser sends no preferences.

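# Value of the Accept-Language header; empty if the browser sent none.
my $accept = $ENV{'HTTP_ACCEPT_LANGUAGE'} || '';
my @prefs = split(/,/, $accept);
my $preferred = 'en';   # default language
my $prefq = 0;
foreach my $pref (@prefs) {
   $pref =~ s/\s+//g;   # parts may contain spaces, as in "fi, en;q=0.8"
   next if $pref eq '';
   my ($lang, $qval);
   if($pref =~ /^(.*);q=([0-9.]+)$/i ) {
      $lang=$1; $qval=$2; }
   else {
      $lang=$pref; $qval= 1; }   # a missing q value means q=1
   if($qval > $prefq) {
      $preferred = lc $lang; $prefq = $qval; }}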

The result can be used e.g. to index a hash containing language-dependent strings. For example, if a CGI script in Perl that dynamically generates an HTML document is to write texts either in Finnish or in English, we could put the alternative texts into a hash and pick the right text from it, as the following example shows:

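my %gen = ('en' => 'Report generated at',
           'fi' => 'Raportin luontihetki:');
 - -
# Fall back to English if no text exists in the preferred language.
my $text = $gen{$preferred} || $gen{'en'};
print "<div>$text $now.</div>";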

The little service for showing the language settings (which was already mentioned in section Language settings in browsers) is based on the technique discussed above.
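
The first fragment above selects only the single highest-ranked language, which may be one for which no texts exist. A somewhat fuller sketch would order all the languages by q value and pick the first one that the script actually supports. The function names parse_accept_language and choose_language are made up for this illustration, and the fallback from a code like en-GB to en is a convenience chosen here, not something the protocol prescribes.

```perl
use strict;
use warnings;

# Parse an Accept-Language value into language codes ordered by descending q.
sub parse_accept_language {
    my ($accept) = @_;
    $accept = '' unless defined $accept;
    my @prefs;
    foreach my $part (split /,/, $accept) {
        $part =~ s/\s+//g;                          # drop optional whitespace
        my ($lang) = $part =~ /^([^;]+)/;           # text before any parameter
        next unless defined $lang;
        # A missing q value means q=1.
        my $q = $part =~ /;q=([01](?:\.[0-9]{0,3})?)/i ? $1 : 1;
        next unless $q > 0;                         # q=0 means "not acceptable"
        push @prefs, [lc $lang, $q, scalar @prefs];
    }
    # Sort by q; for equal q values keep the order in which they were sent.
    return map { $_->[0] }
           sort { $b->[1] <=> $a->[1] || $a->[2] <=> $b->[2] } @prefs;
}

# Return the first preferred language found among the available ones,
# or the default if none of the preferences can be satisfied.
sub choose_language {
    my ($accept, $default, @available) = @_;
    my %have = map { $_ => 1 } @available;
    foreach my $lang (parse_accept_language($accept)) {
        return $lang if $have{$lang};
        (my $primary = $lang) =~ s/-.*//;           # "en-gb" falls back to "en"
        return $primary if $have{$primary};
    }
    return $default;
}

print choose_language('sv, en-GB;q=0.8, fi;q=0.5', 'en', 'en', 'fi'), "\n";   # en
```

In the usage line, Swedish ranks first but is not available, so the script moves on to en-GB, which is satisfied by the English texts.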

Links to specific locations on pages

A link can refer either to a page as a whole or to a specific location within a page. The latter is possible for locations that are marked, on the page being referred to, with a construct like <a name="anchorname">text</a>. It is advisable to include such markup at least in every heading on a page. Here anchorname is a string selected by the page author to identify the location. In practice, however, that string becomes visible to users in some situations.

When creating multilingual sites, it is best to use the same anchor names in different language versions. This makes it possible to refer to the locations using URL references like
http://www.cs.tut.fi/~jkorpela/multi/8.html#prod
where the part #prod is appended to the generic address. Done this way, things should work irrespective of how the generic address refers to a particular language version through the language negotiation mechanism.
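
One way to keep the anchor names identical is to generate the headings of all language versions from one table keyed by anchor name. The following Perl sketch shows the idea; the anchor name prod comes from the URL above, whereas the name names, the Finnish heading texts and the function heading_markup are made up for this illustration.

```perl
use strict;
use warnings;

# Heading texts per language, keyed by an anchor name shared by all versions.
# The Finnish texts are illustrative, written with HTML entities.
my %heading = (
    prod  => { en => 'Producing the translations',
               fi => 'K&auml;&auml;nn&ouml;sten tuottaminen' },
    names => { en => 'Naming the versions',
               fi => 'Versioiden nime&auml;minen' },
);

# Return the heading markup for one anchor in one language version.
sub heading_markup {
    my ($anchor, $lang) = @_;
    my $text = $heading{$anchor}{$lang};
    return qq{<h2><a name="$anchor">$text</a></h2>};
}

# The same anchor name appears in both language versions.
print heading_markup('prod', $_), "\n" for qw(en fi);
```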

More information

Alan Flavell's document Language Negotiation Notes is certainly worth reading for anyone interested in the language negotiation mechanism. And so is Dan's Web Tips: Languages.


2008-05-26 Jukka K. Korpela