Coverage of European languages by ISO Latin alphabets

This document presents the coverage of languages of European origin by ISO Latin alphabets (Latin alphabets No. 1 through 10 as defined by ISO 8859).

The basic information is from Annex A (informative) of the ISO 8859-15 standard, called the Annex in this document. Some information has been corrected on the basis of other sources and by comparing the encodings with the requirements of languages. There are many issues subject to interpretation in these matters, depending on what we regard as required characters in a language.
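The comparison of encodings against the letters a language needs can be sketched in Python, whose standard codecs include all ten ISO 8859 parts in question. This is only an illustration: the function name covering_alphabets and the sample letter set are assumptions made for this sketch, and the result depends entirely on which letters one regards as required.

```python
# Latin alphabet number -> part number of ISO 8859
# (for example, Latin alphabet No. 5 is ISO 8859-9 and No. 7 is ISO 8859-13).
LATIN_TO_ISO_PART = {1: 1, 2: 2, 3: 3, 4: 4, 5: 9, 6: 10, 7: 13, 8: 14, 9: 15, 10: 16}


def covering_alphabets(letters):
    """Return the Latin alphabet numbers that can encode every character in `letters`.

    Hypothetical helper for illustration; it tests character repertoire only.
    """
    covering = []
    for latin_no, part in LATIN_TO_ISO_PART.items():
        try:
            letters.encode(f"iso8859_{part}")
        except UnicodeEncodeError:
            continue  # at least one character is missing from this alphabet
        covering.append(latin_no)
    return covering


# Assumed requirement set: the Danish letters beyond ASCII.
print(covering_alphabets("æøåÆØÅ"))
```

Passing a different string, such as the letters listed in one of the notes, repeats the check for another language.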

Legend (see the notes for detailed explanations):

Coverage of languages by ISO Latin alphabets
Language          Covered by alphabet(s)      Notes
Albanian          10
Basque            3
Cornish                                       Note I
Croatian          10
Danish            7!                          Note II
Dutch                                         Note IIbis
English           10
Finnish           1* 2! 3! 5* 8* 10           Note III
French            1* 3* 5* 8* 10              Note IV
German            7! 10                       Note V
Hungarian         10
Irish             5* 6* 9* 10*                Note VI
Italian           10
Latin             7? 10                       Note VII
Manx Gaelic
Polish            7! 10                       Note VIII
Romanian          2* 10                       Note IX
Sámi              4* 6*                       Note X
Scottish Gaelic
Slovenian         7! 10                       Note XI
Swedish           7!                          Note XII
Turkish           3*                          Note XIII


General notes:

Note I (Cornish): Cornish is an extinct language with varying orthographies. According to some sources, some orthographies for "revived Cornish" use letters with diacritic marks. This is probably the reason why the Annex lists only Latin alphabets No. 1, 5, and 8 as suitable for Cornish.

Note II (Danish): The Annex does not list Latin alphabet No. 7 as covering Danish, but in fact it has all the letters needed in Danish. Cf. note VIII.

Note IIbis (Dutch): The word "Flemish" is often used for the Dutch language in Belgium.

None of the alphabets contains the ij ligature (capital and small, U+0132 and U+0133 in Unicode), which has often been regarded as a member of the Dutch alphabet. A more common view nowadays is that the ligature need not be encoded as a separate character but can be written as a two-character combination (ij or IJ).
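A quick way to confirm this with Python's standard codecs is sketched below. The helper name encodable is an assumption made for this example; the codec names are the ones Python uses for the ISO 8859 parts that define Latin alphabets No. 1 through 10.

```python
# ISO 8859 parts that define Latin alphabets No. 1 through 10.
ISO_PARTS = (1, 2, 3, 4, 9, 10, 13, 14, 15, 16)


def encodable(text, part):
    """Return True if every character of `text` exists in ISO 8859-`part`."""
    try:
        text.encode(f"iso8859_{part}")
        return True
    except UnicodeEncodeError:
        return False


ligatures = "\u0132\u0133"  # LATIN CAPITAL/SMALL LIGATURE IJ

# Is either ligature present in any of the ten alphabets?
print(any(encodable(ch, part) for ch in ligatures for part in ISO_PARTS))  # False

# The two-character spelling needs only ASCII letters, so it works everywhere.
print(all(encodable("IJ ij", part) for part in ISO_PARTS))  # True
```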

Note III (Finnish): Some characters in official Finnish orthography are not covered by Latin alphabets No. 1, 5, and 8, namely "s" and "z" with caron. See notes on ISO Latin 9.

The Annex does not list Latin alphabets No. 2 and 3 as covering Finnish. However, alphabet 3 contains the same Finnish letters as alphabet 1, including "ä" and "ö" but excluding "s" and "z" with caron, and alphabet 2 contains the latter too, so it actually covers Finnish better than alphabet 1 does! Perhaps the reason for not listing alphabets 2 and 3 here is that they lack the letter "å" which is used in Swedish (and Danish and Norwegian). Although names containing that letter occur relatively often in texts otherwise in Finnish, this is in principle not different from other occurrences of foreign names (and loan words, like fiancé) in their original spelling. Generally, if one starts taking such occurrences into account when considering which alphabets cover a given language, there is no way to tell when to stop.
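The comparison of alphabets 1, 2, and 3 made above can be reproduced with a small sketch. The helper name missing and the choice of test letters are assumptions made for illustration; alphabet 9 (ISO 8859-15) is included only for contrast.

```python
# Latin alphabet number -> Python codec name for the matching ISO 8859 part.
CODECS = {1: "iso8859_1", 2: "iso8859_2", 3: "iso8859_3", 9: "iso8859_15"}


def missing(letters, codec):
    """Return the characters of `letters` that `codec` cannot encode."""
    absent = []
    for ch in letters:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            absent.append(ch)
    return "".join(absent)


# Finnish vowels with diaeresis, s and z with caron, and the Swedish a with ring.
letters = "äöÄÖšžŠŽåÅ"
for latin_no, codec in CODECS.items():
    print(f"Latin alphabet No. {latin_no} lacks: {missing(letters, codec)!r}")
```

Under these assumptions, alphabet 1 lacks only the caron letters, alphabet 2 lacks only "å", and alphabet 3 lacks both groups.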

Note IV (French): Some characters in French are not covered by Latin alphabets No. 1, 3, 5, and 8, namely the oe ligature and capital Y with diaeresis. See notes on ISO Latin 9.

Note V (German): The Annex does not list Latin alphabet No. 7 as covering German, but in fact it has all the letters needed in German. Cf. note VIII.

Note VI (Irish): Latin alphabets No. 5, 6, 9, and 10 are not suitable for Irish when the old orthography is used.

Note VII (Latin): The Annex lists Latin alphabet No. 7 as covering Latin. However, Latin is often written so that e with diaeresis is used in some words (e.g. aër), and alphabet 7 does not contain that letter.

Note VIII (Polish): The Annex lists Latin alphabet No. 2 as the only one that covers Polish. However, alphabet 7 was developed to cover the needs of Latin-alphabet languages spoken in countries bordering the Baltic Sea ("Baltic Rim"), and Polish is explicitly mentioned among those languages in the definition of alphabet 7 (i.e. ISO 8859-13).

Note IX (Romanian): According to the Romanian Standards Institute, the diacritic mark which may appear under letters "s" and "t" in Romanian is not cedilla but comma below. According to this interpretation, strictly speaking no ISO Latin alphabet except the new ISO 8859-16 is suitable for Romanian, but according to ISO 8859-2, Latin alphabet No. 2 can be used "subject to the agreement of originator and receiver in information exchange".

The Unicode standard, consistently with the introduction of the new ISO 8859-16, recognizes the interpretation described above as follows (in version 3.0, chapter 7, section 7.1, in the description of Latin Extended-A):

In Turkish and Romanian, a cedilla and a comma below can replace one another depending on the font style. The letters U+015F LATIN SMALL LETTER S WITH CEDILLA and U+0163 LATIN SMALL LETTER T WITH CEDILLA (and their uppercase counterparts) have been duplicated at U+0219 LATIN SMALL LETTER S WITH COMMA BELOW and U+021B LATIN SMALL LETTER T WITH COMMA BELOW. The duplicated characters with explicit commas below are provided solely for compatibility with sociopolitical practices. Legacy encodings for these characters, including ISO/IEC 8859-2, contain only a single form of each of these characters, which is mapped to the form with cedilla.

Cf. the position paper ISO/IEC JTC 1/SC 2/WG 3 N 441, which presents arguments against making the distinction between cedilla and comma below for these characters.
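The practical effect of the two interpretations shows up in how the Unicode characters map to ISO 8859-2 and ISO 8859-16. The following sketch uses Python's standard codecs and unicodedata module; the helper name byte_in is an assumption made for this example.

```python
import unicodedata


def byte_in(ch, codec):
    """Return the byte value of `ch` in `codec` as hex, or None if it is absent."""
    try:
        return ch.encode(codec).hex()
    except UnicodeEncodeError:
        return None


# Small s and t with cedilla, then small s and t with comma below.
for ch in "\u015f\u0163\u0219\u021b":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    for codec in ("iso8859_2", "iso8859_16"):
        print(f"    {codec}: {byte_in(ch, codec)}")
```

In these mapping tables the same byte positions (BA and FE for the small letters) carry the cedilla forms in ISO 8859-2 and the comma-below forms in ISO 8859-16, so text converted between the two through Unicode needs an explicit substitution step.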

Note X (Sámi): The various Sámi languages and dialects have partly different spelling systems. Latin alphabets No. 4 and 6 cover the requirements of most Sámi orthographies, but for Skolt Sámi no ISO Latin alphabet is sufficient.

Note XI (Slovenian): The Annex lists Latin alphabets No. 2, 4, and 6 as covering Slovenian. However, it seems that alphabet 7 covers it too, since Slovenian needs, in addition to ASCII letters, only the letters "c", "s", and "z" with caron.

Note XII (Swedish): The Annex does not list Latin alphabet No. 7 as applicable to Swedish. There is no obvious reason for this, since that alphabet contains all the letters used in Swedish, such as "ä", "ö", "å", and "é". Perhaps the composers of the Annex thought that some names of foreign origin containing accented characters are used so often in Swedish that Latin alphabet No. 7 is not suitable for Swedish. An (expired) Internet draft titled Characters and character sets for various languages lists the letters äöåÄÖÅ (in addition to ASCII letters, of course) as "required characters" and áéëüÁÉËÜ as "important characters" for Swedish. Alphabet 7 lacks áÁëË, but these can hardly be regarded as necessary for Swedish (cf. the second part of note III).
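The claim about alphabet 7 (ISO 8859-13) can be checked against the two letter lists quoted from the draft. This is a minimal sketch; the function name missing_from is an assumption made for illustration.

```python
def missing_from(letters, codec):
    """Return the characters of `letters` that `codec` cannot encode."""
    absent = []
    for ch in letters:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            absent.append(ch)
    return "".join(absent)


required = "äöåÄÖÅ"      # "required characters" for Swedish per the draft
important = "áéëüÁÉËÜ"   # "important characters" for Swedish per the draft

# Latin alphabet No. 7 is ISO 8859-13.
print(repr(missing_from(required, "iso8859_13")))   # ''
print(repr(missing_from(important, "iso8859_13")))  # 'áëÁË'
```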

Note XIII (Turkish): The use of Latin alphabet No. 3 for Turkish is deprecated.

The language names are links to relevant entries in the collection of link lists for several languages on the iLoveLanguages site (a large catalog of language-related Internet pages).