Lintula is starting a transition to the UTF-8 character encoding,
replacing the old default of Latin 1. This page gives some
information about character sets and settings.
The Lintula default character set is changing from Latin 1 to UTF-8 for new users whose accounts are created after the 23rd of August 2011. For old users the change is held back by the creation of a ~/.i18n file that approximates the old defaults. Deleting this file moves the account to the new defaults.
Unfortunately the change will not necessarily be painless. Please read this page to find out what all this means and what results from it.
Computers process only abstract numeric values. In order to deal with characters such as letters, punctuation and even digits, a numerical value has been assigned to each of them. A standard that defines a mapping between numerical values and characters is called a character set. If a text is read with a different character set than it was written with, some or all of the characters may be interpreted differently than intended.
ASCII is an old 7-bit character set that contains the upper- and lowercase letters A-Z, the digits 0-9, and certain punctuation characters. From a Finnish point of view, its biggest deficiency is that it lacks the characters Å, Ä and Ö. ASCII text remains completely readable under both Latin 1 and UTF-8, as ASCII forms the basis of both.
ISO 8859-1, also known as Latin 1, is a somewhat newer 8-bit character set. It extends ASCII with the letters needed for writing most West-European and American languages, as well as some new special characters. It is largely sufficient for writing Finnish, but does not contain, for example, Greek or Cyrillic letters, nor the letters needed for writing Asian languages.
UTF-8 is not actually a character set but a character encoding for representing the characters of the Unicode character set. Unicode aims to contain the characters needed for writing all living languages and more. UTF-8 codewords are of variable length, either 8, 16, 24 or 32 bits long depending roughly on how 'western' the character is. For reasons of simplicity, this document will refer to UTF-8 as a character set and Unicode will not be mentioned anymore.
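The variable codeword length can be observed by counting bytes. The following is only a small sketch: the helper name utf8_len is made up for this example, and the characters are written as octal escapes so that the result does not depend on the encoding of your terminal or editor.

```shell
# utf8_len is a hypothetical helper used only for this illustration.
# It prints the number of bytes produced by a printf format string.
utf8_len() {
    printf "$1" | wc -c | tr -d ' '
}

utf8_len 'a'                   # ASCII letter a: 1 byte (8 bits)
utf8_len '\303\244'            # ä, hex C3 A4: 2 bytes (16 bits)
utf8_len '\342\202\254'        # euro sign, hex E2 82 AC: 3 bytes (24 bits)
utf8_len '\360\237\230\200'    # an emoji, hex F0 9F 98 80: 4 bytes (32 bits)
```

The four calls print 1, 2, 3 and 4 respectively.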
The default character set has little effect on high-level applications such as Firefox, OpenOffice or Thunderbird. These applications mostly handle file formats that have a defined character encoding, and know how to handle user input appropriately.
When UTF-8 is expected, those Latin 1 characters that are not also ASCII characters are an error. Programs may react to such errors in various ways. Here is an example of a string of bytes of Latin 1 text, and what it looks like when interpreted as UTF-8 (� means a broken character):
| String of bytes (as base-16 numbers) | E4 | E4 | 6B | 6B | F6 | 73 | 69 | E4 |
| Interpretation by Latin 1 | ä | ä | k | k | ö | s | i | ä |
| Interpretation by UTF-8 | � | � | k | k | � | s | i | � |
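One way to check whether a string of bytes is acceptable as UTF-8 is to ask iconv to convert it from UTF-8 to UTF-8 and look at the exit status. This is a minimal sketch, not the only way to do it: the helper name is_valid_utf8 is made up here, and the check relies on iconv exiting with a non-zero status when it meets an invalid input sequence.

```shell
# is_valid_utf8 is a hypothetical helper: it reads standard input and
# succeeds only if iconv accepts the bytes as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# The Latin 1 bytes from the table above, as octal escapes
# (hex E4 is \344, hex F6 is \366).
if printf '\344\344kk\366si\344' | is_valid_utf8; then
    echo "valid UTF-8"
else
    echo "not valid UTF-8"
fi
```

For the bytes in the table this reports that the input is not valid UTF-8, while plain ASCII text passes the check.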
When expecting Latin 1, those UTF-8 characters that are not also ASCII characters get interpreted as two or more garbage characters. Here is an example of a string of bytes of UTF-8 text, and what it looks like when interpreted by Latin 1:
| String of bytes (as base-16 numbers) | C3 | A4 | C3 | A4 | 6B | 6B | C3 | B6 | 73 | 69 | C3 | A4 |
| Interpretation by UTF-8 | ä | | ä | | k | k | ö | | s | i | ä | |
| Interpretation by Latin 1 | Ã | ¤ | Ã | ¤ | k | k | Ã | ¶ | s | i | Ã | ¤ |
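The same effect can be reproduced with the iconv conversion tool by deliberately declaring UTF-8 input to be Latin 1. This is an illustrative sketch only: the variable name is made up, and od is used just to show the resulting bytes in base 16.

```shell
# The UTF-8 bytes for ä (hex C3 A4), written as octal escapes.
utf8_a_umlaut='\303\244'

# Declaring the input as Latin 1 makes iconv treat each byte as a
# character of its own (Ã and ¤) and encode both separately as UTF-8.
printf "$utf8_a_umlaut" | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1
# The two input bytes come out as four: c3 83 c2 a4
```

This is also what happens when a file that is already UTF-8 is converted from Latin 1 a second time.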
Converting a Latin 1-encoded file input.txt into a UTF-8-encoded file output.txt:
iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt
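Note that the output must not be redirected to the input file itself, because the shell empties the file before iconv gets to read it. If you want to convert a file in place, something along these lines may work. It is only a sketch: the function name latin1_to_utf8_in_place is made up, and it assumes the file really is Latin 1, which iconv cannot verify.

```shell
# latin1_to_utf8_in_place is a hypothetical helper name.
# Assumption: the file is Latin 1. A file that is already UTF-8
# would come out double-encoded.
latin1_to_utf8_in_place() {
    tmp=$(mktemp) || return 1
    if iconv -f ISO-8859-1 -t UTF-8 "$1" > "$tmp"; then
        # Copy the contents back so the file keeps its permissions.
        cat "$tmp" > "$1"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}

# Usage (hypothetical file name):
#   latin1_to_utf8_in_place notes.txt
```

The original file is only overwritten after the conversion into the temporary file has succeeded.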
You can find files with potentially problematic names, that is, names containing non-ASCII characters, under the directory my_directory in this manner:
find my_directory | perl -ne 'print if /[^[:ascii:]]/'
If there are only a few files, it may be easiest to rename them with the graphical applications of the desktop. The tab-completion and wildcard features of the shell may also help: for instance, ? stands for any single character and * for any string. Something like this can work, but when using wildcards make sure that the pattern doesn't match more than one file, or the command may not do what you expect:
mv t?ss?_on_??kk?si?.txt tassa_oli_aakkosia.txt
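To guard against a pattern that matches several files, the rename can be wrapped in a small function that refuses to act unless there is exactly one match. This is a sketch under assumptions: the function name rename_single is made up, and the new name is given first so that all remaining arguments are the names the shell expanded from the pattern.

```shell
# rename_single is a hypothetical helper: it renames only when the
# pattern expanded to exactly one existing file.
rename_single() {
    new_name=$1
    shift    # the remaining arguments are the expanded matches
    if [ "$#" -eq 1 ] && [ -e "$1" ]; then
        mv -- "$1" "$new_name"
    else
        echo "expected exactly one existing match, nothing renamed" >&2
        return 1
    fi
}

# Usage: the shell expands the unquoted pattern before the call.
#   rename_single tassa_oli_aakkosia.txt t?ss?_on_??kk?si?.txt
```

If the pattern matches nothing, the shell normally passes it on unchanged, which is why the function also checks that the single argument names an existing file.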