
Lintula

Changes in the default character encoding in Lintula


Lintula is starting a transition to the UTF-8 character encoding, replacing the old default of Latin 1. This page gives some information about character sets and the related settings.

What is about to happen?

The Lintula default character set is going to change from Latin 1 to UTF-8 for new users, that is, users whose accounts are created after the 23rd of August 2011. Existing users are shielded from the change by a ~/.i18n file that is created for them and approximates the old defaults. You can move on to the new defaults by deleting this file.
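
As a cautious alternative to deleting the file outright, the following sketch moves it aside so that it can be restored later. The function name drop_old_defaults and the backup name ~/.i18n.old are made up for this example and are not part of the Lintula setup.

```shell
# drop_old_defaults: move ~/.i18n out of the way instead of deleting it.
# Hypothetical helper; the backup name .i18n.old is an arbitrary choice.
drop_old_defaults() {
    if [ -f "$HOME/.i18n" ]; then
        mv "$HOME/.i18n" "$HOME/.i18n.old"
        echo "moved ~/.i18n to ~/.i18n.old"
    else
        echo "no ~/.i18n file found"
    fi
}

# Usage: drop_old_defaults
# The command "locale charmap" prints the character set of the current session.
```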

Unfortunately the change will not necessarily be painless. Please read this page to find out what all this means and what results from it.

What is a character set?

Computers process only abstract numeric values. In order to deal with characters such as letters, punctuation and even digits, a numerical value has been assigned to each of them. A standard that defines a mapping between numerical values and characters is called a character set. If a text is read with a different character set than it was written with, some or all of the characters may be interpreted differently than intended.
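
To make the mapping concrete, this small sketch prints the numeric value assigned to a character. The function name char_code is made up for this example; it assumes a POSIX shell and is only meant for ASCII characters.

```shell
# char_code CHAR: print the numeric value of the character CHAR.
# POSIX printf treats a numeric argument that starts with a quote as
# "the value of the character that follows the quote".
char_code() {
    printf '%d\n' "'$1"
}

char_code A    # prints 65
char_code a    # prints 97
```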

What are the character sets mentioned in this document?

ASCII is an old 7-bit character set that contains the upper- and lowercase letters A-Z, the digits 0-9, and certain punctuation characters. From a Finnish point of view, its biggest deficiency is that it lacks the characters Å, Ä and Ö. ASCII text remains completely readable under Latin 1 or UTF-8 encoding, as ASCII forms the basis for both.

ISO 8859-1, also known as Latin 1, is a somewhat newer 8-bit character set. It extends ASCII with the letters needed for writing most Western European and American languages, as well as some new special characters. It is adequate for writing Finnish, but does not contain, for example, Greek or Cyrillic letters, or the characters needed for writing Asian languages.

UTF-8 is not actually a character set but a character encoding for representing the characters of the Unicode character set. Unicode aims to contain the characters needed for writing all living languages and more. UTF-8 code words are of variable length: 8, 16, 24 or 32 bits, depending roughly on how 'western' the character is. For simplicity, this document refers to UTF-8 as a character set and does not mention Unicode again.
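
The variable length can be checked by counting bytes. In this sketch the helper name byte_len is made up, and the sample characters are written as octal byte escapes so that the result does not depend on the encoding of the terminal or the script file.

```shell
# byte_len STRING: print how many bytes STRING occupies.
byte_len() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

a=$(printf '\141')                  # U+0061, the letter a
ae=$(printf '\303\244')             # U+00E4, the letter ä
euro=$(printf '\342\202\254')       # U+20AC, the euro sign
clef=$(printf '\360\235\204\236')   # U+1D11E, a musical G clef

byte_len "$a"       # 1 byte  =  8 bits
byte_len "$ae"      # 2 bytes = 16 bits
byte_len "$euro"    # 3 bytes = 24 bits
byte_len "$clef"    # 4 bytes = 32 bits
```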

What does the default character set affect?

What does the default character set NOT affect?

The default character set has little effect on high-level applications such as Firefox, OpenOffice or Thunderbird. These applications mostly handle file formats that have a defined character encoding, and know how to handle user input appropriately.

What kind of problems result from using the wrong character set?

When a program expects UTF-8, Latin 1 characters that are not also ASCII characters are errors. Programs may react to them in various ways. Here is an example of a string of bytes of Latin 1 text, and what it looks like when interpreted as UTF-8 (� means a broken character):

String of bytes (as base-16 numbers):  E4 E4 6B 6B F6 73 69 E4
Interpretation by Latin 1:             ä  ä  k  k  ö  s  i  ä
Interpretation by UTF-8:               �  �  k  k  �  s  i  �
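
One way to test whether a file would trigger such errors is to ask iconv to decode it as UTF-8. This is a sketch: the function name is_valid_utf8 is made up, and it assumes an iconv that exits with a non-zero status on invalid input, as the common implementations do.

```shell
# is_valid_utf8 FILE: succeed if FILE decodes cleanly as UTF-8.
# The converted text is discarded; only the exit status matters.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# The Latin 1 bytes from the table above, written as octal escapes.
sample=$(mktemp)
printf '\344\344kk\366si\344' > "$sample"

if is_valid_utf8 "$sample"; then
    echo "valid UTF-8"
else
    echo "not valid UTF-8"
fi
rm -f "$sample"
```

Note that plain ASCII files also pass this check, since ASCII text is valid UTF-8.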

When expecting Latin 1, those UTF-8 characters that are not also ASCII characters get interpreted as two or more garbage characters. Here is an example of a string of bytes of UTF-8 text, and what it looks like when interpreted by Latin 1:

String of bytes (as base-16 numbers):  C3 A4 C3 A4 6B 6B C3 B6 73 69 C3 A4
Interpretation by UTF-8:               ä     ä     k  k  ö     s  i  ä
Interpretation by Latin 1:             Ã  ¤  Ã  ¤  k  k  Ã  ¶  s  i  Ã  ¤
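
The garbling can be reproduced on a UTF-8 terminal by deliberately treating UTF-8 bytes as Latin 1. In this sketch the function name misread_as_latin1 is made up; it only wraps a standard iconv call.

```shell
# misread_as_latin1: read bytes on standard input, pretend they are Latin 1,
# and re-encode the result as UTF-8 so that a UTF-8 terminal can display it.
misread_as_latin1() {
    iconv -f ISO-8859-1 -t UTF-8
}

# The UTF-8 bytes C3 A4 (the letter ä), written as octal escapes.
printf '\303\244\n' | misread_as_latin1    # shows two characters: Ã¤
```

The same garbling results if a file that is already UTF-8 is converted from Latin 1 to UTF-8 a second time.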

What problems result from changing the character set?

How can I convert a text file from Latin 1 to UTF-8?

To convert a Latin 1 encoded file input.txt into a UTF-8 encoded file output.txt:

iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt
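
Because converting the same file twice garbles its non-ASCII characters, it can help to check first whether the file is already valid UTF-8. The function below is a sketch under that assumption; the name to_utf8 is made up. A Latin 1 file can in rare cases happen to be valid UTF-8 as well, so the check is a heuristic rather than a guarantee.

```shell
# to_utf8 INPUT OUTPUT: convert INPUT from Latin 1 to UTF-8, unless INPUT
# already decodes cleanly as UTF-8, in which case it is copied unchanged.
to_utf8() {
    if iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1; then
        cp "$1" "$2"
    else
        iconv -f ISO-8859-1 -t UTF-8 "$1" > "$2"
    fi
}

# Usage: to_utf8 input.txt output.txt
```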

How can I convert the name of a file from Latin 1 to UTF-8?

You can list the files under the directory my_directory whose names may cause problems in this manner:

find my_directory | perl -ne 'print if /[^[:ascii:]]/'

If there are only a few files, it may be easiest to rename them with the graphical applications of the desktop. The tab-completion and wildcard features of the shell may also help: for instance, ? stands for any single character and * for any string. Something like the following can work, but when using wildcards make sure that the pattern does not match more than one file, or the command may not do what you expect:

mv t?ss?_on_??kk?si?.txt tassa_oli_aakkosia.txt
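
If the original letters should be kept rather than replaced, the name itself can be passed through iconv. This is a sketch only: the function name rename_to_utf8 is made up, it assumes the old name really is Latin 1, and it handles one file at a time. Try it on a copy first.

```shell
# rename_to_utf8 DIR NAME: rename DIR/NAME, whose name is assumed to be
# Latin 1, to the same name encoded as UTF-8.
rename_to_utf8() {
    dir=$1
    old=$2
    # Names that are already valid UTF-8 (including pure ASCII) are left alone.
    if printf '%s' "$old" | iconv -f UTF-8 -t UTF-8 > /dev/null 2>&1; then
        return 0
    fi
    new=$(printf '%s' "$old" | iconv -f ISO-8859-1 -t UTF-8) || return 1
    # Refuse to overwrite an existing file.
    if [ -e "$dir/$new" ]; then
        echo "not renaming: $dir/$new already exists" >&2
        return 1
    fi
    mv -- "$dir/$old" "$dir/$new"
}

# Usage: rename_to_utf8 my_directory "$name"
```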



24.08.2011