Using a specific character encoding in Linux
Welcome to the second part of our article about the encoding world! In the first part we saw what a Character Encoding Standard is and why there are so many of them. In this second part we will see how to set a specific encoding on Linux and how (and why!) to convert from one encoding standard to another.
What do we do to show the correct characters in a Linux environment?
The answer depends on what we mean by ‘show’. There are two different aspects:
- the character encoding used in a text file or in a filename (as seen in the previous chapter);
- the character visualization, which depends on where we are actually trying to display the character: on the console, in a text editor, in the browser? That is more a question of having the right font for the specific character, and it is outside the scope of this article.
In this chapter we will concentrate on the first problem: we will ensure that the Linux system supports the correct encoding for the file.
Every ‘national’ aspect in Linux (including character encoding) is managed by the so-called ‘locales’. The locale system is quite broad and covers many other aspects of regionalization, such as collation, translation into different languages, date and time formats and so on.
Linux is a multi-user system at its core (although these days it is used more and more as a ‘personal’ computer system), so every user can choose their own locale, independently of the other users on the system. First, we should decide which encoding to choose. Today, the most widespread is UTF-8: it is a Unicode encoding (so it can represent every character defined in Unicode) and it is relatively compact and efficient. Moreover, it is a superset of ASCII: every ASCII character is encoded as the same single byte in UTF-8, so ASCII text displays correctly on a UTF-8 system without any need to transcode.
ISO-8859 is another possible candidate: if you are absolutely certain that you will NEVER use characters outside your chosen ISO-8859 table, you can live happily with that. But be aware! A character that is NOT present in your ISO-8859 table cannot be stored at all, and a byte written under a different ISO-8859 table will be shown as whatever character your own table assigns to that byte value, which may well be a different one!
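To make the warning concrete, here is a minimal sketch that decodes the same byte under two different ISO-8859 tables. It assumes the ‘iconv’ and ‘od’ commands are available; the byte 0xA4 and the variable names are chosen only for illustration.

```shell
# Byte 0xA4 (octal 244) is the generic currency sign in ISO-8859-1,
# but the euro sign in ISO-8859-15.
# Decode it under each table and print the resulting UTF-8 bytes in hex.
as_latin1=$(printf '\244' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')
as_latin9=$(printf '\244' | iconv -f ISO-8859-15 -t UTF-8 | od -An -tx1 | tr -d ' \n')

echo "0xA4 read as ISO-8859-1:  $as_latin1"   # c2a4   = U+00A4, currency sign
echo "0xA4 read as ISO-8859-15: $as_latin9"   # e282ac = U+20AC, euro sign
```

The byte on disk is identical in both cases; only the table used to interpret it changes.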
To use a specific locale you simply set a special environment variable named ‘LANG’.
If I choose to set my locale to the Italian language and UTF-8 encoding, I should put:
export LANG=it_IT.UTF-8
somewhere in my initialization scripts (usually in ~/.bash_profile or in ~/.bashrc). That variable assignment means: “use as locale the Italian language in its ITalian national variant with UTF-8 encoding”. The ‘it_IT’ part is used to replace the strings in programs with their Italian translation (if available), for collation (the ordering of strings), for date and time formats (dd/mm/yyyy and 24h time) and for many other little quirks specific to Italy. The ‘UTF-8’ part means that the system will use the UTF-8 encoding. Beware! Every program you execute should read this variable and use the right encoding, but this is not mandatory! If you run a program which doesn’t support UTF-8 and only understands ISO-8859, or vice versa (and such programs exist!), well… problems are awaiting you.
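As a rough sketch, the two parts of a value such as it_IT.UTF-8 can be pulled apart in the shell, and you can check whether that locale is actually installed. The variable names are illustrative, and the parsing assumes the simple language_TERRITORY.codeset form with no ‘@modifier’ suffix.

```shell
lang_value='it_IT.UTF-8'           # example value, as it would appear in LANG

# Split the value using plain POSIX parameter expansion.
language=${lang_value%%_*}         # text before the first '_'  -> it
rest=${lang_value#*_}              # text after the first '_'   -> IT.UTF-8
territory=${rest%%.*}              # text before the first '.'  -> IT
codeset=${lang_value#*.}           # text after the first '.'   -> UTF-8

echo "language=$language territory=$territory codeset=$codeset"

# 'locale -a' lists the installed locales; glibc usually spells this one it_IT.utf8.
locale -a | grep -i -E '^it_IT\.utf-?8$' || echo "it_IT.UTF-8 does not seem to be installed"
```

If the locale is not installed, setting LANG to it has no useful effect, so it is worth checking before editing your initialization scripts.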
Let’s see an example. Suppose your system is set up for UTF-8 as seen in the previous paragraph and your favorite text editor handles UTF-8; suppose you need to view the results of your editing in a program that does not understand UTF-8, but only ISO-8859 (it’s an old program, written before the birth of the Unicode specification). As long as you write English text with no characters outside the ASCII standard, everything goes well. When you save the text, you write a pure ASCII file. Both UTF-8 and ISO-8859 are fully backward compatible with ASCII, so you see exactly the same text. But if you type an accented character in your editor, like ‘à’, save the file and open it in your old viewer, you end up with ‘Ã’ (a capital A with tilde) followed by a space.
When you save your ‘à’ letter in UTF-8 encoding you write a byte sequence to your file; in this specific case you are writing ‘0xC3’ followed by ‘0xA0’, which is exactly ‘à’ in UTF-8. But your old viewer does not understand UTF-8, remember? So it tries to interpret this byte sequence in the encoding it knows. In ISO-8859-1 the byte ‘0xC3’ means “Capital letter A with tilde” and ‘0xA0’ means “no-break space” (a special type of space). That’s it!
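The misreading described above can be reproduced on the command line. This is only a sketch, assuming ‘iconv’ and ‘od’ are available; the bytes are written with octal escapes so that the result does not depend on the encoding of your terminal.

```shell
# 'à' in UTF-8 is the two bytes 0xC3 0xA0 (octal 303 240).
utf8_bytes=$(printf '\303\240' | od -An -tx1 | tr -d ' \n')

# Pretend to be the old viewer: treat those two bytes as ISO-8859-1,
# then re-encode the result as UTF-8 to see which characters it "found".
misread=$(printf '\303\240' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

echo "bytes on disk:      $utf8_bytes"   # c3a0
echo "as seen by viewer:  $misread"      # c383c2a0 = U+00C3 (Ã) + U+00A0 (no-break space)
```

One character has become two: each original byte was taken as a separate ISO-8859-1 character.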
Normally, on modern systems that should never happen: a program should always check the encoding of a file (but be aware, it’s a heuristic process!) and adapt its decoding of the file to that knowledge. But sometimes this check fails, and sometimes the program does not check at all; so, sometimes, you can end up in this situation.
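One cheap heuristic of this kind is to test whether a file decodes cleanly as UTF-8. The sketch below uses ‘iconv’ for that; the helper name is_valid_utf8 is hypothetical, and the check is not real encoding detection: a pure ASCII file passes too, and a failure only tells you the file is not UTF-8, not which encoding it actually uses.

```shell
# Hypothetical helper: succeeds only if the file decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

sample=$(mktemp)
# 'città' saved as ISO-8859-1: the byte 0xE0 (octal 340) followed by a newline
# is not a valid UTF-8 sequence.
printf 'citt\340\n' > "$sample"

if is_valid_utf8 "$sample"; then
    echo "the file decodes as UTF-8"
else
    echo "not valid UTF-8, probably a legacy 8-bit encoding"
fi
rm -f "$sample"
```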
What can you do to make a file with the wrong encoding readable again?
Well, luckily Linux provides a useful program called ‘iconv’ that does just this: it converts files from one encoding to another.
iconv -f UTF-8 -t ISO_8859-1 FileEncodedInUTF8.txt > FileEncodedInISO8859.txt
Simple and powerful.
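As a small end-to-end sketch, the conversion can be checked by looking at the bytes it produces. The temporary files and variable names are only illustrative, and the behaviour on unconvertible characters is assumed to be that of the common glibc ‘iconv’, which stops with an error.

```shell
src=$(mktemp)
dst=$(mktemp)

printf 'citt\303\240\n' > "$src"                  # 'città' encoded in UTF-8
iconv -f UTF-8 -t ISO-8859-1 "$src" > "$dst"      # same text, now in ISO-8859-1

converted=$(od -An -tx1 "$dst" | tr -d ' \n')
echo "$converted"                                 # 63697474e00a: 'à' is now the single byte e0

# The euro sign (UTF-8 bytes e2 82 ac) has no slot in ISO-8859-1,
# so the conversion is expected to fail rather than silently pick another character.
if printf '\342\202\254' | iconv -f UTF-8 -t ISO-8859-1 >/dev/null 2>&1; then
    euro_status="converted"
else
    euro_status="failed"
fi
echo "euro sign to ISO-8859-1: $euro_status"

rm -f "$src" "$dst"
```

A failed conversion is a useful signal: it tells you the target table cannot hold your text, which is exactly the ISO-8859 limitation discussed earlier.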