Character Encodings in Linux: ASCII, UTF-8 and ISO-8859
A computer represents information as numbers, and when that information needs to be communicated to humans (and vice versa) it has to be encoded as characters. Read on to learn more, and stay tuned for the second part, 'Using a specific character encoding in Linux'.
First of all, some definitions:
- ASCII, abbreviated from American Standard Code for Information Interchange, is a character encoding standard. Originally based on the English alphabet, ASCII encodes 128 specified characters into seven-bit integers: the digits 0 to 9, lowercase letters a to z, uppercase letters A to Z, basic punctuation symbols, control codes that originated with Teletype machines, and the space. For example, lowercase 'j' becomes binary 1101010, decimal 106. Of the 128 characters, 33 are non-printing control characters (many now obsolete) that affect how text and whitespace are processed, and 95 are printable, including the space (which is considered an invisible graphic character).
- ISO/IEC 8859 is a joint ISO and IEC series of standards for 8-bit character encodings. The series consists of numbered parts, such as ISO/IEC 8859-1, ISO/IEC 8859-2, etc.; there are 15 parts in all. While the bit patterns of the 95 printable ASCII characters are sufficient to exchange information in modern English, many other languages that use Latin alphabets need additional symbols not covered by ASCII. ISO/IEC 8859 sought to fix this problem by using the eighth bit of an 8-bit byte to make room for another 96 printable characters. Early encodings had been limited to 7 bits because of restrictions in some data transmission protocols, and partly for historical reasons. However, more characters were needed than could fit in a single 8-bit encoding, so several mappings were developed, including at least ten suitable for various Latin alphabets.
- Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. Developed in conjunction with the Universal Coded Character Set (UCS) standard and published as 'The Unicode Standard', the latest version of Unicode contains a repertoire of more than 128,000 characters covering 135 modern and historic scripts, as well as multiple symbol sets. Unicode can be implemented by different character encodings. The most commonly used are UTF-8, UTF-16 and the now-obsolete UCS-2.
- UTF-8 is a character encoding capable of encoding all possible characters, or code points, defined by Unicode. It was originally designed by Ken Thompson and Rob Pike. The encoding is variable-length and uses 8-bit code units. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks that affect the alternative UTF-16 and UTF-32 encodings. The name is derived from: Unicode (or Universal Coded Character Set) Transformation Format, 8-bit.
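The definitions above are easy to verify on any Linux system; here is a small Python sketch (Python is used purely as a convenient calculator, any language would do):

```python
# ASCII maps lowercase 'j' to the seven-bit integer 106 (binary 1101010)
code = ord('j')
print(code)                   # 106
print(format(code, '07b'))    # 1101010

# and since UTF-8 is backward compatible with ASCII,
# 'j' is that same single byte in UTF-8 as well
print('j'.encode('utf-8'))    # b'j'
```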
As we can see, ASCII is pretty limited (128 characters in total, and only 95 of them printable), but it is universally accepted, especially in the English-speaking world. What about all the characters NOT in the ASCII table? Think of all the "accented" characters ('à, è, ì, ò, ù' for Italian and French speakers), the circumflexed characters, the umlauts, the tildes... There are so many of them!
ISO-8859 was a first attempt at solving the problem: since ASCII is 7-bit, the eighth bit can be used to add another set of 96 characters to the ASCII standard. If the eighth, most significant bit is 0, we have plain ASCII; if it is 1, we have the extended characters. But 96 more characters are not enough, not even for the Latin-based languages of Europe. Should I mention Chinese, or Japanese? Or Arabic? So ISO-8859 has 15 different 'extended' tables. If I need to encode the character 'à' I can use ISO-8859-15 ('Latin-9', the standard table for Western European languages such as Italian), and the byte will have the value 0xE0. But if I forget that the encoding is ISO-8859-15 and try to decode with another one, say ISO-8859-4, I will end up with the character 'ā' ('a' with a macron). How many times have you downloaded a text file or copied an MP3 and then discovered that the title or the content is a barely readable mess of characters? That happens because the original file (or its name) was encoded with one ISO-8859 table while your system uses another one.
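The 'à' mix-up described above can be reproduced directly; this Python sketch encodes with one ISO-8859 table and decodes with another:

```python
# encode 'à' with ISO-8859-15 (Latin-9): a single byte with value 0xE0
data = 'à'.encode('iso-8859-15')
print(data)                       # b'\xe0'

# decode the same byte with the wrong table, ISO-8859-4:
# there, 0xE0 means 'ā' (a with macron), not 'à'
print(data.decode('iso-8859-4'))  # ā
```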
So, as we can see, we need to know the encoding to understand a text, otherwise "strange" characters appear everywhere. But what if we could use one large, unambiguous table containing ALL the known characters in the world? The problem of choosing a table would simply disappear!
Enter Unicode!
Unicode was created for exactly that purpose: to contain EVERY character in the world, present, past and future (there have even been proposals to add the Klingon script from Star Trek, and the Tengwar script used in Tolkien's world, although neither has been accepted into the standard so far). But Unicode itself is quite abstract, and we need a concrete computer representation. The simplest (but rarely used) one is UTF-32: in this encoding every Unicode character is identified by 4 bytes (32 bits). That gives 2^32 possible values, a big, big number: 4,294,967,296, in other words over 4 billion (far more than needed, since Unicode only defines code points up to U+10FFFF, about 1.1 million).
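The fixed 4-bytes-per-character layout of UTF-32 is easy to observe in a Python sketch (the '-le' suffix just pins the byte order so that no byte order mark is prepended):

```python
# in UTF-32 every character occupies exactly 4 bytes,
# whether it is plain ASCII or an accented letter
for ch in ('a', 'à', '€'):
    encoded = ch.encode('utf-32-le')
    print(ch, len(encoded))   # always 4 bytes per character
```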
Problem solved, right? No.
Because using 32 bits for every character of every string is too much, a gigantic waste of space. There are other, more efficient (and more widely used) Unicode encodings: UTF-16 and UTF-8. As the names imply, it may seem that UTF-16 encodes all characters in 16 bits and UTF-8 in 8 bits. But that is not true (and obviously impossible): the number in the name is the minimum size of one character. In UTF-8 a character uses AT LEAST 8 bits; in UTF-16, at least 16 bits. The size is variable, which means that a UTF-8 character can use 1, 2, 3 or 4 bytes, and a UTF-16 character 2 or 4, depending on the character.
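The variable length is easy to check; in this Python sketch each sample character lands in a different UTF-8 size class:

```python
# UTF-8 uses 1 to 4 bytes depending on the code point
samples = {
    'a': 1,    # ASCII letter
    'à': 2,    # Latin accented letter
    '€': 3,    # euro sign (Basic Multilingual Plane)
    '😀': 4,   # emoji (outside the BMP)
}
for ch, expected in samples.items():
    n = len(ch.encode('utf-8'))
    print(ch, n)
    assert n == expected
```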
As an example, all ASCII characters map directly to UTF-8: a file encoded in UTF-8 containing ONLY ASCII characters will be read correctly on an ASCII system without transcoding, and vice versa. The 'accented' national characters require 2 bytes. Most Japanese, Chinese and Korean characters require 3 bytes, and characters outside the Basic Multilingual Plane (emoji, rare ideographs) require 4. The real beauty of all this is that on a UTF-8 (or UTF-16 or UTF-32) system we can mix all these characters in one text file without ever needing to change the encoding. Try to imagine an Italian-Japanese dictionary in ISO-8859!
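That kind of mixing is trivial in UTF-8 and impossible within a single ISO-8859 table, as this Python sketch shows (the sample string is just an illustration):

```python
text = 'caffè 日本語'   # Italian and Japanese mixed in one string

# UTF-8 covers both scripts and round-trips losslessly
data = text.encode('utf-8')
assert data.decode('utf-8') == text

# but no single ISO-8859 table covers Japanese
try:
    text.encode('iso-8859-15')
except UnicodeEncodeError as exc:
    print('not representable in ISO-8859-15:', exc)
```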
At this point we have mentioned a number of different encodings (and there are many more out there!). Some of these encodings are quite compatible: a pure-English text is byte-for-byte identical whether encoded in ASCII, in any ISO-8859 table, or in UTF-8. Others are not compatible at all, such as UTF-32 vs ASCII. If our text uses 'national' characters outside the ASCII range, we will surely run into transcoding problems when mixing different encoding systems.
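The "quite compatible" case is worth seeing concretely; in this Python sketch a pure-English sentence produces exactly the same bytes under three different encodings:

```python
text = 'A plain English sentence.'

# one byte stream, three 'different' encodings
assert text.encode('ascii') == text.encode('iso-8859-1') == text.encode('utf-8')
print('identical bytes in ASCII, ISO-8859-1 and UTF-8')
```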
Stay tuned! The second part of the article will be published very soon!