ANSI

The first widely used character set was 7-bit ASCII, standardized by ANSI, with values ranging from 0 to 127. Being developed for English, it uses the Latin alphabet, but without accented letters and with only a limited set of punctuation signs.

In the '80s, extensions were provided by using 8-bit character tables, whose values 128 to 255 were used to encode the missing characters. But there were so many missing characters that those 128 values were not enough, so a number of maps were defined. For instance, ISO-8859-1 covers Western European languages, with letters for French (é), Nordic languages (Ø), a few symbols (½), and so on. Typical computer support consisted of loading the appropriate character map early on, after which glyphs were rendered correctly.

The first issue with this approach is conversion. Viewing Greek or Cyrillic text on a display configured for Western European languages requires switching back and forth between code pages, because the same byte value stands for a different character in each map.
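To make the code page problem concrete, here is a minimal sketch in C that decodes one byte under three ISO-8859 maps. The names `decode_byte` and `enum codepage` are hypothetical, and only a small part of each map is covered (all of ISO-8859-1, the basic Cyrillic letters of ISO-8859-5, the small Greek letters of ISO-8859-7); everything else is reported as U+FFFD.

```c
#include <stdint.h>

/* Hypothetical identifiers for three 8-bit character maps. */
enum codepage { ISO_8859_1, ISO_8859_5, ISO_8859_7 };

/* Returns the Unicode code point that `byte` stands for under `page`.
   Partial tables only: bytes outside the handled ranges yield U+FFFD. */
uint32_t decode_byte(enum codepage page, unsigned char byte)
{
    if (byte < 0x80)
        return byte;                 /* the ASCII half is shared by all maps */

    if (page == ISO_8859_1)
        return byte;                 /* Latin-1 values equal their code points */

    if (page == ISO_8859_5 && byte >= 0xB0 && byte <= 0xEF)
        return 0x0360u + byte;       /* U+0410..U+044F, basic Cyrillic letters */

    if (page == ISO_8859_7 && byte >= 0xE1 && byte <= 0xF9)
        return 0x02D0u + byte;       /* U+03B1..U+03C9, small Greek letters */

    return 0xFFFD;                   /* not covered by this sketch */
}
```

With these tables, the byte 0xE9 decodes to é (U+00E9) under ISO-8859-1, to щ (U+0449) under ISO-8859-5 and to ι (U+03B9) under ISO-8859-7, which is why a file is unreadable when opened with the wrong map.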

Unicode

Unicode is a standard and an effort to encode symbols from every writing system existing or having existed on Earth. Unicode 6.0, for instance, defined about 109,000 characters covering 93 scripts. Its character repertoire is equivalent to ISO/IEC 10646. Unicode consists of

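a table of symbols, each with a unique number (its code point) and a unique name, like "GREEK SMALL LETTER ALPHA" for α
encoding forms: UTF-16, UTF-32, UTF-8
representative glyphs: a reference screen representation for each symbol (the actual rendering is left to fonts)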

Storage and binary representation

UTF-32 stores each symbol on 4 bytes, in one of two byte orders: big endian or little endian.
UTF-16 stores most symbols on 2 bytes; rarely used values are stored as a pair of 2-byte units (a "surrogate pair"). Again there are two byte orders: big endian and little endian.
UTF-8 was designed to be compatible with ASCII; each symbol is stored on 1, 2, 3 or 4 bytes. The encoding is defined as a sequence of bytes, so there is no ambiguity linked to endianness.
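The storage rules above can be sketched in C as follows. The function names `utf8_encode`, `utf16_encode` and `put_u16` are illustrative assumptions, not part of any standard library; the bit layouts themselves follow the published UTF-8 and UTF-16 definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one code point as UTF-8. Returns the number of bytes written
   (1 to 4), or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 1 byte: identical to ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes: 11110xxx + 3 continuation bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Encodes one code point as UTF-16. Returns the number of 16-bit units
   written (1 or 2), or 0 if cp is not encodable. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp <= 0x10FFFF) {                  /* surrogate pair: high unit, then low unit */
        cp -= 0x10000;
        out[0] = (uint16_t)(0xD800 | (cp >> 10));
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        return 2;
    }
    return 0;
}

/* Serializes one 16-bit unit in the requested byte order. */
void put_u16(uint16_t unit, int big_endian, unsigned char out[2])
{
    unsigned char hi = (unsigned char)(unit >> 8);
    unsigned char lo = (unsigned char)(unit & 0xFF);
    out[0] = big_endian ? hi : lo;
    out[1] = big_endian ? lo : hi;
}
```

For α (U+03B1), `utf8_encode` produces the two bytes CE B1 regardless of the machine, whereas the single UTF-16 unit 03B1 is written as 03 B1 in big endian and B1 03 in little endian.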

C and C++ support

There are three types of "char" in C and C++: plain (unqualified), signed and unsigned. The latter two were added to mirror the behavior of 'int', as chars may be used to store small numbers. The standard says:

3.9.1 Fundamental types [basic.fundamental]

Objects declared as characters (char) shall be large enough to store any member of the implementation's basic character set. If a character from this set is stored in a character object, the integral value of that character object is equal to the value of the single character literal form of that character. It is implementation-defined whether a char object can hold negative values. Characters can be explicitly declared unsigned or signed. Plain char, signed char, and unsigned char are three distinct types. A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements (basic.types); that is, they have the same object representation. For character types, all bits of the object representation participate in the value representation. For unsigned character types, all possible bit patterns of the value representation represent numbers. These requirements do not hold for other types. In any particular implementation, a plain char object can take on either the same values as a signed char or an unsigned char; which one is implementation-defined.

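What is fundamental here is that text should be declared as plain "char", which is the type of string literals and of the arguments of the standard string functions. No bit is ever lost in any of the three types, but whether a byte above 127 reads back as a negative number in a plain "char" is implementation-defined. Code that inspects the numeric value of such bytes should therefore convert them to "unsigned char" first.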
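Since the standard leaves the signedness of plain char to the implementation, a byte above 127 may compare as negative. A minimal sketch of the usual precaution, converting to unsigned char before looking at the value, is given below; `count_non_ascii` is a hypothetical helper written for this illustration.

```c
#include <stddef.h>

/* Counts the bytes of a NUL-terminated string that lie outside the
   7-bit ASCII range. The cast makes the test independent of whether
   plain char is signed on the current implementation. */
size_t count_non_ascii(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        unsigned char byte = (unsigned char)*s;
        if (byte >= 0x80)
            ++n;
    }
    return n;
}
```

Without the cast, a test such as `*s >= 0x80` would never be true on an implementation where plain char is signed and 8 bits wide.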

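In order to support wide characters, the wchar_t type was added to the C standard. Its size is implementation-defined: it is commonly 2 bytes on Windows and 4 bytes on Linux. Functions taking wchar_t instead of char arguments are generally named with a "w" or "wcs" prefix, for instance wprintf for printf and wcslen for strlen.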
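As a small illustration of the wide-character functions, the sketch below computes the storage needed by a wide string using the standard `wcslen`; the helper name `wide_storage_bytes` is an assumption made for this example.

```c
#include <stddef.h>
#include <wchar.h>

/* Returns the number of bytes occupied by a wide string, including its
   terminating L'\0'. The result depends on sizeof(wchar_t), which
   differs between platforms. */
size_t wide_storage_bytes(const wchar_t *s)
{
    return (wcslen(s) + 1) * sizeof(wchar_t);
}
```

For L"abc", `wcslen` returns 3 just as `strlen` does for "abc", but the storage is four times sizeof(wchar_t) bytes instead of 4 bytes.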

Character functions

Some of the basic operations on character strings are listed below:

length: how many symbols?
tests: is this symbol a letter, a digit, a punctuation sign, ...?
finding chars inside a string
concatenating strings
displaying strings

Here is a brief summary of the "features" of each encoding with respect to those functions:

Function	UTF-8	UTF-16	UTF-32
length	number of symbols <= storage length in bytes	number of symbols <= storage length in 16-bit units	number of symbols = storage length in bytes / 4
tests	must be aware of UTF-8 and locales	must be aware of UTF-16, locales AND endianness	must be aware of UTF-32, locales AND endianness
finding symbols	must implement a sequential machine for prefix codes	must implement a sequential machine for prefix codes; must be aware of endianness	must be aware of endianness
concatenating strings	must be 8-bit compatible	must verify that both strings have the same endianness	must verify that both strings have the same endianness
displaying strings	must pass it to the external app without truncating the 8th bit	must ensure the external app is UTF-16; must check for endianness	must ensure the external app is UTF-32; must check for endianness
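The "length" row for UTF-8 can be illustrated with a short C sketch. It assumes the input is valid UTF-8 and counts code points by skipping continuation bytes (those of the form 10xxxxxx); `utf8_length` is a hypothetical name, not a standard function.

```c
#include <stddef.h>

/* Counts the code points in a NUL-terminated, valid UTF-8 string.
   Every code point starts with exactly one byte that is not a
   continuation byte, so counting those bytes gives the length. */
size_t utf8_length(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For "héllo" stored as UTF-8, `strlen` reports 6 bytes while `utf8_length` reports 5 symbols, which is the inequality stated in the table.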

Cons

UTF-8: not all byte sequences are valid.
UTF-16: not all sequences are valid. For plain ASCII text, 50% of the memory is wasted: when transmitting strings, half of the bytes are zeros.
UTF-32: not all sequences are valid. For plain ASCII text, 75% of the memory is wasted: when transmitting strings, three-quarters of the bytes are zeros.
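To show what "not all sequences are valid" means for UTF-8, here is a hedged sketch of a validity check in C. The name `utf8_is_valid` is an assumption; the checks performed (lead byte pattern, continuation bytes, overlong forms, surrogates, upper limit U+10FFFF) follow the UTF-8 definition, but this is an illustration rather than a reference implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the len bytes at s form well-formed UTF-8. */
bool utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t extra;        /* continuation bytes expected after b */
        unsigned long cp;    /* code point being decoded */
        unsigned long min;   /* smallest code point allowed at this length */

        if (b < 0x80)                { extra = 0; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return false;           /* stray continuation or invalid lead byte */

        if (extra > len - i - 1)
            return false;            /* sequence is cut short */

        for (size_t k = 1; k <= extra; ++k) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80)
                return false;        /* expected a 10xxxxxx byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min)
            return false;            /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;            /* surrogate code point */
        if (cp > 0x10FFFF)
            return false;            /* beyond the Unicode range */

        i += extra + 1;
    }
    return true;
}
```

The two bytes CE B1 (α) pass this check, while a lone 80, the overlong pair C0 80 and the truncated sequence E2 82 are all rejected.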

International Characters Support

