International Characters Support

From Octave

Latest revision as of 10:42, 30 November 2023

ANSI

The first widely used character set was 7-bit ASCII, with values ranging from 0 to 127. Being developed for English, it uses a Latin character set, but without accents or other diacritical marks.

In the '80s, extensions were provided by using 8-bit character tables, whose code numbers 128 to 255 were used to encode the missing values. But there were so many characters that those additional 128 values were not enough, so a number of maps (code pages) were defined. For instance, ISO-8859-1 covers Western European languages, with letters for French: é, German: ä, Nordic languages: Ø, a few math symbols: °, µ, ½, and so on. Typical computer support consisted of loading the appropriate character map beforehand; glyphs were then rendered correctly.

The first issue with this approach is conversion. Viewing Greek or Cyrillic text on a display configured for Western European languages requires switching back and forth between code pages.
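
To make the code page problem concrete, the following Python sketch decodes one and the same stored byte under three ISO-8859 maps. The byte value 0xE9 is picked only for illustration, and the codec names are those of Python's standard library.

```python
import unicodedata

# One stored byte, three different symbols depending on the active code page.
raw = bytes([0xE9])

western = raw.decode("iso-8859-1")   # Western European map
greek = raw.decode("iso-8859-7")     # Greek map
cyrillic = raw.decode("iso-8859-5")  # Cyrillic map

for codec, symbol in (("ISO-8859-1", western),
                      ("ISO-8859-7", greek),
                      ("ISO-8859-5", cyrillic)):
    # Print the Unicode name instead of the glyph so the output is pure ASCII.
    print(f"{codec}: U+{ord(symbol):04X} {unicodedata.name(symbol)}")
```

The byte never changes; only the table used to interpret it does, which is why text from one code page looks garbled when viewed with another.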

Unicode

Unicode is a standard and an effort to encode symbols from every language existing or having existed on Earth. As of Unicode 15.1 (2023), it defines roughly 150,000 characters covering 161 scripts. Its character set is kept identical to that of ISO/IEC 10646. Unicode consists of:

  • a table of symbols, each with a unique name, like "GREEK SMALL LETTER ALPHA" for α
  • encoding forms: UTF-16, UTF-32, UTF-8
  • representative glyphs: a reference screen representation for each symbol
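
As a small illustration of the symbol table, Python's standard unicodedata module exposes the mapping between a symbol, its code point and its unique name. This is only a sketch of how the table can be queried; it says nothing about how Octave accesses it.

```python
import unicodedata

alpha = "\u03b1"  # the symbol α

name = unicodedata.name(alpha)                          # symbol -> unique name
code_point = ord(alpha)                                 # symbol -> table index
same = unicodedata.lookup("GREEK SMALL LETTER ALPHA")   # name -> symbol

print(f"U+{code_point:04X} {name}")
```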

Storage and binary representation

  • UTF-32 stores each symbol in 4 bytes, in one of two byte orders: big-endian or little-endian.
  • UTF-16 stores most of its symbols in 2 bytes; rarely used values are stored as a sequence of two 2-byte units ("prefix-value") called a "surrogate pair". It also comes in two byte orders: big-endian and little-endian.
  • UTF-8 was designed to be backward compatible with ASCII; a symbol is stored in 1, 2, 3, or 4 bytes. The scheme is defined as a sequence of bytes, so there is no ambiguity linked to endianness.
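
The storage sizes listed above can be checked with a few lines of Python. The four sample symbols are arbitrary choices for illustration; the "-le" and "-be" codec variants are used so that no byte order mark is added to the output.

```python
samples = {
    "LATIN SMALL LETTER A": "a",
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "EURO SIGN": "\u20ac",
    "GRINNING FACE": "\U0001F600",  # outside the 16-bit range
}

def sizes(symbol):
    """Return the storage size in bytes for UTF-8, UTF-16 and UTF-32."""
    codecs = ("utf-8", "utf-16-le", "utf-32-le")
    return tuple(len(symbol.encode(codec)) for codec in codecs)

for label, symbol in samples.items():
    print(label, sizes(symbol))

# Same symbol, two byte orders, and a surrogate pair for a rare symbol.
euro_be = "\u20ac".encode("utf-16-be")
euro_le = "\u20ac".encode("utf-16-le")
surrogates = "\U0001F600".encode("utf-16-be")
print(euro_be.hex(), euro_le.hex(), surrogates.hex())
```

The last line shows why UTF-16 and UTF-32 data must always carry, or agree on, a byte order, while the UTF-8 bytes of a symbol are the same on every machine.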

C and C++ support

There are three types of "char" in C and C++: plain (unqualified), signed and unsigned. The latter two were added to behave like 'int', as chars may be used to store small numbers. The C++ standard (working paper n1905, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1905.pdf) says:

3.9.1 Fundamental types [basic.fundamental]

Objects declared as characters (char) shall be large enough to store any member of the implementation's basic character set. If a character from this set is stored in a character object, the integral value of that character object is equal to the value of the single character literal form of that character. It is implementation-defined whether a char object can hold negative values. Characters can be explicitly declared unsigned or signed. Plain char, signed char, and unsigned char are three distinct types. A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements (basic.types); that is, they have the same object representation. For character types, all bits of the object representation participate in the value representation. For unsigned character types, all possible bit patterns of the value representation represent numbers. These requirements do not hold for other types. In any particular implementation, a plain char object can take on either the same values as a signed char or an unsigned char; which one is implementation-defined.

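What is important here is that the signedness of plain "char" is implementation-defined. Where plain char is signed, bytes with the eighth bit set (values 128 to 255) read back as negative numbers. Code that handles 8-bit text must therefore not assume either behavior, and should convert to "unsigned char" before using a byte as a number or as an index.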

In order to support "wide" characters with an extended range of values, the storage type wchar_t was added to the C standard. The size of wchar_t is system dependent: on Windows, it is 2 bytes, and on Linux and macOS it is 4 bytes. C++11 introduced the fixed-width character types char16_t and char32_t. Functions that take wchar_t instead of char arguments generally have a "w" in their name, for example wprintf or wcslen.
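
The effect of signedness and the platform-dependent size of wchar_t can be observed without writing C, using Python's standard struct and ctypes modules. This is only an illustrative sketch: the byte 0xE9 is an arbitrary example, and the format codes "b" and "B" stand for a C signed char and unsigned char respectively.

```python
import ctypes
import struct

raw = b"\xe9"  # one byte with the eighth bit set

as_signed = struct.unpack("b", raw)[0]      # read as a C signed char
as_unsigned = struct.unpack("B", raw)[0]    # read as a C unsigned char
wide_size = ctypes.sizeof(ctypes.c_wchar)   # size of wchar_t on this platform

print(as_signed, as_unsigned, wide_size)
```

The same bit pattern yields a negative or a positive number depending on the declared type, and wide_size is expected to be 2 on Windows and 4 on Linux and macOS.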

Character functions

Some basic operations on character strings are listed below:

  • length: how many symbols?
  • tests: is this symbol a letter, a number, a punctuation sign, etc.?
  • finding chars inside a string
  • concatenating strings
  • displaying strings

Here is a brief summary of the "features" of each encoding with respect to those functions:

| Function | UTF-8 | UTF-16 | UTF-32 |
| --- | --- | --- | --- |
| length | number of symbols <= storage length | number of symbols <= storage length | number of symbols proportional to storage length |
| tests | must be aware of UTF-8 and locales | must be aware of UTF-16, locales AND endianness | must be aware of UTF-32, locales AND endianness |
| finding symbols | must implement a sequential machine for prefix codes or be 8-bit clean and be able to locate sequences | must implement a sequential machine for prefix codes; must be aware of endianness | must be aware of endianness |
| concatenating strings | must be 8-bit compatible | must verify that both strings have the same endianness | must verify that both strings have the same endianness |
| displaying strings | must pass it to external app without truncating the 8th bit | must ensure the external app is UTF-16; must check for endianness | must ensure the external app is UTF-32; must check for endianness |
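
A few rows of this summary can be illustrated in Python. The sample strings are arbitrary; the sketch shows the gap between symbol count and storage length in UTF-8, an 8-bit clean byte search, and what happens when two UTF-16 strings with different endianness are concatenated without checking.

```python
text = "na\u00efve caf\u00e9"   # "naïve café": 10 symbols
needle = "caf\u00e9"

# length: the symbol count is at most the storage length
utf8 = text.encode("utf-8")
symbol_count = len(text)
storage_length = len(utf8)

# finding symbols: an 8-bit clean byte search locates the whole sequence,
# but reports a byte offset rather than a symbol index
symbol_index = text.find(needle)
byte_index = utf8.find(needle.encode("utf-8"))

# concatenating strings: mixing byte orders silently corrupts the result
mixed = "ab".encode("utf-16-le") + "cd".encode("utf-16-be")
garbled = mixed.decode("utf-16-le")

print(symbol_count, storage_length, symbol_index, byte_index, ascii(garbled))
```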

Cons

  • UTF-8: not all byte sequences are valid.
  • UTF-16: not all sequences are valid. For ASCII-only text, storage is twice that of ASCII, so 50% of the memory is wasted. When transmitting strings containing Western text, around half of the bytes are typically zeros.
  • UTF-32: not all sequences are valid. For ASCII-only text, storage is four times that of ASCII, so 75% of the memory is wasted. When transmitting strings containing Western text, around three-quarters of the bytes are typically zeros.
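
Both drawbacks can be reproduced with a short Python sketch. The sample text and the three invalid byte sequences are chosen for illustration only; zero_fraction and is_valid are hypothetical helper names.

```python
ascii_text = "Hello, Octave!"

def zero_fraction(data):
    """Fraction of bytes equal to zero in an encoded string."""
    return data.count(0) / len(data)

def is_valid(data, codec):
    """True if `data` decodes without error under `codec`."""
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False

utf16 = ascii_text.encode("utf-16-le")
utf32 = ascii_text.encode("utf-32-le")
print(zero_fraction(utf16), zero_fraction(utf32))

# One invalid sequence per encoding.
invalid = {
    "utf-8": b"\xff",                    # 0xFF never occurs in UTF-8
    "utf-16-le": b"\x00\xd8\x41\x00",    # high surrogate not followed by a low one
    "utf-32-le": b"\x00\x00\x11\x00",    # 0x110000 is beyond the last code point
}
for codec, data in invalid.items():
    print(codec, is_valid(data, codec))
```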

Octave and international support

Graphs

Special symbols may be needed in graph titles, labels, legends, tick marks and inside graphs (e.g. text objects).

Locales

Caution: locale settings should not be confused with character sets. Locales affect the conversion of strings to numbers and vice versa when the decimal separator is not the point '.'. Date-time formatting (datestr, datenum, strftime, weekday, etc.) and string collation (sorting of char arrays) are also affected.
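
The decimal separator issue is independent of the character encoding, as the following minimal Python sketch shows. The helpers parse_decimal and format_decimal are hypothetical names introduced for illustration; they only swap the separator and ignore everything else a real locale covers, such as thousands separators.

```python
def parse_decimal(text, decimal_sep=","):
    """Convert a numeric string that uses `decimal_sep` into a float."""
    return float(text.replace(decimal_sep, "."))

def format_decimal(value, decimal_sep=",", digits=2):
    """Format a float with `digits` decimals and the given separator."""
    return f"{value:.{digits}f}".replace(".", decimal_sep)

# A parser that only knows '.' rejects the comma form outright.
try:
    float("3,14")
    plain_parse_ok = True
except ValueError:
    plain_parse_ok = False

print(plain_parse_ok, parse_decimal("3,14"), format_decimal(2.5))
```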

The state of Octave

Octave "by accident" supports UTF-8, meaning that the vast majority of functions for text display and graph manipulation use 8-bit chars and pass them unmodified to the underlying layers in charge of rendering.

Functions like 'length' and 'size' return the space required for string storage, which may be greater than the effective number of symbols.

Tests like 'isalpha' are modeled on their lower-level C counterparts and are aware of neither UTF-8 nor locales.
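
The two behaviors just described can be emulated in Python by working on the UTF-8 bytes directly. This is only an emulation under stated assumptions, not Octave's implementation: Python's ASCII-only bytes.isalpha stands in for a byte-wise C test, and count_symbols is a hypothetical helper that recovers the effective number of symbols.

```python
def count_symbols(utf8_bytes):
    """Count code points by skipping UTF-8 continuation bytes (10xxxxxx)."""
    return sum(1 for b in utf8_bytes if (b & 0xC0) != 0x80)

word = "\u00e9t\u00e9"          # "été": 3 symbols
stored = word.encode("utf-8")   # what an 8-bit char array would hold

storage_length = len(stored)                              # byte count
bytewise_alpha = [bytes([b]).isalpha() for b in stored]   # ASCII-only test per byte
unicode_alpha = [ch.isalpha() for ch in word]             # Unicode-aware test

print(storage_length, count_symbols(stored), bytewise_alpha, unicode_alpha)
```

Only the byte for "t" passes the byte-wise test, although all three symbols are letters.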

String search should be OK provided the two arguments are UTF-8.

String concatenation works with respect to UTF-8.

Displaying strings works in most circumstances except in help messages (a bug report and patch were recently provided).

Non-ASCII strings in paths and filenames do not work with the glob function on Windows.

Development

  • short term: tests to ensure that all string processing is 8-bit clean
  • medium and long term: there are a number of options to fully support whatever symbols exist in Unicode:
    • make use of the C and C++ types wchar_t, char16_t, or char32_t
    • make use of ICU (http://site.icu-project.org/), an open-source library with various Unicode support functions
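
One possible shape for the short-term tests is sketched below in Python: feed every byte value through a string-processing routine and check that nothing is altered. The names is_8bit_clean, passthrough and strip_high_bit are hypothetical; real tests would target actual Octave functions instead.

```python
ALL_BYTES = bytes(range(256))

def is_8bit_clean(process):
    """True if `process` returns every possible byte value unchanged."""
    return process(ALL_BYTES) == ALL_BYTES

def passthrough(data):
    # Well-behaved routine: copies the bytes as they are.
    return bytes(data)

def strip_high_bit(data):
    # Deliberately broken routine: clears the eighth bit, as 7-bit-only code would.
    return bytes(b & 0x7F for b in data)

print(is_8bit_clean(passthrough), is_8bit_clean(strip_high_bit))
```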