International Characters Support

<q>Objects declared as characters (char) shall be large enough to store any member of the implementation's basic character set. If a character from this set is stored in a character object, the integral value of that character object is equal to the value of the single character literal form of that character. It is implementation-defined whether a char object can hold negative values. Characters can be explicitly declared unsigned or signed. Plain char, signed char, and unsigned char are three distinct types. A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements (basic.types); that is, they have the same object representation. For character types, all bits of the object representation participate in the value representation. For unsigned character types, all possible bit patterns of the value representation represent numbers. These requirements do not hold for other types. In any particular implementation, a plain char object can take on either the same values as a signed char or an unsigned char; which one is implementation-defined.</q>
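
The guarantees in the quoted passage can be checked at compile time. The following is a minimal sketch, not code from Octave; the helper names <code>char_types_are_distinct</code>, <code>char_types_share_storage</code> and <code>plain_char_is_signed</code> are made up for illustration.

```cpp
#include <limits>
#include <type_traits>

// Hypothetical helper: plain char, signed char and unsigned char are
// three distinct types, even though plain char behaves like one of the
// other two on any given implementation.
constexpr bool char_types_are_distinct() {
    return !std::is_same<char, signed char>::value
        && !std::is_same<char, unsigned char>::value
        && !std::is_same<signed char, unsigned char>::value;
}

// Hypothetical helper: the three types occupy the same amount of
// storage and have the same alignment requirements.
constexpr bool char_types_share_storage() {
    return sizeof(char) == sizeof(signed char)
        && sizeof(char) == sizeof(unsigned char)
        && alignof(char) == alignof(signed char)
        && alignof(char) == alignof(unsigned char);
}

// Hypothetical helper: whether plain char can hold negative values.
// The result is implementation-defined and differs between platforms.
constexpr bool plain_char_is_signed() {
    return std::numeric_limits<char>::is_signed;
}
```

The first two helpers should return true on every conforming compiler; the third one is the part that varies between platforms and compiler options.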


What matters here is that ordinary characters should be declared as <code>char</code> or <code>signed char</code>. Declared as <code>unsigned char</code>, they MAY be subject to truncation of the eighth bit; this is implementation-dependent.


In order to support "wide" characters with an extended range of values, the storage type <code>wchar_t</code> was added to the C standard. The size of <code>wchar_t</code> is system dependent: on Windows, it is 2 bytes, and on Linux and macOS it is 4 bytes. C++11 introduced defined-size wide character types <code>char16_t</code> and <code>char32_t</code>. Functions whose argument is <code>wchar_t</code> instead of <code>char</code> are generally prefixed by "w".
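
As an illustration of the "w" naming convention, here is a small sketch that pairs the narrow <code>std::strlen</code> with its wide counterpart <code>std::wcslen</code>. The wrapper names <code>narrow_length</code>, <code>wide_length</code> and <code>wide_char_bytes</code> are hypothetical and exist only for this example.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical wrapper: counts char elements up to the terminating '\0'.
std::size_t narrow_length(const char* s) {
    return std::strlen(s);
}

// Hypothetical wrapper: the wide counterpart counts wchar_t elements
// up to the terminating L'\0'.
std::size_t wide_length(const wchar_t* s) {
    return std::wcslen(s);
}

// Hypothetical wrapper: size of one wchar_t in bytes. The value is
// system dependent, so code should not hard-code it.
std::size_t wide_char_bytes() {
    return sizeof(wchar_t);
}
```

Both length functions count storage elements, not bytes: <code>wide_length(L"abc")</code> is 3 whatever <code>wide_char_bytes()</code> returns on the platform.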


=Character functions=
Functions like <code>length</code> and <code>size</code> return the number of bytes used to store the string, which may be greater than the number of symbols it actually contains.
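
The difference between storage size and symbol count can be sketched in C++ with <code>std::string</code>. The function <code>utf8_code_points</code> below is a hypothetical helper, not an existing Octave or library function, and it assumes its input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string by
// skipping continuation bytes, which always have the form 10xxxxxx.
// Assumes the input is valid UTF-8; no validation is performed.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or ASCII byte: starts a new code point
        }
    }
    return count;
}
```

For the word "été", stored as the bytes <code>C3 A9 74 C3 A9</code>, <code>size()</code> reports 5 while <code>utf8_code_points</code> reports 3. Note that a code point is still not necessarily one displayed symbol, since combining characters are counted separately.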


Tests like <code>isalpha</code> are modeled on their lower-level C counterparts and are aware of neither UTF-8 nor locales.
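
The underlying C behaviour can be seen with <code>std::isalpha</code> from <code>&lt;cctype&gt;</code>, which classifies one byte at a time. This is a sketch of the C-level behaviour only, under the assumption that the program runs in the default "C" locale; <code>count_alpha_bytes</code> is a hypothetical helper.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: counts the bytes that std::isalpha accepts.
// Each byte is cast to unsigned char first, because passing a negative
// char value to std::isalpha is undefined behaviour.
std::size_t count_alpha_bytes(const std::string& s) {
    std::size_t count = 0;
    for (char c : s) {
        if (std::isalpha(static_cast<unsigned char>(c))) {
            ++count;
        }
    }
    return count;
}
```

In the "C" locale, the two bytes <code>C3 A9</code> that encode "é" are both rejected, so the letter is not recognised as alphabetic even though each ASCII letter is.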


String search should be OK provided the two arguments are UTF-8.
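
Byte-wise search is safe here because UTF-8 is self-synchronizing: the encoding of one character never appears inside the encoding of another. A minimal sketch using <code>std::string::find</code> follows; <code>find_utf8</code> is a hypothetical wrapper, and both arguments are assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical wrapper: plain byte-wise search. With two valid UTF-8
// arguments a match always starts on a character boundary.
// The result is a byte offset, not a character index.
std::size_t find_utf8(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle);
}
```

Searching for "é" in "café noir" returns 3 here. The byte offset and the character index happen to coincide only because the preceding characters are ASCII; after a multi-byte character the two would differ.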


String concatenation works with UTF-8 strings.


Displaying strings works in most circumstances, except in help messages (a bug report and patch were recently provided).


Non-ASCII strings in paths and filenames don't work with the <code>glob</code> function on Windows.


== Development ==
* short term: tests to ensure that all string processing is 8-bit clean
* medium and long term: there are a number of options to fully support whatever symbols exist in Unicode:
** make use of C <code>wchar_t</code>, <code>char16_t</code>, or <code>char32_t</code> types
** make use of ICU [http://site.icu-project.org/], an open-source lib with various Unicode support functions
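
To give an idea of what the fixed-size character type option involves, here is a minimal sketch that decodes UTF-8 into a <code>std::u32string</code>, where each <code>char32_t</code> element holds one code point. It is an illustration under stated assumptions, not a proposed implementation: <code>utf8_to_utf32</code> is a hypothetical name, malformed sequences are replaced by U+FFFD, and overlong encodings and surrogate values are not rejected. A library such as ICU handles those cases.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 into UTF-32, one code point per
// element. Truncated or malformed sequences yield U+FFFD. Overlong
// encodings and surrogates are NOT detected in this sketch.
std::u32string utf8_to_utf32(const std::string& in) {
    const char32_t replacement = 0xFFFD;
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F; extra = 1;
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F; extra = 2;
        } else if ((lead & 0xF8) == 0xF0) {
            cp = lead & 0x07; extra = 3;
        } else {
            // Stray continuation byte or invalid lead byte.
            out.push_back(replacement);
            ++i;
            continue;
        }
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= in.size()) { ok = false; break; }
            const unsigned char c = static_cast<unsigned char>(in[i]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);  // append 6 payload bits
            ++i;
        }
        out.push_back(ok ? cp : replacement);
    }
    return out;
}
```

Once text is held this way, indexing and length are expressed in code points, at the cost of four bytes per element and a conversion at every boundary with byte-oriented interfaces.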


[[Category:Development]]