
Misuse of wchar_t for UTF-16/32 #180

@glebov-andrey

Description


Even though wchar_t has an implementation-defined size in C and C++, LittleCMS uses it to store UTF-16 strings.

The problem is that on platforms with a 32-bit wchar_t (e.g. Linux, OS X), each UTF-16 code unit (e.g. read from a file) is simply cast to wchar_t by _cmsReadWCharArray, which does not produce a valid UTF-32 sequence. Producing one would require decoding each Unicode code point from one or two UTF-16 code units.
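For illustration, here is a minimal sketch of the decoding step a correct read path would need on a 32-bit wchar_t platform. The helper name decode_utf16_to_utf32 is mine, not part of LittleCMS:

```c
#include <stddef.h>
#include <stdint.h>

/* Decode a UTF-16 sequence into UTF-32 code points. Returns the number of
   code points written to dst, or (size_t)-1 on an invalid surrogate
   sequence. Hypothetical helper for illustration only. */
static size_t decode_utf16_to_utf32(const uint16_t *src, size_t src_len,
                                    uint32_t *dst)
{
    size_t n = 0;
    for (size_t i = 0; i < src_len; i++) {
        uint16_t u = src[i];
        if (u >= 0xD800 && u <= 0xDBFF) {          /* high surrogate */
            if (i + 1 >= src_len) return (size_t)-1;
            uint16_t lo = src[++i];
            if (lo < 0xDC00 || lo > 0xDFFF) return (size_t)-1;
            dst[n++] = 0x10000u + (((uint32_t)(u - 0xD800) << 10) |
                                   (uint32_t)(lo - 0xDC00));
        } else if (u >= 0xDC00 && u <= 0xDFFF) {   /* stray low surrogate */
            return (size_t)-1;
        } else {
            dst[n++] = u;                          /* BMP code point */
        }
    }
    return n;
}
```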

To make things worse, _cmsWriteWCharArray just down-casts wchar_t to cmsUInt16Number, discarding the upper 16 bits of each UTF-32 code unit. This of course does not convert UTF-32 to UTF-16; the correct conversion would decompose each Unicode code point into one or two UTF-16 code units.
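The reverse direction would have to decompose each code point rather than truncate it. Again a sketch with a hypothetical helper name, not LittleCMS code:

```c
#include <stdint.h>

/* Encode one UTF-32 code point as one or two UTF-16 code units.
   Returns the number of units written (1 or 2), or 0 if cp is not a
   valid Unicode scalar value. Hypothetical helper for illustration. */
static int encode_utf32_to_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* lone surrogate: invalid */
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp > 0x10FFFF) return 0;                    /* beyond Unicode range */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));       /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));     /* low surrogate  */
    return 2;
}
```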

Apart from these problems, there is also the minor matter of wasted memory: in each 32-bit wchar_t, 16 bits go unused.

At the very minimum, I would propose documenting the fact that strings returned by LittleCMS functions as wchar_t arrays are not valid UTF-32 strings on platforms with a 32-bit wchar_t. Instead, the user must do their own conversion, taking the lower 16 bits of each value to form a valid UTF-16 sequence. Likewise, functions that take wchar_t strings should be documented as taking UTF-16 sequences in which each code unit is stored in a wchar_t.

Of course, this behaviour is neither intuitive nor expected by the user, and it makes handling textual data with LittleCMS more complex (not to mention resource-intensive, as each string conversion may require allocating extra memory). For instance, this confusion can cause issues such as #163.

That is why I also propose deprecating the functions that use wchar_t in favour of new functions that operate on a new cmsU16Character type (always 16 bits; the name is definitely up for discussion) and explicitly state that strings are encoded as UTF-16 in the native byte order. (I would say just use char16_t, but that would require C11/C++11, so a custom type is probably a better choice for compatibility.) The new functions could use a UTF16 postfix; see the sketch after this paragraph.
With these new functions the internals would need to be reworked: instead of storing Unicode text as wchar_t arrays, it should be stored as cmsU16Character arrays. As an added bonus, this would allow for more efficient I/O functions that process multiple characters at once instead of one character at a time (with the byte-swapping if-statement executed only once per character block).
The deprecated functions would just do the up/down-casting themselves instead of deferring it to the I/O layer.
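To make the proposal concrete, the new type and entry points might look roughly like this. Every name here, including cmsU16Character and the UTF16 postfix, is a placeholder; the signatures merely mirror the existing cmsMLUsetWide/cmsMLUgetWide:

```c
#include <stdint.h>
#include "lcms2.h"   /* for cmsBool, cmsMLU, cmsUInt32Number */

/* Always exactly 16 bits, unlike wchar_t. Placeholder name. */
typedef uint16_t cmsU16Character;

/* Proposed UTF-16 counterparts of cmsMLUsetWide/cmsMLUgetWide.
   Strings are UTF-16 in the native byte order. Sketch only. */
cmsBool cmsMLUsetUTF16(cmsMLU* mlu,
                       const char LanguageCode[3], const char CountryCode[3],
                       const cmsU16Character* UTF16String);

cmsUInt32Number cmsMLUgetUTF16(const cmsMLU* mlu,
                               const char LanguageCode[3], const char CountryCode[3],
                               cmsU16Character* Buffer, cmsUInt32Number BufferSize);
```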

Another issue I see is with the ASCII functions: the conversion from UTF-16 to ASCII (cmsMLUgetASCII) is only valid if the original string contains no non-ASCII code points. A better way to handle this might be to let the user replace non-ASCII code points with a character of their choosing (e.g. ?). Of course, this would require handling strings as actual sequences of code points instead of opaque code units. (I'm not 100% sure, but this probably does not even require fully decoding the UTF-32 code point from UTF-16, just skipping over some of the UTF-16 code units; see the sketch below.)
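As a sketch of that last point, with a hypothetical helper (not an existing API): one replacement character is emitted per code point, and the low half of each surrogate pair is skipped without ever computing the full code point:

```c
#include <stddef.h>
#include <stdint.h>

/* Convert UTF-16 to ASCII, replacing every non-ASCII code point with
   'replacement' (e.g. '?'). A surrogate pair encodes one code point, so
   its low half is skipped rather than replaced a second time. dst must
   hold at least src_len bytes; returns the number of bytes written.
   Illustration only. */
static size_t utf16_to_ascii_lossy(const uint16_t *src, size_t src_len,
                                   char *dst, char replacement)
{
    size_t n = 0;
    for (size_t i = 0; i < src_len; i++) {
        uint16_t u = src[i];
        if (u < 0x80) {
            dst[n++] = (char)u;                     /* plain ASCII */
        } else {
            dst[n++] = replacement;                 /* one per code point */
            if (u >= 0xD800 && u <= 0xDBFF &&
                i + 1 < src_len &&
                src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF)
                i++;                                /* skip low surrogate */
        }
    }
    return n;
}
```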

P.S. I would be glad to create a pull request with some of the proposed changes if a decision is made to proceed with them.
