
Misuse of wchar_t for UTF-16/32 #180

@glebov-andrey

Description


Even though wchar_t has an implementation-defined size in C and C++, LittleCMS uses it to store UTF-16 strings.

The problem is that on platforms with a 32-bit wchar_t (e.g. Linux, OS X), each UTF-16 code unit (e.g. read from a file) is simply cast to wchar_t by _cmsReadWCharArray, which does not produce a valid UTF-32 sequence. Producing one would require decoding each Unicode code point from one or two UTF-16 code units.
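For illustration, here is a minimal sketch of the decoding step a correct read path would need on a 32-bit wchar_t platform. The helper name decode_utf16_to_utf32 is mine, not part of LittleCMS:

```c
#include <stddef.h>
#include <stdint.h>

/* Decode a UTF-16 sequence into UTF-32 code points. Returns the number of
   code points written to dst, or (size_t)-1 on an invalid surrogate
   sequence. Hypothetical helper for illustration only. */
static size_t decode_utf16_to_utf32(const uint16_t *src, size_t src_len,
                                    uint32_t *dst)
{
    size_t n = 0;
    for (size_t i = 0; i < src_len; i++) {
        uint16_t u = src[i];
        if (u >= 0xD800 && u <= 0xDBFF) {          /* high surrogate */
            if (i + 1 >= src_len) return (size_t)-1;
            uint16_t lo = src[++i];
            if (lo < 0xDC00 || lo > 0xDFFF) return (size_t)-1;
            dst[n++] = 0x10000u + (((uint32_t)(u - 0xD800) << 10) |
                                   (uint32_t)(lo - 0xDC00));
        } else if (u >= 0xDC00 && u <= 0xDFFF) {   /* stray low surrogate */
            return (size_t)-1;
        } else {
            dst[n++] = u;                          /* BMP code point */
        }
    }
    return n;
}
```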

To make things worse, _cmsWriteWCharArray just down-casts wchar_t to cmsUInt16Number, discarding the upper 16 bits of each UTF-32 code unit. This of course does not convert UTF-32 to UTF-16; the correct conversion would decompose each Unicode code point into one or two UTF-16 code units.
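The reverse direction would have to decompose each code point rather than truncate it. Again a sketch with a hypothetical helper name, not LittleCMS code:

```c
#include <stdint.h>

/* Encode one UTF-32 code point as one or two UTF-16 code units.
   Returns the number of units written (1 or 2), or 0 if cp is not a
   valid Unicode scalar value. Hypothetical helper for illustration. */
static int encode_utf32_to_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* lone surrogate: invalid */
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp > 0x10FFFF) return 0;                    /* beyond Unicode range */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));       /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));     /* low surrogate  */
    return 2;
}
```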

Apart from these problems, there is also the minor matter of wasted memory: in each 32-bit wchar_t, 16 bits go unused.

At the very minimum, I would propose documenting the fact that strings returned by LittleCMS functions as wchar_t arrays are not valid UTF-32 strings on platforms with a 32-bit wchar_t. Instead, the user must do their own conversion, taking the lower 16 bits of each value to form a valid UTF-16 sequence. Likewise, functions that take wchar_t strings should be documented as taking UTF-16 sequences in which each code unit is stored in a wchar_t.

Of course, this behaviour is neither intuitive nor expected by the user, and it makes handling textual data with LittleCMS more complex (not to mention resource-intensive, as each string conversion may require allocating extra memory). For instance, this confusion can cause issues such as #163.

That is why I also propose deprecating the functions that use wchar_t in favour of new functions that operate on a new cmsU16Character type (always 16 bits; the name is definitely up for discussion) and explicitly state that strings are encoded as UTF-16 in the native byte order. (I would say just use char16_t, but that would require C11/C++11, so a custom type is probably a better choice for compatibility.) The new functions could use a UTF16 postfix; see the sketch after this paragraph.
With these new functions the internals would need to be reworked: instead of storing Unicode text as wchar_t arrays, it should be stored as cmsU16Character arrays. As an added bonus, this would allow for more efficient I/O functions that process multiple characters at once instead of one character at a time (with the byte-swapping if-statement executed only once per character block).
The deprecated functions would just do the up/down-casting themselves instead of deferring it to the I/O layer.
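To make the proposal concrete, the new type and entry points might look roughly like this. Every name here, including cmsU16Character and the UTF16 postfix, is a placeholder; the signatures merely mirror the existing cmsMLUsetWide/cmsMLUgetWide:

```c
#include <stdint.h>
#include "lcms2.h"   /* for cmsBool, cmsMLU, cmsUInt32Number */

/* Always exactly 16 bits, unlike wchar_t. Placeholder name. */
typedef uint16_t cmsU16Character;

/* Proposed UTF-16 counterparts of cmsMLUsetWide/cmsMLUgetWide.
   Strings are UTF-16 in the native byte order. Sketch only. */
cmsBool cmsMLUsetUTF16(cmsMLU* mlu,
                       const char LanguageCode[3], const char CountryCode[3],
                       const cmsU16Character* UTF16String);

cmsUInt32Number cmsMLUgetUTF16(const cmsMLU* mlu,
                               const char LanguageCode[3], const char CountryCode[3],
                               cmsU16Character* Buffer, cmsUInt32Number BufferSize);
```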

Another issue I see is with the ASCII functions: the conversion from UTF-16 to ASCII (cmsMLUgetASCII) is only valid if the original string contains no non-ASCII code points. A better way to handle this might be to let the user replace non-ASCII code points with a character of their choosing (e.g. ?). Of course, this would require handling strings as actual sequences of code points instead of opaque code units. (I'm not 100% sure, but this probably does not even require fully decoding the UTF-32 code point from UTF-16, just skipping over some of the UTF-16 code units; see the sketch below.)
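As a sketch of that last point, with a hypothetical helper (not an existing API): one replacement character is emitted per code point, and the low half of each surrogate pair is skipped without ever computing the full code point:

```c
#include <stddef.h>
#include <stdint.h>

/* Convert UTF-16 to ASCII, replacing every non-ASCII code point with
   'replacement' (e.g. '?'). A surrogate pair encodes one code point, so
   its low half is skipped rather than replaced a second time. dst must
   hold at least src_len bytes; returns the number of bytes written.
   Illustration only. */
static size_t utf16_to_ascii_lossy(const uint16_t *src, size_t src_len,
                                   char *dst, char replacement)
{
    size_t n = 0;
    for (size_t i = 0; i < src_len; i++) {
        uint16_t u = src[i];
        if (u < 0x80) {
            dst[n++] = (char)u;                     /* plain ASCII */
        } else {
            dst[n++] = replacement;                 /* one per code point */
            if (u >= 0xD800 && u <= 0xDBFF &&
                i + 1 < src_len &&
                src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF)
                i++;                                /* skip low surrogate */
        }
    }
    return n;
}
```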

P.S. I would be glad to create a pull request with some of the proposed changes if a decision is made to proceed with them.
