Khmer Encoding Structure

A primary goal of this revision of the encoding is to ensure that each visual form has exactly one encoding: no two distinct strings of code points should render identically. Otherwise, users become confused, certain words are difficult to type consistently and correctly, and searching and sorting become unreliable.

This research documents existing issues in how users of the language currently type, and recommends what can be done to mitigate those issues effectively and efficiently without burdening users. The mitigation is simplified typing that does not require any prior knowledge of Khmer Unicode: users can type however they want, knowing that the output strings are always consistent.

Why Do We Need Consistency in Encoding?

The current situation for Khmer is that users are expected to have a relatively deep understanding of how Khmer is encoded in Unicode in order to type correctly. They are expected to type their coengs, vowels and other diacritics and signs in the right order, with nothing to help them visually.

To make matters worse, there is often more than one way to encode a Khmer phrase and have it render perfectly, yet only one way is the “correct” way to encode it. For those with a strong linguistic awareness and extensive experience of how the Khmer script is stored in Unicode, this seems to pose no problem. But for many users it does.

Mitigations

We aim to help implementers create solutions that let users type their language correctly without having to know about Unicode. We hope that this document will lead to others that enable font designers and keyboard implementers to produce fonts and tools that work consistently and support consistent data entry and re-rendering. We do this by first concentrating on describing an unambiguous encoding structure for Khmer.

If there is no expected order, users have to be aware of encoding issues. This is the problem this document attempts to resolve: what is the agreed single correct order to be used?

One Visual Form, One Encoding

If there is only one way to encode something, it is easier to produce a system that works with that one way. Thus, if there is only one way to store a word, an input method can accept the variety of ways that a user might type the word and normalize them into the single correct way. On the other hand, if there are multiple ways to encode a word, the input method cannot do this, and the user is expected to resolve the ambiguity and pick the right ordering.

In the technical context where a keyboard only allows a user to type single code points (or short sequences), there has been no other option. Users want to be able to type in a variety of orders and therefore have made use of the existing ambiguities in the encoding. But modern keyboard applications are far more sophisticated and are capable of allowing different typing orders and outputting a single correct order.
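The idea of accepting several typing orders and emitting one stored order can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the character classes and the target order (coeng pairs, then dependent vowels, then a small set of final signs) are deliberately simplified and are not the full structure this document defines, and the names `normalize`, `rank` and `CLUSTER` are hypothetical.

```python
import re

# Illustrative, simplified character classes (assumptions for this sketch,
# not the full encoding structure).
COENG = "\u17D2"              # KHMER SIGN COENG
CONSONANT = "\u1780-\u17A2"   # KA .. QA
VOWEL = "\u17B6-\u17C5"       # dependent vowels AA .. AU
FINAL_SIGN = "\u17C6-\u17C8"  # NIKAHIT, REAHMUK, YUUKALEAPINTU

# One "piece" is a coeng + consonant pair, a dependent vowel, or a final sign.
PIECE = f"{COENG}[{CONSONANT}]|[{VOWEL}]|[{FINAL_SIGN}]"
PIECE_RE = re.compile(PIECE)
# A cluster is a base consonant followed by any run of pieces, in any order.
CLUSTER = re.compile(f"([{CONSONANT}])((?:{PIECE})*)")


def rank(piece: str) -> int:
    """Assumed target order: coeng pairs, then vowels, then final signs."""
    if piece.startswith(COENG):
        return 0
    if "\u17B6" <= piece <= "\u17C5":
        return 1
    return 2


def normalize(text: str) -> str:
    """Reorder the pieces of every cluster into the assumed target order."""
    def reorder(match: re.Match) -> str:
        base, tail = match.group(1), match.group(2)
        # sorted() is stable, so pieces of the same class keep their order.
        pieces = sorted(PIECE_RE.findall(tail), key=rank)
        return base + "".join(pieces)
    return CLUSTER.sub(reorder, text)


# Two typing orders for the same word: KHA, COENG, MO, AE, RO
# versus KHA, AE, COENG, MO, RO.
typed_a = "\u1781\u17D2\u1798\u17C2\u179A"
typed_b = "\u1781\u17C2\u17D2\u1798\u179A"
print(normalize(typed_a) == normalize(typed_b))  # True
```

Whatever order the user types in, only one code point sequence is stored, so the ambiguity is resolved by the tool and not by the user.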

Tools

With a well-formed regex defined for the orthographic syllable structure, we put the idea of “One visual form, one encoding” into practice. Fonts should expose invalid character sequences so that users can notice and correct their mistakes manually, or keyboards should be smart enough to transform an invalid string into a valid one. Developers can also take advantage of a normalizer that converts Modern Khmer text to the encoding defined previously, in such a way that it looks the same as the input text. Thus, if the original contains bad spelling (for example, inappropriate multiple vowels), the normalizer fixes the errors and returns only the corresponding valid strings.
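A well-formedness check of this kind can be sketched as a regex match. The pattern below is a heavily simplified stand-in, not the syllable regex this document defines: it assumes a base consonant, up to two coeng pairs, at most one dependent vowel and at most one final sign, and the name `is_well_formed` is hypothetical.

```python
import re

# Simplified character classes (assumptions for this sketch only).
COENG = "\u17D2"                # KHMER SIGN COENG
CONSONANT = "[\u1780-\u17A2]"   # KA .. QA
VOWEL = "[\u17B6-\u17C5]"       # dependent vowels AA .. AU
FINAL_SIGN = "[\u17C6-\u17C8]"  # NIKAHIT, REAHMUK, YUUKALEAPINTU

# Assumed syllable shape: base, up to two coeng pairs, optional vowel,
# optional final sign, in exactly that order.
SYLLABLE = f"{CONSONANT}(?:{COENG}{CONSONANT}){{0,2}}{VOWEL}?{FINAL_SIGN}?"
WELL_FORMED = re.compile(f"(?:{SYLLABLE})+")


def is_well_formed(word: str) -> bool:
    """Return True if the whole word is a sequence of valid syllables."""
    return WELL_FORMED.fullmatch(word) is not None


print(is_well_formed("\u1781\u17D2\u1798\u17C2\u179A"))  # True: expected order
print(is_well_formed("\u1781\u17C2\u17D2\u1798\u179A"))  # False: vowel before coeng
print(is_well_formed("\u1780\u17B6\u17B6"))              # False: two vowels
```

A font or keyboard could use such a check to flag or repair a sequence, while a normalizer would go one step further and rewrite the invalid string into its valid equivalent.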

Khmer Busra Font

ensure no identical rendering of the same words with different encodings

Khmer Angkor Keyboard

ease the typing experiences and ensure consistent outputs

Normalizer Code

normalize text data and ensure there are no ambiguous strings in the mix

Try it for yourself here!
