Sunday, June 15, 2014

Beginning UTF-8 Implementation

Over the last week and a half, I've begun working on implementing UTF-8 based unicodes in PyPy.

As I've stated in a previous blog post, my goal is to replace the use of the RPython unicode type in PyPy with a PyPy-specific UTF-8 implementation.  The motivation for doing this is to simplify PyPy's unicode support by providing a single implementation and application-level interface irrespective of the platform.  Presently, the size of the system's wchar_t type determines which representation is used, from both perspectives.  The new implementation will represent the unicode strings internally as an RPython string of bytes, but the user will interact with the application-level unicode string on a code point basis.  In other words, the user will have a "UTF-32" interface.

Since Python supports converting unicode objects into UTF-8 byte strings and vice-versa, RPython helper functions already exist for a lot of the handling of UTF-8. I am largely reusing these with minimal modifications.  As a result, most of my work is a matter of fixing the various places in PyPy where the built-in/RPython unicode type is expected and replacing unicode literals.  That said, I plan on leaving unicode literals in (the interpreter-level) tests in place. Replacing them serves little useful purpose and would make the tests more annoying to read and to write.

For the time being I'm not paying very much attention to optimization (with the exception of special cases for ASCII only strings.)  Things will be slow for a little while; random access before any optimizations are added will be O(n).  However, the goal is correctness first and this won't be merged if it creates any serious performance regressions.  That said, I already have some ideas about optimizations and hopefully I'll be far enough along in my work to be able to write more about those next time.

No comments:

Post a Comment