Tyler Wade GSoC 2014: UTF-8 Progress Update and Examples of Index Caching

Progress update: I'm still working on replace unicodes in Pypy with a UTF-8 implementation. Things are progressing more slowly than I'd hoped they might. At this point I have most of the codecs working and most of the application-level unicode tests passing, but there are still a lot of little corner cases to fix and I don't even want to think about regex right now. That said, the summer is only about half over.

Since I wanted to have something interesting to write about, I've done some looking into how other implementations that use a UTF-8 representation handle index caching (if at all) and thought I'd share what I've found.

Relatively few languages with built-in unicode support use a UTF-8 representation. The majority seem to prefer UTF-16, for better or for worse. The implementations that use UTF-8 I found which are worth mentioning are Perl and wxWidgets.

wxWidgets by default uses UTF-16 or UTF-32 depending on the size of wchar_t, but it has an option to using UTF-8. When UTF-8 is enabled, wxWidgets uses an extremely simple index caching system. It caches the byte position of the last accessed character. When indexing a string, it starts searching from the cached byte position if the index being looked up is higher than the cached index, otherwise it searches from the start of the string. It will never search backwards. The idea here is to make in-order traversal -- like a typical for(int i = 0; i < s.size(); i++) loop -- O(n). Unfortunately, a backwards traversal is still O(n^2) and random access is still O(n).

Perl's index caching is much more elaborate. Perl caches two character/byte-index pairs. When indexing a string, the search will start from the closest of the cached indices, the start, or the end of the string. It can count backwards as well as forwards through the string, assuming that counting backwards is approximately twice as slow as counting forward.

Perl's method for selecting the cached indices is also more sophisticated than wx's. Initially, the first two indices accessed are used for the cached indices. After the first two are cached, when a new index is accessed, it considers replacing the current two cached indices with one of the two possible pairs of one of the old indices and and the new index. It does so by selecting the root-mean-square distance between the start of the string, the first index, the second index and the end of the string for each the 3 possible pairs of indices. That's probably as clear as mud, so this maybe this expert from the Perl source will help:0

#define THREEWAY_SQUARE(a,b,c,d) \
            ((float)((d) - (c))) * ((float)((d) - (c))) \
            + ((float)((c) - (b))) * ((float)((c) - (b))) \
               + ((float)((b) - (a))) * ((float)((b) - (a)))

The pair that minimizes THREEWAY_SQUARE(start, low index, high index, end) is kept.

This method seems to be better than wx's for almost all cases except in-order traversal. I'm not actually sure what the complexity here would be; I think its still O(n) for random access.

Tyler Wade GSoC 2014

Sunday, June 29, 2014

UTF-8 Progress Update and Examples of Index Caching

No comments:

Post a Comment