Sunday, June 29, 2014

UTF-8 Progress Update and Examples of Index Caching

Progress update:  I'm still working on replacing PyPy's unicode implementation with a UTF-8 based one.  Things are progressing more slowly than I'd hoped.  At this point I have most of the codecs working and most of the application-level unicode tests passing, but there are still a lot of little corner cases to fix, and I don't even want to think about regex right now.  That said, the summer is only about half over.

Since I wanted something interesting to write about, I've looked into how other implementations that use a UTF-8 representation handle index caching (if at all) and thought I'd share what I found.

Relatively few languages with built-in unicode support use a UTF-8 representation; the majority seem to prefer UTF-16, for better or for worse.  The UTF-8 implementations I found worth mentioning are those of Perl and wxWidgets.

wxWidgets by default uses UTF-16 or UTF-32, depending on the size of wchar_t, but it has an option to use UTF-8.  When UTF-8 is enabled, wxWidgets uses an extremely simple index caching system: it caches the byte position of the last accessed character.  When indexing a string, it starts searching from the cached byte position if the index being looked up is higher than the cached index; otherwise it searches from the start of the string.  It never searches backwards.  The idea is to make in-order traversal -- like a typical for(int i = 0; i < s.size(); i++) loop -- O(n) overall.  Unfortunately, backwards traversal is still O(n^2) and random access is still O(n).
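Here's a minimal sketch, in Python, of how I understand the wx scheme; the class and its details are my own illustration, not wxWidgets code:

class WxStyleUTF8String(object):
    def __init__(self, data):
        self.data = bytearray(data)  # UTF-8 encoded bytes
        self.cached_index = 0        # code point index of the last access
        self.cached_byte = 0         # byte position of the last access

    def byte_pos(self, index):
        # Resume from the cache only when moving forwards; otherwise
        # restart from the beginning -- the search never goes backwards.
        if index >= self.cached_index:
            i, pos = self.cached_index, self.cached_byte
        else:
            i, pos = 0, 0
        while i < index:
            pos += 1
            # Skip UTF-8 continuation bytes (0b10xxxxxx).
            while pos < len(self.data) and self.data[pos] & 0xC0 == 0x80:
                pos += 1
            i += 1
        self.cached_index = index
        self.cached_byte = pos
        return pos

With this, each step of a forward traversal does O(1) amortized work, while any access behind the cached position pays for a full walk from the start.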

Perl's index caching is much more elaborate.  Perl caches two character/byte-index pairs.  When indexing a string, the search starts from the closest of the cached indices, the start, or the end of the string.  It can count backwards as well as forwards through the string, on the assumption that counting backwards is approximately twice as slow as counting forwards.  A sketch of that choice is below.
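Something like this (a plain-Python sketch of my understanding, with hypothetical names, not Perl's actual code) picks where to start counting from:

def closest_starting_point(target, cached, length):
    # Each candidate is a (position, cost) pair.  Counting forwards costs
    # the distance in characters; counting backwards is assumed to be
    # twice as slow, so backwards distances are weighted double.
    low, high = cached
    candidates = [(0, target), (length, 2 * (length - target))]
    for anchor in (low, high):
        if target >= anchor:
            candidates.append((anchor, target - anchor))
        else:
            candidates.append((anchor, 2 * (anchor - target)))
    return min(candidates, key=lambda pair: pair[1])[0]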

Perl's method for selecting the cached indices is also more sophisticated than wx's.  Initially, the first two indices accessed are used as the cached pair.  After that, whenever a new index is accessed, Perl considers replacing the current pair with one of the two pairs formed from the new index and one of the old indices.  It chooses by computing, for each of the 3 candidate pairs, the sum of the squared distances between the start of the string, the lower index, the higher index, and the end of the string (minimizing this sum is equivalent to minimizing their root-mean-square).  That's probably as clear as mud, so maybe this excerpt from the Perl source will help:
#define THREEWAY_SQUARE(a,b,c,d) \
    ((float)((d) - (c))) * ((float)((d) - (c))) \
    + ((float)((c) - (b))) * ((float)((c) - (b))) \
    + ((float)((b) - (a))) * ((float)((b) - (a)))

The pair that minimizes THREEWAY_SQUARE(start, low index, high index, end) is kept.

This method seems to be better than wx's in almost all cases except in-order traversal.  I'm not actually sure what the complexity here would be; I think it's still O(n) for random access.
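Translated into Python, the replacement policy might look something like this (again a sketch of my understanding, not Perl's actual logic):

def threeway_square(a, b, c, d):
    # Direct translation of the macro above: the sum of the squared
    # gaps between consecutive anchor points.
    return float(d - c) ** 2 + float(c - b) ** 2 + float(b - a) ** 2

def choose_cached_pair(cached, new_index, length):
    # cached is an ascending (low, high) pair of character indices.
    low, high = cached
    candidates = [
        (low, high),                       # keep the existing pair
        tuple(sorted((low, new_index))),   # new index replaces the high anchor
        tuple(sorted((high, new_index))),  # new index replaces the low anchor
    ]
    return min(candidates,
               key=lambda p: threeway_square(0, p[0], p[1], length))

Keeping the pair that minimizes the measure tends to spread the two anchors evenly across the string, so no index is ever very far from an anchor, the start, or the end.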

Sunday, June 15, 2014

Beginning UTF-8 Implementation

Over the last week and a half, I've begun working on implementing UTF-8 based unicodes in PyPy.

As I've stated in a previous blog post, my goal is to replace the use of the RPython unicode type in PyPy with a PyPy-specific UTF-8 implementation.  The motivation is to simplify PyPy's unicode support by providing a single implementation and application-level interface irrespective of the platform.  Presently, the size of the system's wchar_t type determines both the internal representation and the application-level behaviour.  The new implementation will represent unicode strings internally as an RPython string of bytes, but the user will interact with the application-level unicode string on a code point basis.  In other words, the user will have a "UTF-32" interface.
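To make the costs concrete, here is a minimal sketch (assumptions and names mine, not PyPy's actual code) of how a code point index resolves to a byte position in a UTF-8 string, by walking from the start:

def codepoint_to_byte_index(data, index):
    # Assumes data is valid UTF-8.  Indexing a bytearray yields integer
    # byte values, which keeps the comparisons simple.
    data = bytearray(data)
    pos = 0
    for _ in range(index):
        byte = data[pos]
        if byte < 0x80:      # 1-byte sequence (ASCII)
            pos += 1
        elif byte < 0xE0:    # 2-byte sequence (lead byte 110xxxxx)
            pos += 2
        elif byte < 0xF0:    # 3-byte sequence (lead byte 1110xxxx)
            pos += 3
        else:                # 4-byte sequence (lead byte 11110xxx)
            pos += 4
    return pos

Every application-level indexing operation bottoms out in a walk like this unless something smarter, like the index caching described above, is layered on top.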

Since Python supports converting unicode objects to UTF-8 byte strings and vice versa, RPython helper functions already exist for much of the UTF-8 handling, and I am largely reusing them with minimal modifications.  As a result, most of my work is a matter of fixing the various places in PyPy where the built-in/RPython unicode type is expected and replacing unicode literals.  That said, I plan on leaving unicode literals in (interpreter-level) tests in place: replacing them serves little useful purpose and would make the tests more annoying to read and write.

For the time being I'm not paying much attention to optimization (with the exception of a special case for ASCII-only strings).  Things will be slow for a little while; before any optimizations are added, random access will be O(n).  However, the goal is correctness first, and this won't be merged if it creates any serious performance regressions.  That said, I already have some ideas for optimizations, and hopefully I'll be far enough along in my work to be able to write more about those next time.
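The ASCII-only special case is worth a sentence: when every code point in a string is ASCII, each one occupies exactly one byte, so the code point index and the byte index coincide and random access is O(1).  A hypothetical sketch, building on the walk above:

def byte_index(data, index, is_ascii):
    # is_ascii would be a flag computed once, when the string is built.
    if is_ascii:
        return index  # one byte per code point
    return codepoint_to_byte_index(data, index)  # the O(n) general case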

Sunday, June 1, 2014

Bytearray status update

I'm almost done with fixing the complexity issues with bytearrays.

One of the challenges in doing this was that bytearrays support most of the same operations that strings do in Python, except that they also accept arbitrary objects supporting the buffer protocol as a second (or third) operand.  For example:

>>> 'abc' + memoryview('def')
Traceback (most recent call last):
  File "", line 1, in 
TypeError: cannot concatenate 'str' and 'memoryview' objects
>>> bytearray('abc') + memoryview('def')
bytearray(b'abcdef')
>>> from string import maketrans
>>> t = memoryview(maketrans('abc', '123'))
>>> 'abc'.translate(t)
Traceback (most recent call last):
  File "", line 1, in 
TypeError: expected a character buffer object
>>> bytearray('abc').translate(t)
bytearray(b'123')

To deal with this I added support for __getitem__ and __len__ (and a couple of other magic methods) to RPython, and did some refactoring so that the same code that handles string operands can also handle buffer operands.  There are still a few places where the parameters of bytearray methods are copied to strings, but the important thing is that the bytearray itself is no longer copied unnecessarily.  I'll clean up those few places over the next day or so.
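To illustrate the idea (a plain-Python sketch with hypothetical names; the real refactoring is in RPython), an operation written against only __len__ and __getitem__ works on string and buffer operands alike, with no copying:

def _byte(item):
    # Normalize an indexed item to an integer byte value: bytearray
    # indexing yields ints, while str and memoryview (on Python 2)
    # yield length-1 strings.
    return item if isinstance(item, int) else ord(item)

def generic_find(haystack, needle, start=0):
    # Naive substring search over anything supporting __len__ and
    # __getitem__: str, bytearray, memoryview, buffer, ...
    n, m = len(haystack), len(needle)
    for i in range(start, n - m + 1):
        for j in range(m):
            if _byte(haystack[i + j]) != _byte(needle[j]):
                break
        else:
            return i
    return -1

For example, generic_find(bytearray('abcdef'), memoryview('cd')) returns 2 without ever materializing either operand as a new string.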

Here are some examples that demonstrate the performance improvements I've been working on:

% pypy --version
Python 2.7.6 (394146e9bb67, May 08 2014, 16:45:59)
[PyPy 2.3.0 with GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)]
% ./pypy-c --version
Python 2.7.6 (d91c375d8356+, Jun 01 2014, 19:59:07)
[PyPy 2.4.0-alpha0 with GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)]

% pypy -m timeit -s "data = bytearray('1' * 1000000)" "data + data"
100 loops, best of 3: 10.5 msec per loop
% ./pypy-c -m timeit -s "data = bytearray('1' * 1000000)" "data + data"
1000 loops, best of 3: 669 usec per loop
% python -m timeit -s "data = bytearray('1' * 1000000)" "data + data"
10000 loops, best of 3: 102 usec per loop

% pypy -m timeit -s "data = bytearray('1' * 1000000)" "data[0:10]"
1000 loops, best of 3: 827 usec per loop
% ./pypy-c -m timeit -s "data = bytearray('1' * 1000000)" "data[0:10]"
100000000 loops, best of 3: 0.00963 usec per loop
% python -m timeit -s "data = bytearray('1' * 1000000)" "data[0:10]"
10000000 loops, best of 3: 0.13 usec per loop

% pypy -m timeit -s "data = bytearray(''.join(str(i) for i in range(1000)))" "data.index('32')"
1000000 loops, best of 3: 1.57 usec per loop
% ./pypy-c -m timeit -s "data = bytearray(''.join(str(i) for i in range(1000)))" "data.index('32')"
10000000 loops, best of 3: 0.183 usec per loop
% python -m timeit -s "data = bytearray(''.join(str(i) for i in range(1000)))" "data.index('32')"
1000000 loops, best of 3: 0.216 usec per loop

As you can see, the improvement (in these admittedly very artificial examples) is dramatic.  Things are still slower than CPython. I suspect that has to do with CPython being able to work with the underlying array directly.  I doubt these differences will be nearly so pronounced for real-world workloads.

Tomorrow, I'll look to get my work merged.  Early next week I'll get started on working on the unicode implementation.  More about that next time.