Monday, August 11, 2014

The final status update:  PyPy with utf-8 unicodes translates and, with the exception of one last test that is still failing, works correctly (or at least appears to.)  Fixing translation issues was easier than I thought it would be, so I most spent the last two weeks fixing bugs that resulted from differences between the behavior of the translated and untranslated code.  For example, although RPython supports using __len__ with the built-in len() function, it doesn't use __len__ when evaluating a object as a boolean.  As a result, there were a number of locations where the compiled interpreter would seg-fault or hang even though the untranslated code worked correctly.

There's still a few of things left to do in the last week; the aforementioned test being one of them.  Unfortunately, I doubt that there will time for my branch to be reviewed and merged before the summer is over, but I think I'll definitely finish the work I have left to do.


Monday, July 28, 2014

Unfortunately, I don't really have much progress to report.  I've been working fixing translation the last week or so.

I've spent an embarrassing amount of time working on supporting the == operator on Utf8Str (which is the class that wraps a utf8-encoded string internally.)  RPython generally doesn't support magic methods, but I had already added support for magic methods such as __getitem__ and __len__, so I decided to add support for __eq__.  Although I did have the option of just using an 'equals' method (like one would do in Java) rather than use the == operator, being able to use a Utf8Str as if it were a regular string is ideal.

The problem with implementing support for a method like __eq__ is that RPython is much more static than Python.  Whether or not to call __eq__ or to use the built-in equality operator needs to be determined at translation time rather than at run time.  Consider the following example:

class Base(object):
    pass

class Derived(Base):
    def __eq__(self, other):
        return True

def f(a):
    if a:
        obj = Base()
    else:
        obj = Derived()

    return obj == ''

When translating a function such as f, the RPython type annotator reduces the type of the variable obj to the common base class.  This means that when considering the statement ``obj == ''``, the type annotator will consider obj to be an instance of Base and won't consider that an __eq__ method may be defined in some cases.  After translation, Derived.__eq__ would never be called.  My solution was to raise an error anytime a class is encountered that defines __eq__ and has a base class which doesn't.  That is to say, the class at the root of the hierarchy must define __eq__ if any of it's subclasses are to.

Sunday, July 13, 2014

Status update

Unfortunately, I don't have much interesting to talk about this week. Lately I've been working on fixing the parts of the built-in modules that were broken by moving to UTF-8.  Most of the issues stem from the existing code expecting each unicode character to map cleanly onto one wchar_t.  Also, since the RPython library for interacting with native code, known as "rffi", assumes that you want to use RPython unicode objects when manipulating arrays of wchar_t, some extra massaging has to be done to cast to the appropriate int type.  Working on this, I finally had to set a Windows VM to test 16-bit wchar_t, which I managed to avoid until now somehow.

When I finish the modules, I'll be able to start working on making everything translate.

Sunday, June 29, 2014

UTF-8 Progress Update and Examples of Index Caching

Progress update:  I'm still working on replace unicodes in Pypy with a UTF-8 implementation.  Things are progressing more slowly than I'd hoped they might.  At this point I have most of the codecs working and most of the application-level unicode tests passing, but there are still a lot of little corner cases to fix and I don't even want to think about regex right now.  That said, the summer is only about half over.

Since I wanted to have something interesting to write about, I've done some looking into how other implementations that use a UTF-8 representation handle index caching (if at all) and thought I'd share what I've found.

Relatively few languages with built-in unicode support use a UTF-8 representation.  The majority seem to prefer UTF-16, for better or for worse.  The implementations that use UTF-8 I found which are worth mentioning are Perl and wxWidgets.

wxWidgets by default uses UTF-16 or UTF-32 depending on the size of wchar_t, but it has an option to using UTF-8.  When UTF-8 is enabled, wxWidgets uses an extremely simple index caching system.  It caches the byte position of the last accessed character.  When indexing a string, it starts searching from the cached byte position if the index being looked up is higher than the cached index, otherwise it searches from the start of the string.  It will never search backwards. The idea here is to make in-order traversal -- like a typical for(int i = 0; i < s.size(); i++) loop -- O(n). Unfortunately, a backwards traversal is still O(n^2) and random access is still O(n).

Perl's index caching is much more elaborate.  Perl caches two character/byte-index pairs.  When indexing a string, the search will start from the closest of the cached indices, the start, or the end of the string. It can count backwards as well as forwards through the string, assuming that counting backwards is approximately twice as slow as counting forward.

Perl's method for selecting the cached indices is also more sophisticated than wx's. Initially, the first two indices accessed are used for the cached indices. After the first two are cached, when a new index is accessed, it considers replacing the current two cached indices with one of the two possible pairs of one of the old indices and and the new index.  It does so by selecting the root-mean-square distance between the start of the string, the first index, the second index and the end of the string for each the 3 possible pairs of indices.  That's probably as clear as mud, so this maybe this expert from the Perl source will help:0
#define THREEWAY_SQUARE(a,b,c,d) \
            ((float)((d) - (c))) * ((float)((d) - (c))) \
            + ((float)((c) - (b))) * ((float)((c) - (b))) \
               + ((float)((b) - (a))) * ((float)((b) - (a)))

The pair that minimizes THREEWAY_SQUARE(start, low index, high index, end) is kept.

This method seems to be better than wx's for almost all cases except in-order traversal. I'm not actually sure what the complexity here would be;  I think its still O(n) for random access.

Sunday, June 15, 2014

Beginning UTF-8 Implementation

Over the last week and a half, I've begun working on implementing UTF-8 based unicodes in PyPy.

As I've stated in a previous blog post, my goal is to replace the use of the RPython unicode type in PyPy with a PyPy-specific UTF-8 implementation.  The motivation for doing this is to simplify PyPy's unicode support by providing a single implementation and application-level interface irrespective of the platform.  Presently, the size of the system's wchar_t type determines which representation is used, from both perspectives.  The new implementation will represent the unicode strings internally as an RPython string of bytes, but the user will interact with the application-level unicode string on a code point basis.  In other words, the user will have a "UTF-32" interface.

Since Python supports converting unicode objects into UTF-8 byte strings and vice-versa, RPython helper functions already exist for a lot of the handling of UTF-8. I am largely reusing these with minimal modifications.  As a result, most of my work is a matter of fixing the various places in PyPy where the built-in/RPython unicode type is expected and replacing unicode literals.  That said, I plan on leaving unicode literals in (the interpreter-level) tests in place. Replacing them serves little useful purpose and would make the tests more annoying to read and to write.

For the time being I'm not paying very much attention to optimization (with the exception of special cases for ASCII only strings.)  Things will be slow for a little while; random access before any optimizations are added will be O(n).  However, the goal is correctness first and this won't be merged if it creates any serious performance regressions.  That said, I already have some ideas about optimizations and hopefully I'll be far enough along in my work to be able to write more about those next time.

Sunday, June 1, 2014

Bytearray status update

I'm almost done with fixing the complexity issues with bytearrays.

One of the challenges in doing this was that bytearrays support most of the same operations that strings do in Python, except they accept arbitrary objects that support the buffer protocol as a second (or third) operand.  For example:

>>> 'abc' + memoryview('def')
Traceback (most recent call last):
  File "", line 1, in 
TypeError: cannot concatenate 'str' and 'memoryview' objects
>>> bytearray('abc') + memoryview('def')
bytearray(b'abcdef')
>>> from string import maketrans
>>> t = memoryview(maketrans('abc', '123'))
>>> 'abc'.translate(t)
Traceback (most recent call last):
  File "", line 1, in 
TypeError: expected a character buffer object
>>> bytearray('abc').translate(t)
bytearray(b'123')

To deal with this I added support for __getitem__ and __len__ (and a couple of other magic methods) to RPython and did some refactoring so that the same code could be used to handle buffer operands that is used to handle string operands. There are still a few places where the parameters of methods of bytearrays are still copied to strings, but the important thing is the bytearray itself isn't copied unnecessarily any more.  I'll clean up those few places over the next day or so.

Here are some examples that demonstrate the preformance improvements I have been working on:

% pypy --version
Python 2.7.6 (394146e9bb67, May 08 2014, 16:45:59)
[PyPy 2.3.0 with GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)]
% ./pypy-c --version
Python 2.7.6 (d91c375d8356+, Jun 01 2014, 19:59:07)
[PyPy 2.4.0-alpha0 with GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)]

% pypy -m timeit -s "data = bytearray('1' * 1000000)" "data + data"
100 loops, best of 3: 10.5 msec per loop
% ./pypy-c -m timeit -s "data = bytearray('1' * 1000000)" "data + data"
1000 loops, best of 3: 669 usec per loop
% python -m timeit -s "data = bytearray('1' * 1000000)" "data + data"
10000 loops, best of 3: 102 usec per loop

% pypy -m timeit -s "data = bytearray('1' * 1000000)" "data[0:10]"
1000 loops, best of 3: 827 usec per loop
% ./pypy-c -m timeit -s "data = bytearray('1' * 1000000)" "data[0:10]"
100000000 loops, best of 3: 0.00963 usec per loop
% python -m timeit -s "data = bytearray('1' * 1000000)" "data[0:10]"
10000000 loops, best of 3: 0.13 usec per loop

% pypy -m timeit -s "data = bytearray(''.join(str(i) for i in range(1000)))" "data.index('32')"
1000000 loops, best of 3: 1.57 usec per loop
% ./pypy-c -m timeit -s "data = bytearray(''.join(str(i) for i in range(1000)))" "data.index('32')"
10000000 loops, best of 3: 0.183 usec per loop
% python -m timeit -s "data = bytearray(''.join(str(i) for i in range(1000)))" "data.index('32')"
1000000 loops, best of 3: 0.216 usec per loop

As you can see, the improvement (in these admittedly very artificial examples) is dramatic.  Things are still slower than CPython. I suspect that has to do with CPython being able to work with the underlying array directly.  I doubt these differences will be nearly so pronounced for real-world workloads.

Tomorrow, I'll look to get my work merged.  Early next week I'll get started on working on the unicode implementation.  More about that next time.

Sunday, May 18, 2014

Introduction

About me:

I am from Topeka, Kansas. I just finished a bachelor's degree in computer science and will start working on a Master's degree in the fall.
This is the second year I have been selected to participate in a GSoC project. Last year I worked on creating a cffi-port of wxPython, which I, unfortunately, was not able to finish. This summer I will be working on another project with PyPy.

Unfortunately, classes only ended last Friday for me, so I did not have of an opportunity to do much for this project during the community bonding period.

About the project:

The title of my project is: Improvements for Bytearrays and Unicode Strings in PyPy. The project has two separate parts (as you could probably guess from the title):

The first part of my project will be to fix bytearrays in PyPy. Currently, the complexity of many of the operations on bytearrays are incorrect. At some point, the str, unicode, and bytearray implementations were refactored so they shared more code. Unfortunately, this caused a number of operations on bytearrays to create an RPython string representation of the array, which requires making a copy (bytearrays being mutable and RPython strings being immutable.) This was included in my project since it was something that really needed to be fixed and to help make sure the project would fill the entire summer.  This is what I will be working on first.

The more significant part of my project will be to convert PyPy's unicode implementation to UTF-8.  Currently, unicode objects in PyPy internally use RPython unicode objects, which use either UTF-16 or UTF-32 depending on the size of the platform's wchar_t. The idea is to simplify the implementation somewhat by implementing only one representation while providing the same application-level interface, independent of the platform.  I'll have more to write about this when I start working on it.