Thursday, February 25, 2016

C-API Support update

As you know, PyPy can emulate the CPython C API to some extent. In this post I will describe an important optimization that we merged to improve the performance and stability of the C-API emulation layer.

The C-API is implemented by passing around PyObject * pointers in the C code. The problem with providing the same interface with PyPy is that objects don't natively have the same PyObject * structure at all; and additionally their memory address can change. PyPy handles the difference by maintaining two sets of objects. More precisely, starting from a PyPy object, it can allocate on demand a PyObject structure and fill it with information that points back to the original PyPy objects; and conversely, starting from a C-level object, it can allocate a PyPy-level object and fill it with information in the opposite direction.

I have merged a rewrite of the interaction between C-API C-level objects and PyPy's interpreter level objects. This is mostly a simplification based on a small hack in our garbage collector. This hack makes the garbage collector aware of the reference-counted PyObject structures. When it considers a pair consisting of a PyPy object and a PyObject, it will always free either none or both of them at the same time. They both stay alive if either there is a regular GC reference to the PyPy object, or the reference counter in the PyObject is bigger than zero.

This gives a more stable result. Previously, a PyPy object might grow a corresponding PyObject, loose it (when its reference counter goes to zero), and later have another corresponding PyObject re-created at a different address. Now, once a link is created, it remains alive until both objects die.

The rewrite significantly simplifies our previous code (which used to be based on at least 4 different dictionaries), and should make using the C-API somewhat faster (though it is still slower than using pure python or cffi).

A side effect of this work is that now PyPy actually supports the upstream lxml package---which is is one of the most popular packages on PyPI. (Specifically, you need version 3.5.0 with this pull request to remove old PyPy-specific hacks that were not really working. See details.) At this point, we no longer recommend using the cffi-lxml alternative: although it may still be faster, it might be incomplete and old.

We are actively working on extending our C-API support, and hope to soon merge a branch to support more of the C-API functions (some numpy news coming!). Please try it out and let us know how it works for you.

Armin Rigo and the PyPy team