Wednesday, February 25, 2015

Experiments in Pyrlang with RPython

Pyrlang is an Erlang BEAM bytecode interpreter written in RPython.

It implements approximately 25% of BEAM instructions. It can support integer calculations (but not bigint), closures, exception handling, some operators to atom, list and tuple, user modules, and multi-process in single core. Pyrlang is still in development.

There are some differences between BEAM and the VM of PyPy:

  • BEAM is a register-based VM, whereas the VM in PyPy is stack-based.
  • There is no traditional call-stack in BEAM. The Y register in BEAM is similar to a call-stack, but the Y register can sometimes store some variables.
  • There are no typical language-level threads and OS-level threads in BEAM; only language-level processes, whose behavior is very similar to the actor model.

Regarding bytecode dispatch loop, Pyrlang uses a while loop to fetch instructions and operands, call the function corresponding to every instruction, and jump back to the head of the while loop. Due to the differences between the RPython call-stack and BEAM’s Y register, we decided to implement and manage the Y register by hand. On the other hand, PyPy uses RPython’s call stack to implement Python’s call stack. As a result, the function for the dispatch loop in PyPy calls itself recursively. This does not happen in Pyrlang.

The Erlang compiler (erlc) usually compiles the bytecode instructions for function invocation into CALL (for normal invocation) and CALL_ONLY (for tail recursive invocation). You can use a trampoline semantic to implement it:

  • CALL instruction: The VM pushes the current instruction pointer (or called-program counter in PyPy) to the Y register, and jumps to the destination label. When encountering a RETURN instruction, the VM pops the instruction pointer from the Y register and returns to the location of the instruction pointer to continue executing the outer function.
  • CALL_ONLY instruction: The VM simply jumps to the destination label, without any modification of the Y register. As a result, the tail recursive invocation never increases the Y register.

The current implementation only inserts the JIT hint of can_enter_jit following the CALL_ONLY instruction. This means that the JIT only traces the tail-recursive invocation in Erlang code, which has a very similar semantic to the loop in imperative programming languages like Python.

We have also written a single scheduler to implement the language level process in a single core. There is a runable queue in the scheduler. On each iteration, the scheduler pops one element (which is a process object with dispatch loop) from the queue, and executes the dispatch loop of the process object. In the dispatch loop, however, there is a counter-call “reduction” inside the dispatch loop. The reduction decrements during the execution of the loop, and when the reduction becomes 0, the dispatch loop terminates. Then the scheduler pushes that element into the runable queue again, and pops the next element for the queue, and so on.

We are planning to implement a multi-process scheduler for multi-core CPUs, which will require multiple schedulers and even multiple runable queues for each core, but that will be another story. :-)

Methods

We wrote two benchmark programs of Erlang:

  • FACT: A benchmark to calculate the factorial in a tail-recursive style, but because we haven’t implemented big int, we do a remainder calculation to the argument for the next iteration, so the number never overflows.
  • REVERSE: The benchmark creates a reversed list of numbers, such as [20000, 19999, 19998, …], and applies a bubble sort to it.

Results

The Value of Reduction

We used REVERSE to evaluate the JIT with different values of reduction:

The X axis is the value of reduction, and the Y axis is the execution time (by second).

It seems that when the value of reduction is small, the reduction influences the performance significantly, but when reduction becomes larger, it only increases the speed very slightly. In fact, we use 2000 as the default reduction value (as well as the reduction value in the official Erlang interpreter).

Surprisingly, the trace is always generated even when the reduction is very small, such as 0, which means the dispatch loop can only run for a very limited number of iterations, and the language level process executes fewer instructions than an entire loop in one switch of the scheduler). The generated trace is almost the same, regardless of different reduction values.

Actually, the RPython JIT only cares what code it meets, but does not care who executes it, thus the JIT always generates the results above. The trace even can be shared among different threads if they execute the same code.

The overhead at low reduction value may be due to the scheduler, which switches from different processes too frequently, or from the too-frequent switching between bytecode interpreter and native code, but not from JIT itself.

Here is more explanation from Armin Rigo:

“The JIT works well because you’re using a scheme where some counter is decremented (and the soft-thread interrupted when it reaches zero) only once in each app-level loop. The soft-thread switch is done by returning to some scheduler, which will resume a different soft-thread by calling it. It means the JIT can still compile each of the loops as usual, with the generated machine code containing the decrease-and-check-for-zero operation which, when true, exits the assembler."

Fair Process Switching vs. Unfair Process Switching

We are also concerned about the timing for decreasing reduction value. In our initial version of Pyrlang, we decrease reduction value at every local function invocation, module function invocation, and BIF (built-in function) invocation, since this is what the official Erlang interpreter does. However, since the JIT in RPython basically traces the target language loop (which is the tail recursive invocation in Pyrlang) it is typically better to keep the loop whole during a switch of the language level process. We modified Pyrlang, and made the reduction decrement only occur after CALL_ONLY, which is actually the loop boundary of the target language.

Of course, this strategy may cause an “unfair” execution among language level processes. For example, if one process has only a single long-sequence code, it executes until the end of the code. On the other hand, if a process has a very short loop, it may be executed by very limited steps then be switched out by the scheduler. However, in the real world, this “unfairness” is usually considered acceptable, and is used in many VM implementations including PyPy for improving the overall performance.

We compared these two versions of Pyrlang in the FACT benchmark. The reduction decrement is quite different because there are some BIF invocations inside the loop. In the old version the process can be suspended at loop boundaries or other function invocation, but in the new version, it can be suspended only at loop boundaries.

We show that the strategy is effective, removing around 7% of the overhead. We have also compared it in REVERSE, but since there are no extra invocations inside the trace, it cannot provide any performance improvement. In the real world, we believe there is usually more than one extra invocation inside a single loop, so this strategy is effective for most cases.

Comparison with Default Erlang and HiPE

We compared the performance of Pyrlang with the default Erlang interpreter and the HiPE (High Performance Erlang) complier. HiPE is an official Erlang compiler that can compile Erlang source code to native code. The speed of Erlang programs obviously improves but loses its generality instead.

Please note that Pyrlang is still in development, so in some situations it does less work than the default Erlang interpreter, such as not checking integer overflow when dealing with big integer, and not checking and adding locks when accessing message queues in the language-level process, so is therefore faster. The final version of Pyrlang may be slower.

We used the two benchmark programs above, and made sure both of them are executed for more than five seconds to cover the JIT warm-up time for RPython. The experiment environment is a OS X 10.10 machine with 3.5GHZ 6-core Intel Xeon E5 CPU and 14GB 1866 MHz DDR3 ECC memory.

Let’s look at the result of FACT. The graph shows that Pyrlang runs 177.41% faster on average than Erlang, and runs at almost the same speed as HiPE. However, since we haven’t implemented big integer in Pyrlang, the arithmetical operators do not do any extra overflow checking. It is reasonable that the final version for Pyrlang will be slower than the current version and HiPE.

As for REVERSE, the graph shows that Pyrlang runs 45.09% faster than Erlang, but 63.45% slower than HiPE on average. We think this is reasonable because there are only few arithmetical operators in this benchmark so the speeds of these three implementations are closer. However, we observed that at the scale of 40,000, the speed of Pyrlang slowed down significantly (111.35% slower than HiPE) compared with the other two scales (56.38% and 22.63% slower than HiPE).

Until now we can only hypothesize why Pyrlang slows down at that scale. We guess that the overhead might be from GC. This is because the BEAM bytecode provides some GC hints to help the default Erlang compiler to perform some GC operations immediately. For example, using GC_BIF instead of a BIF instruction tells the VM that there may be a GC opportunity, and tells the VM how many live variables should be around one instruction. In Pyrlang we do not use these kinds of hints but rely on RPython’s GC totally. When there are a huge number of objects during runtime, (as for REVERSE, it should be the Erlang list object) the speed therefore slows down.

Ruochen Huang

Monday, February 23, 2015

linalg support in pypy/numpy

Introduction

PyPy's numpy support has matured enough that it can now support the lapack/blas libraries through the numpy.linalg module. To install the version of numpy this blog post refers to, install PyPy version 2.5.0 or newer, and run this:

pypy -m pip install git+https://bitbucket.org/pypy/numpy.git

This update is a major step forward for PyPy's numpy support. Many of the basic matrix operations depend on linalg, even matplotlib requires it to display legends (a pypy-friendly version of matplotlib 1.3 is available at https://github.com/mattip/matplotlib).

A number of improvements and adaptations, some of which are in the newly-released PyPy 2.5.0, made this possible:
  • Support for an extended frompyfunc(), which in the PyPy version supports much of the ufunc API (signatures, multiple dtypes) allowing creation of pure-python, jit-friendly ufuncs. An additional keyword allows choosing between out = func(in) or func(in, out) ufunc signatures. More explanation follows.
  • Support for GenericUfuncs via PyPy's (slow) capi-compatibility layer. The underlying mechanism actually calls the internal implementation of frompyfunc().
  • A cffi version of _umath_linalg. Since cffi uses dlopen() to call into shared objects, we added support in the numpy build system to create non-python shared libraries from source code in the numpy tree. We also rewrote parts of the c-based _umath_linalg.c.src in python, renamed numpy's umath_linalg capi module to umath_linag_capi, and use it as a shared object through cffi.

Status

We have not completely implemented all the linalg features. dtype resolution via casting is missing, especially for complex ndarrays, which leads to slight numerical errors where numpy uses a more precise type for intermediate calculations. Other missing features in PyPy's numpy support may have implications for complete linalg support.

Some OSX users have noticed they need to update pip to version 6.0.8 to overcome a regression in pip, and it is not clear if we support all combinations of blas/lapack implementations on all platforms.

Over the next few weeks we will be ironing out these issues.

Performance

A simple benchmark is shown below, but let's state the obvious: PyPy's JIT and the iterators built into PyPy's ndarray implementation will in most cases be no faster than CPython's numpy. The JIT can help where there is a mixture of python and numpy-array code. We do have plans to implement lazy evaluation and to further optimize PyPy's support for numeric python, but numpy is quite good at what it does.

HowTo for PyPy's extended frompyfunc

The magic enabling blas support is a rewrite of the _umath_linalg c-based module as a cffi-python module that creates ufuncs via frompyfunc. We extended the numpy frompyfunc to allow it to function as a replacement for the generic ufunc available in numpy only through the c-api.

We start with the basic frompyfunc, which wraps a python function into a ufunc:
 
def times2(in0):
    return in0 * 2
ufunc = frompyfunc(times2, 1, 1)

In cpython's numpy the dtype of the result is always object, which is not implemented (yet) in PyPy, so this example will fail. While the utility of object dtypes can be debated, in the meantime we add a non-numpy-compatible keyword argument dtypes to frompyfunc. If dtype=['match'] the output dtype will match the dtype of the first input ndarray:

ufunc = frompyfunc(times2, 1, 1, dtype=['match'])
ai = arange(24).reshape(3, 4, 2)
ao = ufunc(ai)
assert  (ao == ai * 2).all()

I hear you ask "why is the dtypes keyword argument a list?" This is so we can support the Generalized Universal Function API, which allows specifying a number of specialized functions and the input-output dtypes each specialized function accepts.
Note that the function feeds the values of ai one at a time, the function operates on scalar values. To support more complicated ufunc calls, the generalized ufunc API allows defining a signature, which specifies the layout of the ndarray inputs and outputs. So we extended frompyfunc with a signature keyword as well.
We add one further extension to frompyfunc: we allow a Boolean keyword stack_inputs to specify the argument layout of the function itself. If the function is of the form:
 
out0, out1, ... = func(in0, in1,...)

then stack_inputs is False. If it is True the function is of the form:
 
func(in0, in1, ... out0, out1, ...)

Here is a complete example of using frompyfunc to create a ufunc, based on this link:
 
def times2(in_array, out_array):
    in_flat = in_array.flat
    out_flat = out_array.flat
    for i in range(in_array.size):
        out_flat[i] = in_flat[i] * 2
ufunc = frompyfunc([times2, times2], 1, 1,
                signature='(i)->(i)',
                dtypes=[dtype(int), dtype(int),
                        dtype(float), dtype(float),
                       ],
                stack_inputs=True,
                )
ai = arange(10, dtype=int)
ai2 = ufunc(ai)
assert all(ai2 == ai * 2)

Using this extended syntax, we rewrote the lapack calls into the blas functions in pure python, no c needed. Benchmarking this approach actually was much slower than using the upstream umath_linalg module via cpyext, as can be seen in the following benchmarks. This is due to the need to copy c-aligned data into Fortran-aligned format. Our __getitem__ and __setitem__ iterators are not as fast as pointer arithmetic in C. So we next tried a hybrid approach: compile and use numpy's umath_linalg python module as a shared object, and call the optimized specific wrapper function from it.

Benchmarks

Here are some benchmarks, running a tight loop of the different versions of linalg.inv(a), where a is a 10x10 double ndarray. The benchmark ran on an i7 processor running ubuntu 14.04 64 bit:
Impl. Time after warmup
CPython 2.7 + numpy 1.10.dev + lapack 8.9 msec/1000 loops
PyPy 2.5.0 + numpy + lapack via cpyext 8.6 msec/1000 loops
PyPy 2.5.0 + numpy + lapack via pure python + cffi 19.9 msec/1000 loops
PyPy 2.5.0 + numpy + lapack via python + c + cffi 9.5 msec/1000 loops


While no general conclusions may be drawn from a single micro-benchmark, it does indicate that there is some merit in the approach taken.

Conclusion

PyPy's numpy now includes a working linalg module. There are still some rough corners, but hopefully we have implemented the parts you need. While the speed of the isolated linalg function is no faster than CPython and upstream numpy, it should not be significantly slower either. Your use case may see an improvement if you use a mix of python and lapack, which is the usual case.

Please let us know how it goes. We love to hear success stories too.

We still have challenges at all levels of programming,and are always looking for people willing to contribute, so stop by on IRC at #pypy.

mattip and the PyPy Team

Wednesday, February 11, 2015

NumPyPy status - January 2015

Hi Everyone

Here is what has been done in January thanks to the funding of NumPyPy, I would like to thank all the donors and tell you that you can still donate :
  • I have focused on implementing the object dtype this month, it is now possible to store objects inside ndarrays using the object dtype
  • It is also possible to add an object ndarray to any other ndarray (implementing other operators is trivial)
The next things I plan on working on next are :
  • Implementing the missing operations for object arrays
  • Implementing garbage collection support for object arrays (currently, storing an object inside an ndarray doesn't keep the object alive)
  • Packaging NumPyPy on PyPI
Cheers
Romain

Tuesday, February 3, 2015

PyPy 2.5.0 released

PyPy 2.5.0 - Pincushion Protea

We’re pleased to announce PyPy 2.5, which contains significant performance enhancements and bug fixes.
You can download the PyPy 2.5.0 release here:
We would like to thank our donors for the continued support of the PyPy project, and for those who donate to our three sub-projects, as well as our volunteers and contributors (10 new commiters joined PyPy since the last release). We’ve shown quite a bit of progress, but we’re slowly running out of funds. Please consider donating more, or even better convince your employer to donate, so we can finish those projects! The three sub-projects are:
  • Py3k (supporting Python 3.x): We have released a Python 3.2.5 compatible version
    we call PyPy3 2.4.0, and are working toward a Python 3.3 compatible version
  • STM (software transactional memory): We have released a first working version, and continue to try out new promising paths of achieving a fast multithreaded Python
  • NumPy which requires installation of our fork of upstream numpy, available on bitbucket

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It’s fast (pypy and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.
This release supports x86 machines on most common operating systems (Linux 32/64, Mac OS X 64, Windows, and OpenBSD), as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux.
While we support 32 bit python on Windows, work on the native Windows 64 bit python is still stalling, we would welcome a volunteer to handle that.

Highlights

  • The past months have seen pypy mature and grow, as rpython becomes the goto solution for writing fast dynamic language interpreters. Our separation of rpython and the python interpreter PyPy is now much clearer in the PyPy documentation and we now have separate RPython documentation.
  • We have improved warmup time as well as jitted code performance: more than 10% compared to pypy-2.4.0. We no longer zero-out memory allocated in the gc nursery by default, work that was started during a GSoC.
  • Passing objects between C and PyPy has been improved. We are now able to pass raw pointers to C (without copying) using pinning. This improves I/O; benchmarks that use networking intensively improved by about 50%. File() operations still need some refactoring but are already showing a 20% improvement on our benchmarks. Let us know if you see similar improvements.
  • Our integrated numpy support gained much of the GenericUfunc api in order to support the lapack/blas linalg module of numpy. This dovetails with work in the pypy/numpy repository to support linalg both through the (slower) cpyext capi interface and also via (the faster) pure python cffi interface, using an extended frompyfunc() api. We will soon post a seperate blog post specifically about linalg and PyPy.
  • Dictionaries are now ordered by default, see the blog post
  • Our nightly translations use –shared by default, including on OS/X and linux
  • We now more carefully handle errno (and GetLastError, WSAGetLastError) tying the handlers as close as possible to the external function call, in non-jitted as well as jitted code.
  • Issues reported with our previous release were resolved after reports from users on our issue tracker at https://bitbucket.org/pypy/pypy/issues or on IRC at #pypy.
We have further improvements on the way: rpython file handling, finishing numpy linalg compatibility, numpy object dtypes, a better profiler, as well as support for Python stdlib 2.7.9.
Please try it out and let us know what you think. We especially welcome success stories, we know you are using PyPy, please tell us about it!
Cheers
The PyPy Team