Monday, March 30, 2015

PyPy-STM 2.5.1 released

PyPy-STM 2.5.1 - Mawhrin-Skel

We're pleased to announce PyPy-STM 2.5.1, codenamed Mawhrin-Skel. This is the second official release of PyPy-STM. You can download this release here (64-bit Linux only):

http://pypy.org/download.html

Documentation:

http://pypy.readthedocs.org/en/latest/stm.html

PyPy is an implementation of the Python programming language which focuses on performance. So far we've been relentlessly optimizing for the single core/process scenario. PyPy STM brings to the table a version of PyPy that does not have the infamous Global Interpreter Lock, hence can run multiple threads on multiple cores. Additionally it comes with a set of primitives that make writing multithreaded applications a lot easier, as explained below (see TransactionQueue) and in the documentation.

Internally, PyPy-STM is based on the Software Transactional Memory plug-in called stmgc-c7. This version comes with a relatively reasonable single-core overhead but scales only up to around 4 cores on some examples; the next version of the plug-in, stmgc-c8, is in development and should address that limitation (as well as reduce the overhead). These versions only support 64-bit Linux; we'd welcome someone to port the upcoming stmgc-c8 to other (64-bit) platforms.

This release passes all regular PyPy tests, except for a few special cases. In other words, you should be able to drop in PyPy-STM instead of the regular PyPy and your program should still work. See current status for more information.

This work was done by Remi Meier and Armin Rigo. Thanks to all donors for crowd-funding the STM work so far! As usual, it took longer than we would have thought. I really want to thank the people that kept making donations anyway. Your trust is greatly appreciated!


What's new?

Compared to the July 2014 release, the main addition is a way to get reports about STM conflicts. This is an essential new feature.

To understand why this is so important, consider that if you already played around with the previous release, chances are that you didn't get very far. It probably felt like a toy: on very small examples it would nicely scale, but on any larger example it would not scale at all. You didn't get any feedback about why, but the underlying reason is that, in a typical large example, there are some STM conflicts that occur all the time and that won't be immediately found just by thinking. This prevents any parallelization.
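
To make the idea concrete, here is a made-up sketch (not taken from the docs) of the kind of pattern that systematically conflicts: the per-item work is independent, but every transaction also updates one shared object, so no two of them can commit in parallel.

    # Hypothetical example: do_real_work is a placeholder for your own code.
    stats = {'processed': 0}

    def process_item(item):
        do_real_work(item)          # independent part: parallelizes fine
        stats['processed'] += 1     # shared write: conflicts on every transaction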

Now PyPy-STM is no longer a black box: you have a way to learn about these conflicts, fix them, and try again. The tl;dr version is to run:

    PYPYSTM=stmlog ./pypy-stm example.py
    ./print_stm_log.py stmlog

More details in the STM user guide.


Performance

The performance is now more stable than it used to be. More precisely, the best case is still "25%-40% single-core slow-down with very good scaling up to 4 threads", but the average performance seems not too far from that. There are still dark spots --- notably, the JIT is still slower to warm up, though it was improved a lot. These are documented in the current status section. Apart from that, we should not get more than 2x single-core slow-down in the worst case. Please report such cases as bugs!


TransactionQueue

As explained before, PyPy-STM is more than "just" a Python without GIL. It is a Python in which you can do minor tweaks to your existing, non-multithreaded programs and get them to use multiple cores. You identify medium- or large-sized, likely-independent parts of the code and ask PyPy-STM to run these parts in parallel. An example would be every iteration of some outermost loop over all items of a dictionary. This is done with a new API: transaction.TransactionQueue(). See help(TransactionQueue) or read more about it in the STM user guide.
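
As a minimal sketch of that pattern (following the API described in the user guide; process_item, do_something_with and mydict are placeholders for your own code and data):

    import transaction

    def process_item(key, value):
        do_something_with(key, value)   # your mostly-independent per-item work

    tq = transaction.TransactionQueue()
    for key, value in mydict.items():
        tq.add(process_item, key, value)
    tq.run()   # executes the queued calls, spread over multiple threads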

This is not a 100% mechanical change: very likely, you need to hunt for and fix "STM conflicts" that prevent parallel execution (see docs). However, at all points your program runs correctly, and you can stop the hunt when you get acceptable performance. You don't get deadlocks or corrupted state.

Thanks for reading!
Armin, Remi, Fijal

Thursday, March 26, 2015

PyPy 2.5.1 Released

PyPy 2.5.1 - Pineapple Bromeliad

We’re pleased to announce PyPy 2.5.1, Pineapple Bromeliad, following on the heels of 2.5.0. You can download the PyPy 2.5.1 release here:

http://pypy.org/download.html

We would like to thank our donors for the continued support of the PyPy project, and those who donate to our three sub-projects, as well as our volunteers and contributors. We’ve shown quite a bit of progress, but we’re slowly running out of funds. Please consider donating more, or, even better, convince your employer to donate, so we can finish those projects! The three sub-projects are:
  • Py3k (supporting Python 3.x): We have released a Python 3.2.5 compatible version we call PyPy3 2.4.0, and are working toward a Python 3.3 compatible version

  • STM (software transactional memory): We have released a first working version, and continue to try out new promising paths of achieving a fast multithreaded Python

  • NumPy, which requires installation of our fork of upstream numpy, available on bitbucket
We would also like to encourage new people to join the project. PyPy has many layers and we need help with all of them: PyPy and RPython documentation improvements, tweaking popular modules to run on PyPy, or general help with making RPython’s JIT even better.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It’s fast (pypy and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.

This release supports x86 machines on most common operating systems (Linux 32/64, Mac OS X 64, Windows, and OpenBSD), as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux.

While we support 32-bit Python on Windows, work on the native Windows 64-bit Python is still stalled; we would welcome a volunteer to handle that.

Highlights

  • The past months have seen PyPy mature and grow, as RPython becomes the go-to solution for writing fast dynamic language interpreters. Our separation of RPython from the Python interpreter PyPy is now much clearer in the PyPy documentation and we now have separate RPython documentation. Tell us what still isn’t clear, or, even better, help us improve the documentation.
  • We merged version 2.7.9 of Python’s stdlib (see the short ssl example after this list). From the Python release notice:
    • The entirety of Python 3.4’s ssl module has been backported. See PEP 466 for justification.
    • HTTPS certificate validation using the system’s certificate store is now enabled by default. See PEP 476 for details.
    • SSLv3 has been disabled by default in httplib and its reverse dependencies due to the POODLE attack.
    • The ensurepip module has been backported, which provides the pip package manager in every Python 2.7 installation. See PEP 477.

  • The garbage collector now ignores parts of the stack which did not change since the last collection, another performance boost
  • errno and LastError are saved around cffi calls so things like pdb will not overwrite them
  • We continue to asymptotically approach a score of 7 times faster than CPython on our benchmark suite; we now rank 6.98 on the latest runs
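
As a small, hedged illustration of what the 2.7.9 stdlib merge buys you in practice (the URL below is just a placeholder), the backported ssl module (PEP 466) and default certificate validation (PEP 476) look like this:

    import ssl
    import urllib2

    # HTTPS certificates are now validated by default (PEP 476); the
    # backported ssl.create_default_context() (PEP 466) gives secure defaults.
    ctx = ssl.create_default_context()
    resp = urllib2.urlopen("https://www.python.org/", context=ctx)
    print(resp.getcode())
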
Please try it out and let us know what you think. We welcome success stories, experiments, or benchmarks; we know you are using PyPy, so please tell us about it!
Cheers
The PyPy Team

Friday, March 13, 2015

Pydgin: Using RPython to Generate Fast Instruction-Set Simulators

Note: This is a guest blog post by Derek Lockhart and Berkin Ilbeyi from the Computer Systems Laboratory of Cornell University.

In this blog post I'd like to describe some recent work on using the RPython translation toolchain to generate fast instruction set simulators. Our open-source framework, Pydgin [a], provides a domain-specific language (DSL) embedded in Python for concisely describing instruction set architectures [b] and then uses these descriptions to generate fast, JIT-enabled simulators. Pydgin will be presented at the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), and in this post we provide a preview of that work. We also discuss some progress updates that occurred after the publishing deadline and will not appear in the final paper [1].

Our area of research expertise is computer architecture, which is perhaps an unfamiliar topic for some readers of the PyPy blog. Below we provide some brief background on hardware simulation in the field of computer architecture, as well as some context as to why instruction set simulators in particular are such an important tool.

Simulators: Designing Hardware with Software

For computer architects in both academia and industry, a key step in designing new computational hardware (e.g., CPUs, GPUs, and mobile system-on-chips) is simulation [c] of the target system. While numerous models for simulation exist, three classes are particularly important in hardware design.

Functional Level models simulate the behavior of the target system. These models are useful for creating a "golden" reference which can serve as an executable specification or alternatively as an emulation platform for software development.

Cycle Level models aim to simulate both the behavior and the approximate timing of a hardware component. These models help computer architects explore design tradeoffs and quickly determine things like how big caches should be, how many functional units are needed to meet throughput targets, and how the addition of a custom accelerator block may impact total system performance.

Register-Transfer Level (RTL) models specify the behavior, timing, and resources (e.g., registers, wires, logic gates) of a hardware component. RTL models are bit-accurate hardware specifications typically written in a hardware description language (HDL) such as Verilog or VHDL. Once verified through extensive simulation, HDL specifications can be passed into synthesis and place-and-route tools to estimate area/energy/timing or to create FPGA or ASIC prototypes.

An instruction set simulator (ISS) is a special kind of functional-level model that simulates the behavior of a processor or system-on-chip (SOC). ISSs serve an important role in hardware design because they model the instruction set architecture (ISA) interface: the contractual boundary between hardware designers and software developers. ISSs allow hardware designers to quickly experiment with adding new processor instructions while also allowing software developers to build new compilers, libraries, and applications long before physical silicon is available.

Instruction-Set Simulators Must be Fast and Productive

Instruction-set simulators are more important than ever because the ISA boundary has become increasingly fluid. While Moore's law has continued to deliver larger numbers of transistors which computer architects can use to build increasingly complex chips, limits in Dennard scaling have restricted how these transistors can be used [d]. In simpler terms, thermal constraints (and energy constraints in mobile devices) have resulted in a growing interest in pervasive specialization: using custom accelerators to more efficiently perform compute intensive tasks. This is already a reality for designers of mobile SOCs who continually add new accelerator blocks and custom processor instructions in order to achieve higher performance with less energy consumption. ISSs are indispensable tools in this SOC design process for both hardware architects building the silicon and software engineers developing the software stack on top of it.

An instruction set simulator has two primary responsibilities: 1) accurately emulating the external execution behavior of the target, and 2) providing observability by accurately reproducing the target's internal state (e.g., register values, program counter, status flags) at each time step. However, other qualities critical to an effective ISS are simulation performance and designer productivity. Simulation performance is important because shorter simulation times allow developers to more quickly execute and verify large software applications. Designer productivity is important because it allows hardware architects to easily experiment with adding new instructions and estimate their impact on application performance.

To improve simulation performance, high-performance ISSs use dynamic binary translation (DBT) as a mechanism to translate frequently visited blocks of target instructions into optimized sequences of host instructions. To improve designer productivity, many design toolchains automatically generate ISSs from an architectural description language (ADL): a special domain-specific language for succinctly specifying instruction encodings and instruction semantics of an ISA. Very few existing systems have managed to encapsulate the design complexity of DBT engines such that high-performance, DBT-accelerated ISSs could be automatically generated from ADLs [e]. Unfortunately, tools which have done so are either proprietary software or leave much to be desired in terms of performance or productivity.

Why RPython?

Our research group learned of the RPython translation toolchain through our experiences with PyPy, which we had used in conjunction with our Python hardware modeling framework to achieve significant improvements in simulation performance [2]. We realized that the RPython translation toolchain could potentially be adapted to create fast instruction set simulators since the process of interpreting executables comprised of binary instructions shared many similarities with the process of interpreting bytecodes in a dynamic-language VM. In addition, we were inspired by PyPy's meta-tracing approach to JIT-optimizing VM design which effectively separates the process of specifying a language interpreter from the optimization machinery needed to achieve good performance.

Existing ADL-driven ISS generators have tended to use domain-specific languages that require custom parsers or verbose C-based syntax that distracts from the instruction specification. Creating an embedded-ADL within Python provides several benefits over these existing approaches including a gentler learning curve for new users, access to better debugging tools, and easier maintenance and extension by avoiding a custom parser. Additionally, we have found that the ability to directly execute Pydgin ISA descriptions in a standard Python interpreter such as CPython or PyPy significantly helps debugging and testing during initial ISA exploration. Python's concise, pseudocode-like syntax also manages to map quite closely to the pseudocode specifications provided by many ISA manuals [f].

The Pydgin embedded-ADL

Defining a new ISA in the Pydgin embedded-ADL requires four primary pieces of information: the architectural state (e.g. register file, program counter, control registers), the bit encodings of each instruction, the instruction fields, and the semantic definitions for each instruction. Pydgin aims to make this process as painless as possible by providing helper classes and functions where possible.

For example, below we provide a truncated example of the ARMv5 instruction encoding table. Pydgin maintains encodings of all instructions in a centralized encodings data structure for easy maintenance and quick lookup. The user-provided instruction names and bit encodings are used to automatically generate decoders for the simulator. Unlike many ADLs, Pydgin does not require that the user explicitly specify instruction types or mask bits for field matching because the Pydgin decoder generator can automatically infer decoder fields from the encoding table.

encodings = [
  ['adc',      'xxxx00x0101xxxxxxxxxxxxxxxxxxxxx'],
  ['add',      'xxxx00x0100xxxxxxxxxxxxxxxxxxxxx'],
  ['and',      'xxxx00x0000xxxxxxxxxxxxxxxxxxxxx'],
  ['b',        'xxxx1010xxxxxxxxxxxxxxxxxxxxxxxx'],
  ['bl',       'xxxx1011xxxxxxxxxxxxxxxxxxxxxxxx'],
  ['bic',      'xxxx00x1110xxxxxxxxxxxxxxxxxxxxx'],
  ['bkpt',     '111000010010xxxxxxxxxxxx0111xxxx'],
  ['blx1',     '1111101xxxxxxxxxxxxxxxxxxxxxxxxx'],
  ['blx2',     'xxxx00010010xxxxxxxxxxxx0011xxxx'],
  # ...
  ['teq',      'xxxx00x10011xxxxxxxxxxxxxxxxxxxx'],
  ['tst',      'xxxx00x10001xxxxxxxxxxxxxxxxxxxx'],
]
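
As a hedged illustration of what inferring decoder fields can mean in practice (this is not the actual Pydgin decoder generator), each encoding string can be reduced to a mask/value pair that instruction words are matched against:

def mask_and_value( bits ):
  # Sketch only: turn an encoding string such as 'xxxx1010xxxxxxxx...' into a
  # (mask, value) pair, where 'x' bits are don't-cares.
  mask = value = 0
  for bit in bits:
    mask  = (mask  << 1) | (bit != 'x')
    value = (value << 1) | (bit == '1')
  return mask, value

# An instruction word matches the entry (name, bits) when word & mask == value.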

A major goal of Pydgin was ensuring instruction semantic definitions map to ISA manual specifications as much as possible. The code below shows one such definition for the ARMv5 add instruction. A user-defined Instruction class (not shown) specifies field names that can be used to conveniently access bit positions within an instruction (e.g. rd, rn, S). Additionally, users can choose to define their own helper functions, such as the condition_passed function, to create more concise syntax that better matches the ISA manual.

def execute_add( s, inst ):
  if condition_passed( s, inst.cond() ):
    a    = s.rf[ inst.rn() ]
    b, _ = shifter_operand( s, inst )
    result = a + b
    s.rf[ inst.rd() ] = trim_32( result )

    if inst.S():
      if inst.rd() == 15:
        raise FatalError('Writing SPSR not implemented!')
      s.N = (result >> 31)&1
      s.Z = trim_32( result ) == 0
      s.C = carry_from( result )
      s.V = overflow_from_add( a, b, result )

    if inst.rd() == 15:
      return

  s.rf[PC] = s.fetch_pc() + 4
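
The definition above leans on a handful of small helpers; a minimal sketch of what they could look like (the names come from the snippet above, and the actual Pydgin implementations may differ) is:

def trim_32( value ):
  return value & 0xFFFFFFFF

def carry_from( result ):
  # For a 32-bit addition of already-trimmed operands, the carry flag is set
  # when the result no longer fits in 32 bits.
  return result > 0xFFFFFFFF

def overflow_from_add( a, b, result ):
  # Signed overflow: both operands have the same sign bit, but the result's
  # sign bit differs.
  sign_a = (a >> 31) & 1
  sign_b = (b >> 31) & 1
  sign_r = (result >> 31) & 1
  return sign_a == sign_b and sign_a != sign_r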

Compared to the ARM ISA Reference manual shown below, the Pydgin instruction definition is a fairly close match. Pydgin's definitions could certainly be made more concise by using a custom DSL; however, this would lose many of the debugging benefits afforded by a well-supported language such as Python, and would additionally require a custom parser that would likely need modification for each new ISA.

if ConditionPassed(cond) then
    Rd = Rn + shifter_operand
    if S == 1 and Rd == R15 then
        if CurrentModeHasSPSR() then
            CPSR = SPSR
        else UNPREDICTABLE
    else if S == 1 then
        N Flag = Rd[31]
        Z Flag = if Rd == 0 then 1 else 0
        C Flag = CarryFrom(Rn + shifter_operand)
        V Flag = OverflowFrom(Rn + shifter_operand)

Creating an ISS that can run real applications is a rather complex task, even for a bare-metal simulator with no operating system such as Pydgin. Each system call in the C library must be properly implemented, and bootstrapping code must be provided to set up the program stack and architectural state. This is a very tedious and error-prone process, which Pydgin tries to encapsulate so that it remains as transparent to the end user as possible. In future versions of Pydgin we hope to make bootstrapping more painless and support a wider variety of C libraries.

Pydgin Performance

In order to achieve good simulation performance from Pydgin ISSs, significant work went into adding appropriate JIT annotations to the Pydgin library components. These optimization hints, which allow the JIT generated by the RPython translation toolchain to produce more efficient code, have been specifically selected for the unique properties of ISSs. For the sake of brevity, we do not talk about the exact optimizations here but a detailed discussion can be found in the ISPASS paper [1]. In the paper we evaluate two ISSs, one for a simplified MIPS ISA and another for the ARMv5 ISA, whereas below we only discuss results for the ARMv5 ISS.
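
As a rough sketch of what such annotations look like (simplified and hypothetical, not Pydgin's actual code), the fetch-decode-execute loop of an ISS is marked up with RPython's JitDriver so that the meta-tracing JIT can recognize hot loops in the simulated program:

from rpython.rlib import jit

# The simulated program counter acts as the "green" (position) variable,
# the mutable simulator state as the "red" variable.
jitdriver = jit.JitDriver( greens=['pc'], reds=['state'] )

def run( state ):
  while state.running:
    jitdriver.jit_merge_point( pc=state.pc, state=state )
    pc   = state.pc
    inst = state.fetch( pc )              # fetch/decode/execute interfaces are
    execute( state, decode( inst ) )      # named here only for illustration
    if state.pc < pc:                     # backward branch in the target program
      jitdriver.can_enter_jit( pc=state.pc, state=state )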

The performance of Pydgin-generated ARMv5 ISSs was compared against several reference ISSs: the gem5 ARM atomic simulator (gem5), interpretive and JIT-enabled versions of SimIt-ARM (simit-nojit and simit-jit), and QEMU. Atomic models from the gem5 simulator were chosen for comparison due to their wide usage amongst computer architects [g]. SimIt-ARM was selected because it is currently the highest-performance ADL-generated DBT-ISS publicly available. QEMU has long been held as the gold standard for DBT simulators due to its extremely high performance; however, QEMU is generally intended for usage as an emulator rather than a simulator [c] and therefore achieves its excellent performance at the cost of observability. Unlike QEMU, all other simulators in our study faithfully track architectural state at an instruction level rather than block level. Pydgin ISSs were generated with and without JITs using the RPython translation toolchain in order to help quantify the performance benefit of the meta-tracing JIT.

The figure below shows the performance of each ISS executing applications from the SPEC CINT2006 benchmark suite [h]. Benchmarks were run to completion on the high-performance DBT-ISSs (simit-jit, pydgin-jit, and QEMU), but were terminated after only 10 billion simulated instructions for the non-JITed interpretive ISSs (these would require many hours, in some cases days, to run to completion). Simulation performance is measured in MIPS [i] and plotted on a log scale due to the wide variance in performance. The WHMEAN group summarizes each ISS's performance across all benchmarks using the weighted harmonic mean.

A few points to take away from these results:

  • ISSs without JITs (gem5, simit-nojit, and pydgin-nojit) demonstrate relatively consistent performance across applications, whereas ISSs with JITs (simit-jit, pydgin-jit, and QEMU) demonstrate much greater performance variability from application-to-application.
  • The gem5 atomic model demonstrates particularly miserable performance, only 2-3 MIPS!
  • QEMU lives up to its reputation as a gold-standard for simulator performance, leading the pack on nearly every benchmark and reaching speeds of 240-1120 MIPS.
  • pydgin-jit is able to outperform simit-jit on four of the applications, including considerable performance improvements of 1.44–1.52× for the applications 456.hmmer, 462.libquantum, and 471.omnetpp (managing to even outperform QEMU on 471.omnetpp).
  • simit-jit is able to obtain much more consistent performance (230-459 MIPS across all applications) than pydgin-jit (9.6-659 MIPS). This is due to simit-jit's page-based approach to JIT optimization compared to pydgin-jit's tracing-based approach.
  • 464.h264ref displays particularly bad pathological behavior in Pydgin’s tracing JIT and is the only application to perform worse on pydgin-jit than pydgin-nojit (9.6 MIPS vs. 21 MIPS).

The pathological behavior demonstrated by 464.h264ref was of particular concern because it caused pydgin-jit to perform even worse than having no JIT at all. RPython JIT logs indicated that the reason for this performance degradation was a large number of tracing aborts due to JIT traces growing too long. However, time limitations before the publication deadline prevented us from investigating this issue thoroughly.

Since the deadline we've applied some minor bug fixes and made some small improvements in the memory representation. More importantly, we've addressed the performance degradation in 464.h264ref by increasing trace lengths for the JIT. Below we show how the performance of 464.h264ref changes as the trace_limit parameter exposed by the RPython JIT is varied from the default size of 6000 operations.

By quadrupling the trace limit we achieve an 11x performance improvement in 464.h264ref. The larger trace limit allows the JIT to optimize long code paths that were previously triggering trace aborts, greatly helping amortize the costs of tracing. Note that arbitrarily increasing this limit can potentially hurt performance if longer traces are not able to detect optimizable code sequences.
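
For reference, the RPython JIT exposes this knob programmatically; a hedged one-liner (how Pydgin actually wires it up may differ) looks like:

from rpython.rlib import jit

# Sketch: raise the trace limit from the default of 6000 operations.
jit.set_param( jitdriver, "trace_limit", 24000 )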

After performing similar experiments across the applications in the SPEC CINT2006 benchmark suite, we settled on a trace limit of 400,000 operations. In the figure below we show how the updated Pydgin ISS (pydgin-400K) improves performance across all benchmarks and fixes the performance degradation previously seen in 464.h264ref. Note that the non-JITted simulators have been removed for clarity, and simulation performance is now plotted on a linear scale to more clearly distinguish the performance gap between each ISS.

With these improvements, we are now able to beat simit-jit on all but two benchmarks. In future work we hope to further close the gap with QEMU as well.

Conclusions and Future Work

Pydgin demonstrates that the impressive work put into the RPython translation toolchain, designed to simplify the process of building fast dynamic-language VMs, can also be leveraged to build fast instruction set simulators. Our prototype ARMv5 ISS shows that Pydgin can generate ISSs with performance competitive to SimIt-ARM while also providing a more productive development experience: RPython allowed us to develop Pydgin with only four person-months of work. Another significant benefit of the Pydgin approach is that any performance improvements applied to the RPython translation toolchain immediately benefit Pydgin ISSs after a simple software download and retranslation. This allows Pydgin to track the continual advances in JIT technology introduced by the PyPy development team.

Pydgin is very much a work in progress. There are many features we would like to add, including:

  • more concise syntax for accessing arbitrary instruction bits
  • support for other C libraries such as glibc, uClibc, and musl (we currently only support binaries compiled with newlib)
  • support for self-modifying code
  • features for more productive debugging of target applications
  • ISS descriptions for other ISAs such as RISC-V, ARMv8, and x86
  • automatic generation of compilers and toolchains from Pydgin descriptions

In addition, we think there are opportunities for even greater performance improvements with more advanced techniques such as:

  • automatic generation of optimized instruction decoders
  • optimizations for floating-point intensive applications
  • multiple tracing-JITs for parallel simulation of multicore SOCs
  • a parallel JIT compilation engine as proposed by Böhm et al. [3]

We hope that Pydgin can be of use to others, so if you try it out please let us know what you think. Feel free to contact us if you find any of the above development projects interesting, or simply fork the project on GitHub and hack away!

-- Derek Lockhart and Berkin Ilbeyi

Acknowledgements

We would like to sincerely thank Carl Friedrich Bolz and Maciej Fijalkowski for their feedback on the Pydgin publication and their guidance on improving the JIT performance of our simulators. We would also like to thank the whole PyPy team for their incredible work on PyPy and the RPython translation toolchain. Finally, thank you to our research advisor, Prof. Christopher Batten, and the sponsors of this work, which include the National Science Foundation, the Defense Advanced Research Projects Agency, and Intel Corporation.

Footnotes

[a] Pydgin loosely stands for [Py]thon [D]SL for [G]enerating [In]struction set simulators and is pronounced the same as “pigeon”. The name is inspired by the word “pidgin”, which is a grammatically simplified form of language and captures the intent of the Pydgin embedded-ADL. https://github.com/cornell-brg/pydgin
[b] Popular instruction set architectures (ISAs) include MIPS, ARM, x86, and more recently RISC-V.
[c] For a good discussion of simulators vs. emulators, please see the following post on StackOverflow: http://stackoverflow.com/questions/1584617/simulator-or-emulator-what-is-the-difference
[d] http://en.wikipedia.org/wiki/Dark_silicon
[e] Please see the Pydgin paper for a more detailed discussion of prior work.
[f]

For more examples of Pydgin ISA specifications, please see the ISPASS paper [1] or the Pydgin source code on GitHub.

Pydgin instruction definitions for a simple MIPS-inspired ISA can be found here:

Pydgin instruction definitions for a simplified ARMv5 ISA can be found here:

[g]

gem5 is a cycle-level simulation framework that contains both functional-level (atomic) and cycle-level processor models. Although primarily used for detailed, cycle-approximate processor simulation, gem5's atomic model is a popular tool for many ISS tasks.

[h] All performance measurements were taken on an unloaded server-class machine.
[i] Millions of instructions per second.

References

[1]

Derek Lockhart, Berkin Ilbeyi, and Christopher Batten. "Pydgin: Generating Fast Instruction Set Simulators from Simple Architecture Descriptions with Meta-Tracing JIT Compilers." IEEE Int'l Symp. on Performance Analysis of Systems and Software (ISPASS), Mar. 2015.

[2]

Derek Lockhart, Gary Zibrat, and Christopher Batten. "PyMTL: A Unified Framework for Vertically Integrated Computer Architecture Research." 47th ACM/IEEE Int'l Symp. on Microarchitecture (MICRO-47), Dec. 2014.

[3] I. Böhm, B. Franke, and N. Topham. "Generalized Just-In-Time Trace Compilation Using a Parallel Task Farm in a Dynamic Binary Translator." ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Jun. 2011.