On the speed of pypy

I had heard that pypy was fast. Like really fast.

Well, it's true! In the following post, I'll show you how one data mining algorithm went from 23 seconds (cpython) to 4 seconds (pypy). Without any modification, tweak, or special compiler/interpreter switch. I actually installed pypy 1.5.2 from the archlinux community repository so I did not compile it.

The Story

I recently searched for an implementation of a frequent item set mining algorithm in Python but I could not find a library that was easy to use and that implemented a recent algorithm (apriori is quite old now).

I ended up implementing three frequent item set mining algorithms in python and, although I was pleased with the result, I found that they were slow. The following session with Python 3.2.1 shows how much time it took to run 10 rounds of three different frequent item set mining algorithms with the library I created:

     Python 2.7.2 (default, Jun 29 2011, 11:10:00)
     [GCC 4.6.1] on linux2
     Type "help", "copyright", "credits" or "license" for more...
     >>> from pymining import itemmining
     >>> itemmining.test_perf(seed=200)
     Random transactions generated

     Done round 0
     Done round 1
     Done round 2
     Done round 3
     Done round 4
     Done round 5
     Done round 6
     Done round 7
     Done round 8
     Done round 9
     FP-Growth (pruning off) took: 97.9762558937
     Computed 1491 frequent item sets.
     Done round 0
     Done round 1
     Done round 2
     Done round 3
     Done round 4
     Done round 5
     Done round 6
     Done round 7
     Done round 8
     Done round 9
     Relim took: 22.12490201
     Computed 1491 frequent item sets.
     Done round 0
     Done round 1
     Done round 2
     Done round 3
     Done round 4
     Done round 5
     Done round 6
     Done round 7
     Done round 8
     Done round 9
     Sam took: 50.3823070526
     Computed 1491 frequent item sets.

This morning, I thought about pypy and gave it a shot:

     Python 2.7.1 (?, May 28 2011, 20:50:19)
     [PyPy 1.5.0-alpha0 with GCC 4.6.0] on linux2
     Type "help", "copyright", "credits" or "license" for more...
     And now for something completely different: ...
     >>> from pymining import itemmining
     >>> itemmining.test_perf(seed=200)
     Random transactions generated

     Done round 0
     Done round 1
     Done round 2
     Done round 3
     Done round 4
     Done round 5
     Done round 6
     Done round 7
     Done round 8
     Done round 9
     FP-Growth (pruning off) took: 21.6948671341
     Computed 1491 frequent item sets.
     Done round 0
     Done round 1
     Done round 2
     Done round 3
     Done round 4
     Done round 5
     Done round 6
     Done round 7
     Done round 8
     Done round 9
     Relim took: 4.04977107048
     Computed 1491 frequent item sets.
     Done round 0
     Done round 1
     Done round 2
     Done round 3
     Done round 4
     Done round 5
     Done round 6
     Done round 7
     Done round 8
     Done round 9
     Sam took: 20.4720339775
     Computed 1491 frequent item sets.

Yup, you read this correctly. cpython was ~5 times slower than pypy for FP-Growth and relim, and ~2 times slower for sam.

Although pypy is speedy, it still does not work with many c extensions (particularly cython extensions), so I have to do some of my work in two virtual environments. For a few seconds, I would not bother, but considering the kind of data I work with these days, it could save me hours.

Interested in pymining?

Fork it on github! Or just install it with pip:

pip install -e git://github.com/bartdag/pymining#egg=pymining