Introduction

Here I will discuss uses of Numba and NumPy's ctypes interface, comparing them with each other and with a pure Python implementation.

There are many nice benchmarks of optimizing Python using Numba, NumPy, or other libraries; for example, see [1], where a clean comparison of optimizations using various libraries is provided with profiling. In contrast, here I will put the comparison in the context of a real-world example from a project that I have been working on, an overview of which can be found at [2]. Because the routine to optimize is part of a larger program, this provides a realistic estimate of how much mileage we get from Numba and NumPy's ctypes interface, although it does not give us as clear-cut a comparison as [1].

TL;DR of this post

  1. Numba's performance is great when it works: it's close to C.
  2. But when Numba does not compile a given piece of code, it is quite difficult to make it work.
  3. Try compiling the code you want to optimize with Numba with little to no modification; if it does not work, it may be easier to write the routine in C and use NumPy's ctypes interface than to debug the Numba compilation.

Background

The numerical problem we want to solve is briefly described in [2] and goes as follows: we have a first-order differential equation describing a flow at every point on a given two-dimensional surface, and we want to find solutions of the differential equation for a set of initial conditions.

Method 1: Using SciPy's ODE solver

A natural starting point when solving a differential equation in a Python program is SciPy's ODE solver. Let me emphasize that, because this is a totally different approach from using Numba or NumPy's ctypes interface, it may not be a faithful head-to-head comparison, but it is a good baseline. And because the core of the solver library is written in Fortran, we expect it to be reasonably fast, and indeed that is the case. After we cast the differential equation into the form \((\frac{dz}{dt}, \frac{dx}{dt}) = f(z, x)\), a simplified version of the relevant code looks like the following:

ode = scipy.integrate.ode(f)
ode.set_integrator('zvode')

step = 0
ode.set_initial_value(z[step], x[step])
while step < (array_size - 1) and ode.successful():
    step += 1
    z[step], x[step] = ode.integrate(ode.t + dt)

For the entire code, see class GrowLibs and grow() in http://github.com/chan-y-park/loom/blob/master/loom/s_wall.py. Now let's run it from a Jupyter/IPython notebook with %%timeit to measure how long it takes to generate a result for the default configuration.
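As a minimal, self-contained illustration of the same solver setup (using a made-up toy equation, not the project's actual flow), the zvode integrator can be exercised like this:

```python
import numpy
from scipy.integrate import ode

# Toy equation dx/dt = i*x with x(0) = 1, whose exact solution is exp(i*t);
# this stands in for the project's f(z, x), which is not reproduced here.
def f(t, x):
    return 1j * x

solver = ode(f)
solver.set_integrator('zvode')  # complex-valued ODE integrator
solver.set_initial_value(1.0 + 0.0j, 0.0)

dt = 0.01
while solver.successful() and solver.t < 1.0:
    solver.integrate(solver.t + dt)

# The numerical solution should be close to the exact one.
error = abs(solver.y[0] - numpy.exp(1j * solver.t))
```

Timing this kind of loop with %%timeit in a notebook is what produces the numbers quoted below.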

As seen in the profiling results, SciPy's ODE solver is reasonably fast, although not as fast as the alternatives below. Rather, the issue was that, for certain configurations other than the simple one used here, the numerical accuracy of the solutions obtained via the ODE solver is not good enough in the context of the specific problem we are trying to solve. There is probably a way to fine-tune various parameters of the ODE solver, but the core is written in Fortran (see http://www.netlib.org/ode/zvode.f for the Fortran source code), so I found it hard to read the code and make the right guesses when tuning the parameters.

To summarize, SciPy's ODE solver is a general-purpose solver that is easy to use and has good performance, but once you have the program working and want to optimize the solver part, it may be better to look for an alternative that can be specialized to your problem, which is what I did in my project.

Method 2: Finding roots of a complex polynomial

Thanks to the specific setting of our problem, solving the differential equation is equivalent to finding a solution of \(f(z, x) = 0\), where \(z\) and \(x\) are complex variables and the value of \(z\) is provided along the flow from the differential equation. Therefore, for a given \(z = z_0\), the numerical problem becomes finding the roots of \(f(z_0, x) = 0\) and picking the root that is closest to the value of \(x\) at the previous step along the flow.

Method 2.1: Using NumPy

One way to attack the problem is using numpy.roots to find all the roots and then pick among them the one that is closest to the previous solution. A simplified version of the Python code looks like the following:

x, z = sympy.symbols('x z')

def get_xs(f, z_0):
    fx = f.subs(z, z_0)
    sym_poly = sympy.Poly(fx, x, domain='CC')
    coeff_list = [complex(c) for c in sym_poly.all_coeffs()]
    return numpy.roots(coeff_list)

step = 0
while step < (array_size - 1):
    z_i = z[step]
    x_i = x[step]
    xs = get_xs(f, z_i)

    step += 1

    z[step] = z_i + dt * dz_dt(z_i, x_i)
    x[step] = xs[numpy.argmin(abs(xs - x_i))]

For the entire code, see SWCurve.get_xs() in http://github.com/chan-y-park/loom/blob/master/loom/geometry.py and grow() in http://github.com/chan-y-park/loom/blob/master/loom/s_wall.py. Let's run this implementation and compare its performance to that of the ODE solver.

Thanks to NumPy, this is not much slower than using SciPy's ODE solver: 119 ms vs. 152 ms. Of course these are two different algorithms, so this result is not a performance comparison between SciPy and Python+NumPy; rather it is a comparison of two different ideas, which is what usually happens in real life.

One way to optimize the above routine comes from the observation that we don't have to find all the roots of a given complex polynomial, only the one nearest to the previous solution, and this is the perfect setting for using Newton's method to find a root of the polynomial.
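In its simplest form, Newton's method iterates \(x \leftarrow x - p(x)/p'(x)\) until the correction is small; here is a minimal sketch on a made-up polynomial (not the project's \(f\)), starting from a guess near one of its roots:

```python
# Hypothetical illustration: Newton's method on p(x) = x**3 - 1,
# starting from a guess near the root x = 1.
def newton(p, dp, x0, accuracy=1e-12, max_steps=100):
    x = x0
    for _ in range(max_steps):
        delta = p(x) / dp(x)  # Newton correction p(x)/p'(x)
        x -= delta
        if abs(delta) < accuracy:
            break
    return x

root = newton(lambda x: x ** 3 - 1, lambda x: 3 * x ** 2, 0.9 + 0.1j)
# root converges to the nearby cube root of unity, 1+0j
```

Starting from the previous solution along the flow plays exactly the role of the initial guess x0 here.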

Method 2.2: Implementing Newton's method with Numba

Because we find the solutions incrementally starting from the initial condition, we can use Newton's method: for a given \((z, x) = (z_i, x_i)\) satisfying \(f(z_i, x_i) = 0\), we want to find \(x_{i+1}\) such that \(z_{i+1} = z_{i} + \frac{\partial z}{\partial t} \Delta t\) and \(f(z_{i+1}, x_{i+1}) = 0\), where \(\Delta t\) is the step size of the flow vector from the differential equation. The task therefore boils down to implementing Newton's method in this context. First we try implementing it using Numba. A simplified version of the relevant code in http://github.com/chan-y-park/loom/blob/master/loom/geometry.py is given below.

@numba.jit(nopython=True)
def f_df_at_zx(N, f_k_n_czes, f_k_d_czes, z_0, x_0):
    """
    Calculate f(z_0, x_0) and df(z_0, x_0)/dx,
    which are used in Newton's method
    to find the root of f(z_0, x) near x = x_0.

    f_k_n_czes = [(k, c_n, e_n), ...],
    f_k_d_czes = [(k, c_d, e_d), ...],
    where
        f(z, x) = x^N + ... + (c_n*z^e_n + ...) / (c_d*z^e_d + ...) * x^k + ...
    """
    f_0 = x_0 ** N
    df_0 = N * (x_0 ** (N - 1))
    f_ns = numpy.zeros(N, dtype=numpy.complex128)
    f_ds = numpy.zeros(N, dtype=numpy.complex128)

    for k, c, e in f_k_n_czes:
        f_ns[k] += c * (z_0 ** e)

    for k, c, e in f_k_d_czes:
        f_ds[k] += c * (z_0 ** e)

    for k in range(N):
        f_k = f_ns[k] / f_ds[k]
        f_0 += f_k * (x_0 ** k)
        if k > 0:
            df_0 += k * f_k * x_0 ** (k - 1)

    return f_0, df_0


@numba.jit(nopython=True)
def get_x(N, phi_k_n_czes, phi_k_d_czes, z_0, x_0, accuracy, max_steps=100):
    """
    Solve f(z_0, x) = 0 using Newton's method,
    using x = x_0 as the initial guess of the solution.
    """
    step = 0
    x_i = x_0
    while step < max_steps:
        f_i, df_i = f_df_at_zx(N, phi_k_n_czes, phi_k_d_czes, z_0, x_i)
        Delta = f_i / df_i
        if abs(Delta) < accuracy:
            break
        else:
            x_i -= Delta
        step += 1
    return x_i

There are two remarks I want to make before we move on to the performance comparison. First of all, as you can see from the above code (and from the documentation of Numba), it is recommended to write code in a straightforward manner, e.g. using for loops instead of list comprehensions, so that the Numba compiler has fewer problems.
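As a toy illustration of this rewriting (not code from the project), here is the same computation in comprehension style and in the explicit-loop style over a preallocated array, which nopython mode handles most reliably; decorating the loop version with @numba.jit(nopython=True) is then straightforward:

```python
import numpy

zs = numpy.linspace(0, 1, 5) + 1j

# Comprehension style: idiomatic Python, but harder for nopython mode.
squares_comp = numpy.array([z ** 2 for z in zs])

# Loop style: the form you would decorate with @numba.jit(nopython=True).
def squares_loop(zs):
    out = numpy.empty(len(zs), dtype=numpy.complex128)
    for i in range(len(zs)):
        out[i] = zs[i] ** 2
    return out

squares = squares_loop(zs)
```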

Secondly, I was not able to compile the entire routine for solving the differential equation, due to Numba compile errors that I could not figure out where or how to fix. Only the functions implementing Newton's method were compiled with Numba, although this alone gives a good performance boost, as we will see soon. Because one has to write Python code like C for it to compile with Numba, and because it is often more difficult to figure out a Numba compilation error than a C error, my suggestion would be: try putting @numba.jit on the code to be optimized without much modification. If it works, great; if it does not compile, don't try too hard to make it work, but rather move on to a different optimization, especially if the code is long and complicated. Even then, the effort put into writing the Python code is not in vain, as it serves as a good prototype as well as a baseline against which to compare results and performance!

Finally, let's run the Numba-compiled code and see how much faster the whole thing runs.

Compared to the previous results, which take more than 100 ms, we see that Numba gives us a good boost in performance. But a more interesting question is: how does it fare against C? In the following we will implement the same routine in C, unlike the previous examples, whose underlying algorithms are rather different from this one. Therefore it will be a real performance comparison between Numba and C.

Method 2.3: Implementing Newton's method with C

All the relevant C code can be found at https://github.com/chan-y-park/loom/tree/master/loom/clibs. In particular, https://github.com/chan-y-park/loom/blob/master/loom/clibs/solve.c contains the counterparts of the above Python code.

#include <stdio.h>
#include <complex.h>
#include "solve.h"

newton_params f_df_dx_0(
    int N,
    diff_params f_n,
    diff_params f_d,
    double complex z_0,
    double complex x_0
) {
    double complex f_[N][2];
    double complex f_0 = cpow(x_0, N);
    double complex df_dx_0 = N * cpow(x_0, N - 1);

    int i;
    int k;
    double complex c;
    double e;

    double complex f_k;

    newton_params newton;

    for(i = 0; i < N; i++) {
        f_[i][0] = 0;
        f_[i][1] = 0;
    }

    for(i = 0; i < f_n.n; i++) {
        k = f_n.k[i];
        c = f_n.c[i];
        e = f_n.e[i];
        f_[k][0] += c * cpow(z_0, e);
    }

    for(i = 0; i < f_d.n; i++) {
        k = f_d.k[i];
        c = f_d.c[i];
        e = f_d.e[i];
        f_[k][1] += c * cpow(z_0, e);
    }

    for(k = 0; k < N; k++) {
        f_k = f_[k][0] / f_[k][1];
        f_0 += f_k * cpow(x_0, k);
        if (k > 0) {
            df_dx_0 += k * f_k * cpow(x_0, (k - 1));
        }
    }
    newton.f = f_0;
    newton.df_dx = df_dx_0;
    return newton;
}

double complex get_x(
    int N,
    diff_params f_n,
    diff_params f_d,
    double complex z_0,
    double complex x_0,
    double accuracy,
    int max_steps
) {
    int step = 0;
    double complex x_i = x_0;
    newton_params newton;
    double complex Delta;

    while (step < max_steps) {
        newton = f_df_dx_0(N, f_n, f_d, z_0, x_i);
        Delta = newton.f / newton.df_dx;
        x_i -= Delta;

        if (cabs(Delta) < accuracy) {
            break;
        } else {
            step++;
        }
    }
    return x_i;
}

You can easily see that these functions look much like the Python code compiled by Numba; in that sense, writing Python code for Numba does not give us a great advantage over writing C, because we cannot freely use Python at its full capacity. Of course we have to do extra work to provide a Python wrapper for the C code if we want to use it in a Python program, but it is not much work at all if we use NumPy arrays and NumPy's ctypes interface, which is discussed in [3]. Now it's time to run this and see how much mileage we get from all the hard work.
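The wrapper mechanics are discussed in [3]; as a minimal sketch of the ctypes side (calling the standard C library's pow instead of the project's solve.c, so it runs without compiling anything), declaring the argument and return types is the essential step:

```python
import ctypes
import ctypes.util

# Load the standard C math library; for the project one would instead load
# the shared library built from solve.c, e.g. via numpy.ctypeslib.load_library.
libm = ctypes.CDLL(ctypes.util.find_library('m') or 'libm.so.6')

# Without these declarations ctypes would default to int arguments/results.
libm.pow.restype = ctypes.c_double
libm.pow.argtypes = [ctypes.c_double, ctypes.c_double]

result = libm.pow(2.0, 10.0)  # 1024.0
```

For NumPy arrays, numpy.ctypeslib.ndpointer can be used in argtypes so that arrays are passed to C without copying.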

It's 81.5 ms / 68.8 ms = 118%, i.e. the C version is about 18% faster, which is quite significant. It is not an overwhelming performance advantage that justifies changing all the Numba code to C, considering the overhead of preparing a C-Python interface and the burden of debugging both C and Python code at the same time. But if a new routine has to be implemented that will be more than a few lines of Python code and will be critical to the performance of the program, this speed difference is big enough to choose C over Numba.
