Towards Out-of-core ND-Arrays -- Benchmark MatMul
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
tl;dr We benchmark dask on an out-of-core dot product. We also compare and motivate the use of an optimized BLAS.
Here are the results
| Performance (GFLOPS) | In-Memory | On-Disk | 
|---|---|---|
| Reference BLAS | 6 | 18 | 
| OpenBLAS one thread | 11 | 23 | 
| OpenBLAS four threads | 22 | 11 | 
Disclaimer: This post is on experimental buggy code. This is not ready for public use.
Introduction
This is the fourth in a sequence of posts constructing an out-of-core nd-array using NumPy, Blaze, and dask. You can view these posts here:
We now give performance numbers on out-of-core matrix-matrix multiplication.
Matrix-Matrix Multiplication
Dense matrix-matrix multiplication is compute-bound, not I/O bound. We spend most of our time doing arithmetic and relatively little time shuffling data around. As a result we may be able to read large data from disk without performance loss.
When multiplying two $n\times n$ matrices we read $n^2$ bytes but perform $n^3$ computations. There are $n$ computations to do per byte so, relatively speaking, I/O is cheap.
We normally measure speed for single CPUs in Giga Floating Point Operations Per Second (GFLOPS). Lets look at how my laptop does on single-threaded in-memory matrix-matrix multiplication using NumPy.
>>> import numpy as np
>>> x = np.ones((1000, 1000), dtype='f8')
>>> %timeit x.dot(x)  # Matrix-matrix multiplication
10 loops, best of 3: 176 ms per loop
>>> 1000 ** 3 / 0.176 / 1e9  # n cubed computations / seconds / billion
>>> 5.681818181818182
OK, so NumPy’s matrix-matrix multiplication clocks in at 6 GFLOPS more or
less. The np.dot function ties in to the GEMM operation in the BLAS
library on my machine. Currently my numpy just uses reference BLAS. (you can
check this with np.show_config().)
Matrix-Matrix Multiply From Disk
For matrices too large to fit in memory we compute the solution one part at a time, loading blocks from disk when necessary. We parallelize this with multiple threads. Our last post demonstrates how NumPy+Blaze+Dask automates this for us.
We perform a simple numerical experiment, using HDF5 as our on-disk store.
We install stuff
conda install -c blaze blaze
pip install git+https://github.com/ContinuumIO/dask
We set up a fake dataset on disk
>>> import h5py
>>> f = h5py.File('myfile.hdf5')
>>> A = f.create_dataset(name='A', shape=(200000, 4000), dtype='f8',
...                                chunks=(250, 250), fillvalue=1.0)
>>> B = f.create_dataset(name='B', shape=(4000, 4000), dtype='f8',
...                                chunks=(250, 250), fillvalue=1.0)
We tell Dask+Blaze how to interact with that dataset
>>> from dask.obj import Array
>>> from blaze import Data, into
>>> a = into(Array, 'myfile.hdf5::/A', blockshape=(1000, 1000))  # dask things
>>> b = into(Array, 'myfile.hdf5::/B', blockshape=(1000, 1000))
>>> A = Data(a)  # Blaze things
>>> B = Data(b)
We compute our desired result, storing back onto disk
>>> %time into('myfile.hdf5::/result', A.dot(B))
2min 49s
>>> 200000 * 4000 * 4000 / 169 / 1e9
18.934911242
18.9 GFLOPS, roughly 3 times faster than the in-memory solution. At first glance this is confusing - shouldn’t we be slower coming from disk? Our speedup is due to our use of four cores in parallel. This is good, we don’t experience much slowdown coming from disk.
It’s as if all of our hard drive just became memory.
OpenBLAS
Reference BLAS is slow; it was written long ago. OpenBLAS is a modern
implementation. I installed OpenBLAS with my system installer (apt-get) and
then reconfigured and rebuilt numpy. OpenBLAS supports many cores. We’ll show
timings with one and with four threads.
export OPENBLAS_NUM_THREADS=1
ipython
>>> import numpy as np
>>> x = np.ones((1000, 1000), dtype='f8')
>>> %timeit x.dot(x)
10 loops, best of 3: 89.8 ms per loop
>>> 1000 ** 3 / 0.0898 / 1e9  # compute GFLOPS
11.135857461024498
export OPENBLAS_NUM_THREADS=4
ipython
>>> %timeit x.dot(x)
10 loops, best of 3: 46.3 ms per loop
>>> 1000 ** 3 / 0.0463 / 1e9  # compute GFLOPS
21.598272138228943
This is about four times faster than reference. If you’re not already
parallelizing in some other way (like with dask) then you should use a modern
BLAS like OpenBLAS or MKL.
OpenBLAS + dask
Finally we run on-disk our experiment again, now with OpenBLAS. We do this both with OpenBLAS running with one thread and with many threads.
We’ll skip the code (it’s identical to what’s above) and give a comprehensive table of results below.
Sadly the out-of-core solution doesn’t improve much by using OpenBLAS. Acutally when both OpenBLAS and dask try to parallelize we lose performance.
Results
| Performance (GFLOPS) | In-Memory | On-Disk | 
|---|---|---|
| Reference BLAS | 6 | 18 | 
| OpenBLAS one thread | 11 | 23 | 
| OpenBLAS four threads | 22 | 11 | 
tl:dr When doing compute intensive work, don’t worry about using disk, just don’t use two mechisms of parallelism at the same time.
Main Take-Aways
- We don’t lose much by operating from disk in compute-intensive tasks
- Actually we can improve performance when an optimized BLAS isn’t avaiable.
- Dask doesn’t benefit much from an optimized BLAS. This is sad and surprising. I expected performance to scale with single-core in-memory performance. Perhaps this is indicative of some other limiting factor
- One shouldn’t extrapolate too far with these numbers. They’re only relevant for highly compute-bound operations like matrix-matrix multiply
Also, thanks to Wesley Emeneker for finding where we were leaking memory, making results like these possible.
blog comments powered by Disqus