This is a Python package that contains a parallel implementation of
a subset of numpy.einsum's functionality.
numpy.einsum is a fantastic function for multiplying numpy arrays.
However, numpy.dot and numpy.tensordot are typically faster, especially if numpy
is linked to a parallel implementation of BLAS:
then numpy.dot and numpy.tensordot will take advantage of the multiple
CPUs, whereas numpy.einsum remains single-threaded.
The trouble is, some einsum products are impossible to express as
dot or tensordot. For example,
numpy.einsum("ijm,kmi->ijk", A, B)
returns a tensor C with C[i,j,k] = Σ_m A[i,j,m] B[k,m,i].
This cannot be expressed as a dot or tensordot because the shared
axis i is included in the output C.
This operation can be expressed in terms of numpy.matmul, and in particular
this example is equivalent to numpy.matmul(A, np.transpose(B)).
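As a quick sanity check of this equivalence, the following snippet (using numpy.einsum rather than einsum2, with arbitrary small shapes chosen for illustration) verifies it:

```python
import numpy as np

# Verify the identity from the text:
# einsum("ijm,kmi->ijk", A, B) == matmul(A, np.transpose(B))
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3, 5))   # shape (i, j, m)
B = rng.standard_normal((2, 5, 4))   # shape (k, m, i)

C_einsum = np.einsum("ijm,kmi->ijk", A, B)
# np.transpose with no axes argument reverses the axes: (k, m, i) -> (i, m, k),
# so matmul broadcasts over the shared axis i and contracts over m.
C_matmul = np.matmul(A, np.transpose(B))

assert C_einsum.shape == (4, 3, 2)
assert np.allclose(C_einsum, C_matmul)
```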
However, on my machine, numpy.matmul does not appear to take advantage
of parallel BLAS with multiple cores.
This may change in the future (last year Intel introduced
batched GEMM operations), but in the meantime, you can use einsum2
to parallelize such einsum calls.
einsum2 is also compatible with the autograd package for automatic
differentiation.
The prerequisites are Cython, numpy, pip, and a C compiler that supports OpenMP.
Assuming you satisfy these requirements, you can install by running
pip install .
from the top-level directory of einsum2.
If you have problems installing or running, it's probably because
of your C compiler.
On OSX in particular, the default compiler, clang, does not support OpenMP and thus won't work.
I've also occasionally run into issues using Anaconda Python on Linux,
apparently because of the old gcc-4 it was compiled with; however, I'm currently unable to reproduce this error.
In case you are having issues with the C compiler, it is recommended
to install and run einsum2 in a virtual environment with a compatible C compiler.
For example, if you use Anaconda Python, you can do the following:
- Create a new virtual environment named `env_name` with `conda create -n env_name python=3.5 anaconda` (alternatively, you can use `python=2.7 anaconda`).
- Switch to the new environment with `source activate env_name`.
- Do `conda install gcc` to install the Anaconda distribution of `gcc`, which both supports OpenMP and is fully compatible with the Anaconda Python. Note this clobbers your system `gcc`, which is why we are using a virtual environment here.
- Finally, install einsum2 with `CC=gcc pip install .` (the `CC=gcc` is required on OSX, otherwise the installation will try to use the default `clang` compiler that does not support OpenMP).
To test whether the parallelization is working, try
sh test/profile_einsum2.sh; you should see a speedup for
the function calls using multiple threads (OMP_NUM_THREADS > 1).
The main function, einsum2.einsum2, computes einsum products that can be expressed
in terms of matmul, transpose, and reshape operations.
It can either have the form
einsum2.einsum2(a, a_sublist, b, b_sublist, out_sublist)
or
einsum2.einsum2(subscripts_str, a, b)
Here a and b are tensors.
In the first format, a_sublist and b_sublist label the indices of a and b,
while out_sublist gives the indices of the output array.
In the second format, subscripts_str is a string specifying the subscripts of a, b, and
the output, and looks like a_subs,b_subs->out_subs.
The official numpy.einsum documentation provides more details about specifying the subscripts. As an example, the following two calls are equivalent:
einsum2(A, [0,1,2,3], B, [4,1,0], [0,2,4])
and
einsum2("ijkm,nji->ikn", A, B)
both returning a tensor C with C[i,k,n] = Σ_{j,m} A[i,j,k,m] B[n,j,i].
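numpy.einsum supports these same two calling conventions, so it can be used to illustrate what the two einsum2 calls above compute (the array shapes below are arbitrary, chosen only to match the subscripts):

```python
import numpy as np

# Illustrate the two equivalent calling conventions using numpy.einsum,
# which accepts the same sublist and subscript-string formats.
rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3, 4, 5))  # subscripts i, j, k, m
B = rng.standard_normal((6, 3, 2))     # subscripts n, j, i

C_sublist = np.einsum(A, [0, 1, 2, 3], B, [4, 1, 0], [0, 2, 4])
C_string = np.einsum("ijkm,nji->ikn", A, B)

assert C_sublist.shape == (2, 4, 6)    # output axes (i, k, n)
assert np.allclose(C_sublist, C_string)
```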
Unlike numpy.einsum, we allow the subscripts in the first format to be a list of
arbitrary hashable keys, so
einsum2(A, ["zero","one","two","three"], B, ["four","one","zero"], ["zero","two","four"])
would also be allowed.
einsum2 has several limitations compared to numpy.einsum: it only operates
on two input arrays, does not allow diagonal operations
(repeated subscripts on the same array), and requires the output
subscripts to always be specified.
Unlike the standard einsum, einsum2 will perform computations
in parallel. The number of parallel threads is selected automatically,
but you can also control this with the environment variable
OMP_NUM_THREADS.
To perform the parallel computation, einsum2 will use numpy.dot if possible; otherwise it will use a parallel for loop. The advantage of using numpy.dot is that it calls into BLAS, which is much faster than a for loop. However, you need to make sure numpy is compiled against a parallel BLAS implementation such as MKL or OpenBLAS. You won't need to worry about this for most packaged, precompiled versions of numpy (e.g. Anaconda Python).
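To see why reducing to numpy.dot matters, here is a minimal sketch (not einsum2's actual code) of the reshape-to-dot trick: a product like "ijk,kl->ijl" contracts only the last axis of the first array, so it can be computed with a single 2-D np.dot, which dispatches to BLAS.

```python
import numpy as np

def dot_via_reshape(A, B):
    # Hypothetical helper for illustration: compute einsum("ijk,kl->ijl")
    # as one big GEMM by flattening the non-contracted axes of A.
    i, j, k = A.shape
    out = np.dot(A.reshape(i * j, k), B)   # single BLAS call
    return out.reshape(i, j, B.shape[1])

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 4, 5))
B = rng.standard_normal((5, 6))
assert np.allclose(dot_via_reshape(A, B), np.einsum("ijk,kl->ijl", A, B))
```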
einsum2.batched_dot is a parallel implementation of numpy.matmul.
More specifically, for 3-tensors a and b,
einsum2.batched_dot(a, b)
computes numpy.matmul(a,b) in parallel.
batched_dot is currently only implemented for a and b that are 3-tensors.
If the leading dimension has length 1, then batched_dot will use numpy.dot
to take advantage of BLAS.
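The following pure-numpy snippet illustrates what batched_dot computes, using numpy.matmul as the reference (it does not exercise einsum2's parallelism, only the semantics):

```python
import numpy as np

# For 3-tensors, matmul (and hence batched_dot) computes, for each index p
# along the leading axis, out[p] = a[p] @ b[p].
rng = np.random.default_rng(3)
a = rng.standard_normal((7, 3, 4))
b = rng.standard_normal((7, 4, 5))

expected = np.stack([a[p] @ b[p] for p in range(a.shape[0])])

assert expected.shape == (7, 3, 5)
assert np.allclose(np.matmul(a, b), expected)
```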
einsum2.einsum1 is a convenience function for einsum operations on a single array.
In particular,
einsum2.einsum1(in_arr, in_sublist, out_sublist)
returns an array out_arr that is derived from in_arr, but with subscripts given by
out_sublist. All subscripts of in_sublist not in out_sublist
are summed out, and then the axes of in_arr are rearranged to match out_sublist.
As with einsum2, arbitrary hashable keys are allowed to label the subscripts in einsum1.
Also like einsum2, repeated subscripts (i.e. diagonal operations) are not supported.
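numpy.einsum's single-array sublist form has the same semantics, so it can illustrate what a call like einsum1(in_arr, [0, 1, 2], [2, 0]) computes (the shape below is arbitrary):

```python
import numpy as np

# Subscripts absent from the output are summed out, then the remaining
# axes are permuted to match the output order.
rng = np.random.default_rng(4)
x = rng.standard_normal((2, 3, 4))     # subscripts 0, 1, 2

# einsum1(x, [0, 1, 2], [2, 0]) would compute the same thing:
out = np.einsum(x, [0, 1, 2], [2, 0])  # sum out axis 1, then transpose

assert out.shape == (4, 2)
assert np.allclose(out, x.sum(axis=1).T)
```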