[WIP] Allow pure numpy array (not dask array) as inputs #90
daxiongshu wants to merge 4 commits into dask:main from
Conversation
@mrocklin @pentschev I just added one test for now. If it is OK, could you please suggest which other tests I should add?
pentschev left a comment
@daxiongshu I added a few requests to make the code easier and more Dask-like, also a few questions on things that aren't clear to me. Please take a look when you have a moment.
- from dask_glm.utils import dot, normalize, scatter_array, get_distributed_client
+ from dask_glm.utils import dot, normalize, scatter_array, get_distributed_client, safe_zeros_like
Where is safe_zeros_like coming from? I suppose you wanted from dask.array.utils import zeros_like_safe instead, from https://github.com/dask/dask/blob/48a4d4a5c5769f6b78881adeb1b3973a950e5f43/dask/array/utils.py#L350
if isinstance(X, da.Array):
    return np.zeros_like(X._meta, shape=shape)
return np.zeros_like(X, shape=shape)
- if isinstance(X, da.Array):
-     return np.zeros_like(X._meta, shape=shape)
- return np.zeros_like(X, shape=shape)
+ return zeros_like_safe(meta_from_array(X))
You'll also need to add from dask.array.utils import meta_from_array at the top.
Sorry for the late reply, I think I might have misunderstood our other conversation. #89 (comment)
This PR intends to enable dask-glm to deal with pure NumPy arrays. Please let me know if that's not the case and dask-glm should only accept Dask arrays.
dask-glm/dask_glm/algorithms.py
Lines 100 to 101 in 7b2f85f
Let's say the input X is a pure NumPy or CuPy array, not a Dask array. Then beta = np.zeros_like(X._meta) raises an error. The safe_zeros_like (bad naming) I implemented checks whether X is a pure NumPy/CuPy array or a Dask array and returns a pure NumPy/CuPy array either way. In contrast, da.utils.zeros_like_safe returns a Dask array. In this case beta should be a pure NumPy/CuPy array.
Let me know if this clears things up. Thank you!
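A minimal sketch of the dispatch described above. Names are hypothetical: FakeDaskArray stands in for dask.array.Array so the example stays NumPy-only, and safe_zeros_like mirrors the PR's helper as described, not its exact implementation.

```python
import numpy as np

class FakeDaskArray:
    """Stand-in for dask.array.Array: carries a zero-size `_meta` array
    recording the underlying chunk type (for illustration only)."""
    def __init__(self, meta, shape):
        self._meta = meta
        self.shape = shape

def safe_zeros_like(X, shape):
    # Dask arrays expose `_meta`; building zeros from it yields a concrete
    # array of the chunk type. Plain arrays are zero-filled directly.
    if hasattr(X, "_meta"):
        return np.zeros_like(X._meta, shape=shape)
    return np.zeros_like(X, shape=shape)

X_np = np.ones((4, 3))
X_dask_like = FakeDaskArray(np.empty((0, 0)), shape=(4, 3))

beta_np = safe_zeros_like(X_np, shape=(3,))        # concrete ndarray
beta_da = safe_zeros_like(X_dask_like, shape=(3,)) # also a concrete ndarray
```

Either way the returned beta is a concrete zero-filled array, which is the behavior the helper is after.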
> The safe_zeros_like (bad naming) I implemented will check if X is a pure numpy/cupy array or a dask array and return a pure numpy/cupy array.
That's exactly what meta_from_array does. It will return an array of the type _meta has (i.e., chunk type), so if the input is a NumPy array or a Dask array backed by NumPy, the result is an empty numpy.ndarray, and if the input is a CuPy array or a Dask array backed by CuPy, the result is an empty cupy.ndarray.
> In contrast, da.utils.zeros_like_safe returns a dask array.
That isn't necessarily true; it will only return a Dask array if the reference array is a Dask array. Because we're getting the underlying chunk type with meta_from_array, the resulting array will be either a NumPy or a CuPy array.
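To illustrate the point, here is a rough NumPy-only sketch of what meta_from_array does (the real helper in dask.array.utils also handles dtype/ndim overrides and other chunk types; this is an assumption-laden approximation, not its actual code):

```python
import numpy as np

def meta_from_array_sketch(x):
    # Rough equivalent of dask.array.utils.meta_from_array: return an empty
    # (zero-size) array of the underlying chunk type. Dask arrays carry it
    # in `_meta`; for a concrete array we take a 0-size slice of it.
    meta = getattr(x, "_meta", x)
    return meta[tuple(slice(0, 0) for _ in range(meta.ndim))]

X = np.arange(6.0).reshape(2, 3)
meta = meta_from_array_sketch(X)
# Zeros built from the meta are concrete NumPy arrays, never Dask arrays.
beta = np.zeros_like(meta, shape=(X.shape[1],))
```

So zeros built on top of the meta come out as concrete NumPy (or CuPy) arrays regardless of whether the input was wrapped in Dask.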
Aha, that works! I will make the changes.
@dispatch(object)
def add_intercept(X):
    return np.concatenate([X, np.ones_like(X, shape=(X.shape[0], 1))], axis=1)
- return np.concatenate([X, np.ones_like(X, shape=(X.shape[0], 1))], axis=1)
+ return np.concatenate([X, ones_like_safe(X, shape=(X.shape[0], 1))], axis=1)
Also needs from dask.array.utils import ones_like_safe.
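For reference, the concrete-array path of add_intercept can be sketched with plain NumPy (np.ones_like with the shape= keyword requires NumPy >= 1.17, which is, as I understand it, the compatibility gap ones_like_safe papers over):

```python
import numpy as np

def add_intercept(X):
    # Append a column of ones so the estimator can fit an intercept term.
    # np.ones_like(..., shape=...) keeps X's dtype while changing the shape.
    ones = np.ones_like(X, shape=(X.shape[0], 1))
    return np.concatenate([X, ones], axis=1)

X = np.zeros((4, 2))
Xi = add_intercept(X)
```

The result gains one trailing column of ones while the original feature columns are unchanged.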
X, y = dask.compute(X, y)
lr = LogisticRegression(fit_intercept=fit_intercept)
lr.fit(X, y)
lr.predict(X)
I don't think I understand this test. When is is_numpy the case in a real-world example? IOW, will you ever have X and y be pure NumPy arrays, such that it's worth testing with LogisticRegression? I assumed you'd only have Dask arrays (backed by Sparse or not).
That's exactly what I tried to do, where both X and y are pure numpy/cupy arrays. Is that a feature we want? The current dask-glm only accepts dask arrays.
I don't think that's a feature we need to support explicitly; I believe anybody using dask-glm would want to use Dask arrays rather than pure NumPy/CuPy ones.
Currently dask_glm.estimators only accepts dask.array as inputs, due to the line below and other places where ._meta is accessed without checking the data type.
dask-glm/dask_glm/estimators.py
Line 67 in 7b2f85f
dask-glm/dask_glm/utils.py
Lines 120 to 124 in 7b2f85f
Click to see the example code and error
Code: (collapsed)
Error: (collapsed)
This PR allows NumPy arrays (not Dask arrays backed by NumPy) as inputs directly.
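For context, a minimal reproduction of the failure mode being fixed: plain NumPy arrays have no _meta attribute, so code paths that access X._meta unconditionally break on them.

```python
import numpy as np

X = np.ones((4, 3))

# Dask arrays carry `_meta` (an empty array describing the chunk type);
# plain ndarrays do not, so unconditional `X._meta` access fails on them.
try:
    X._meta
    meta_access_failed = False
except AttributeError:
    meta_access_failed = True
```

Guarding that access (or routing through meta_from_array, as suggested above) is what lets concrete arrays pass through.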