[WIP] Allow pure numpy array (not dask array) as inputs #90
daxiongshu wants to merge 4 commits into dask:main from
Conversation
@mrocklin @pentschev I just added one test for now. If it is OK, could you please suggest which other tests I should add?
pentschev left a comment
@daxiongshu I added a few requests to make the code easier and more Dask-like, also a few questions on things that aren't clear to me. Please take a look when you have a moment.
- from dask_glm.utils import dot, normalize, scatter_array, get_distributed_client
+ from dask_glm.utils import dot, normalize, scatter_array, get_distributed_client, safe_zeros_like
Where is safe_zeros_like coming from? I suppose you wanted from dask.array.utils import zeros_like_safe instead, from https://github.com/dask/dask/blob/48a4d4a5c5769f6b78881adeb1b3973a950e5f43/dask/array/utils.py#L350
if isinstance(X, da.Array):
    return np.zeros_like(X._meta, shape=shape)
return np.zeros_like(X, shape=shape)
- if isinstance(X, da.Array):
-     return np.zeros_like(X._meta, shape=shape)
- return np.zeros_like(X, shape=shape)
+ return zeros_like_safe(meta_from_array(X))
You'll also need to add from dask.array.utils import meta_from_array at the top.
Sorry for the late reply, I think I might have misunderstood our other conversation. #89 (comment)
This PR intends to enable dask-glm to deal with pure NumPy arrays. Please let me know if that's not the case and dask-glm should only accept Dask arrays.
dask-glm/dask_glm/algorithms.py
Lines 100 to 101 in 7b2f85f
Let's say the input X is a pure NumPy or CuPy array, not a Dask array. Then beta = np.zeros_like(X._meta) raises an error. The safe_zeros_like (bad naming) I implemented checks whether X is a pure NumPy/CuPy array or a Dask array and returns a pure NumPy/CuPy array either way. In contrast, da.utils.zeros_like_safe returns a Dask array. In this case beta should be a pure NumPy/CuPy array.
Let me know if this clears things up. Thank you!
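A minimal sketch of the dispatch described above. Names are hypothetical: FakeDaskArray stands in for dask.array.Array so the example stays NumPy-only, and safe_zeros_like mirrors the PR's helper as described, not its exact implementation.

```python
import numpy as np

class FakeDaskArray:
    """Stand-in for dask.array.Array: carries a zero-size `_meta` array
    recording the underlying chunk type (for illustration only)."""
    def __init__(self, meta, shape):
        self._meta = meta
        self.shape = shape

def safe_zeros_like(X, shape):
    # Dask arrays expose `_meta`; building zeros from it yields a concrete
    # array of the chunk type. Plain arrays are zero-filled directly.
    if hasattr(X, "_meta"):
        return np.zeros_like(X._meta, shape=shape)
    return np.zeros_like(X, shape=shape)

X_np = np.ones((4, 3))
X_dask_like = FakeDaskArray(np.empty((0, 0)), shape=(4, 3))

beta_np = safe_zeros_like(X_np, shape=(3,))        # concrete ndarray
beta_da = safe_zeros_like(X_dask_like, shape=(3,)) # also a concrete ndarray
```

Either way the returned beta is a concrete zero-filled array, which is the behavior the helper is after.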
> The safe_zeros_like (bad naming) I implemented will check if X is a pure numpy/cupy array or a dask array and return a pure numpy/cupy array.
That's exactly what meta_from_array does. It will return an array of the type _meta has (i.e., chunk type), so if the input is a NumPy array or a Dask array backed by NumPy, the result is an empty numpy.ndarray, and if the input is a CuPy array or a Dask array backed by CuPy, the result is an empty cupy.ndarray.
> In contrast, da.utils.zeros_like_safe returns a dask array.
That isn't necessarily true; it will only return a Dask array if the reference array is a Dask array. Because we're getting the underlying chunk type with meta_from_array, the resulting array will be either a NumPy or a CuPy array.
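To illustrate the point, here is a rough NumPy-only sketch of what meta_from_array does (the real helper in dask.array.utils also handles dtype/ndim overrides and other chunk types; this is an assumption-laden approximation, not its actual code):

```python
import numpy as np

def meta_from_array_sketch(x):
    # Rough equivalent of dask.array.utils.meta_from_array: return an empty
    # (zero-size) array of the underlying chunk type. Dask arrays carry it
    # in `_meta`; for a concrete array we take a 0-size slice of it.
    meta = getattr(x, "_meta", x)
    return meta[tuple(slice(0, 0) for _ in range(meta.ndim))]

X = np.arange(6.0).reshape(2, 3)
meta = meta_from_array_sketch(X)
# Zeros built from the meta are concrete NumPy arrays, never Dask arrays.
beta = np.zeros_like(meta, shape=(X.shape[1],))
```

So zeros built on top of the meta come out as concrete NumPy (or CuPy) arrays regardless of whether the input was wrapped in Dask.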
Aha, that works! I will make the changes.
@dispatch(object)
def add_intercept(X):
    return np.concatenate([X, np.ones_like(X, shape=(X.shape[0], 1))], axis=1)
- return np.concatenate([X, np.ones_like(X, shape=(X.shape[0], 1))], axis=1)
+ return np.concatenate([X, ones_like_safe(X, shape=(X.shape[0], 1))], axis=1)
Also needs from dask.array.utils import ones_like_safe.
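For reference, the concrete-array path of add_intercept can be sketched with plain NumPy (np.ones_like with the shape= keyword requires NumPy >= 1.17, which is, as I understand it, the compatibility gap ones_like_safe papers over):

```python
import numpy as np

def add_intercept(X):
    # Append a column of ones so the estimator can fit an intercept term.
    # np.ones_like(..., shape=...) keeps X's dtype while changing the shape.
    ones = np.ones_like(X, shape=(X.shape[0], 1))
    return np.concatenate([X, ones], axis=1)

X = np.zeros((4, 2))
Xi = add_intercept(X)
```

The result gains one trailing column of ones while the original feature columns are unchanged.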
X, y = dask.compute(X, y)
lr = LogisticRegression(fit_intercept=fit_intercept)
lr.fit(X, y)
lr.predict(X)
I don't think I understand this test. When is is_numpy the case in a real-world example? IOW, will you ever have X and y be pure NumPy arrays, such that it's worth testing with LogisticRegression? I assumed you'd only have Dask arrays (backed by Sparse or not).
That's exactly what I tried to do, where both X and y are pure numpy/cupy arrays. Is that a feature we want? The current dask-glm only accepts dask arrays.
I don't think that's a feature we need to support explicitly; I believe anybody using dask-glm would want to use Dask arrays rather than pure NumPy/CuPy ones.
Currently dask_glm.estimators only accepts dask.array as inputs, due to the line below and other places where ._meta is accessed without checking the data type.
dask-glm/dask_glm/estimators.py
Line 67 in 7b2f85f
dask-glm/dask_glm/utils.py
Lines 120 to 124 in 7b2f85f
Click to see the example code and error
Code: (collapsed)
Error: (collapsed)
This PR allows NumPy arrays (not Dask arrays backed by NumPy) as inputs directly.
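For context, a minimal reproduction of the failure mode being fixed: plain NumPy arrays have no _meta attribute, so code paths that access X._meta unconditionally break on them.

```python
import numpy as np

X = np.ones((4, 3))

# Dask arrays carry `_meta` (an empty array describing the chunk type);
# plain ndarrays do not, so unconditional `X._meta` access fails on them.
try:
    X._meta
    meta_access_failed = False
except AttributeError:
    meta_access_failed = True
```

Guarding that access (or routing through meta_from_array, as suggested above) is what lets concrete arrays pass through.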