Conversation
Still working on a change to the strides section; it's a bit more complicated than it seems at first because Arrow does not have a logical type system, and "shape + permutation" can mean two different things. Edit: done in commit 53ad802
> Nullability exists only at the tensor level: within a tensor array, an individual tensor may be null, but elements within a tensor may not be. This is because tensor operations like matmul cannot be efficiently implemented over nullable elements, and most tensor libraries (e.g., PyTorch) do not support per-element nulls either.
commenting here but maybe it should go on the previous PR?
IDK how Arrow does it, but I don't think that's necessarily true.
Most vectorized compute just runs through null values that are zeroed out. IDK how you'd matmul the validity itself, but I think that's a reasonable thing.
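To sketch what "running through zeroed-out nulls" could look like (a hypothetical illustration in NumPy, not how any particular Arrow kernel is implemented): the values buffer and the validity mask travel separately, and masked slots are forced to zero so they contribute nothing to the dot products.

```python
import numpy as np

# Values buffer plus a separate validity mask; False marks a null slot.
values = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
valid = np.array([[True, False],
                  [True, True]])

# Zero out the null slots so they drop out of the sums in the matmul.
zeroed = np.where(valid, values, 0.0)
result = zeroed @ np.array([1.0, 1.0])  # row sums over valid elements only
```

Whether the result's own validity should be the AND/OR of the input masks is exactly the open question here; this sketch just ignores it.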
I think interpretation of NULLs is context dependent. If NULL means "there was no data observed at this position" and you're doing a weighted sum of the features, treating NULLs as zero is probably the right choice. The result is indeed the count of what you observed. You can't infer anything about things you did not observe.
On the other hand, if NULL means "there is some data here but for technical reasons it was unrecoverable" and you're doing a linear regression, you probably want to replace NULL by a mean value over some dimension(s). I don't have a good linear regression example, but suppose you flip one hundred coins and record heads as 1 and tails as 0. Suppose further that you lose 10 coins before observing them. If you compute the sum of this vector with NULL as zeros you'll conclude the coins are tails-biased! If you compute the sum of this vector with NULL as the sample mean, you'll have an unbiased estimate of the coin's heads/tails probability.
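The coin example is easy to check numerically. The sketch below (my construction, using NaN as a stand-in for NULL) flips 100 coins, "loses" 10 of them, and compares the zero-filled estimate against mean imputation:

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 fair coin flips: heads = 1.0, tails = 0.0
flips = rng.integers(0, 2, size=100).astype(float)

# Lose 10 coins before observing them (NaN stands in for NULL).
lost = rng.choice(100, size=10, replace=False)
observed = flips.copy()
observed[lost] = np.nan

# Treating NULL as zero drags the estimate toward tails.
biased_mean = np.nan_to_num(observed, nan=0.0).mean()

# Imputing the sample mean of the observed flips is unbiased.
sample_mean = np.nanmean(observed)
imputed_mean = np.where(np.isnan(observed), sample_mean, observed).mean()
```

Here `biased_mean` is strictly below `imputed_mean` (any observed head makes the zero-fill undercount), while `imputed_mean` equals the mean of what was actually observed.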
IMO, matmul, sum, etc. should only be defined on tensors with non-nullable elements. I suppose null elements themselves are fine, if they're even representable in torch (I think they are not?).
NumPy is able to represent them when you use the catch-all object dtype, but if you request a primitive type it converts them to NaNs:
```python
In [8]: np.array([1., None])
Out[8]: array([1.0, None], dtype=object)

In [9]: np.array([1., None], dtype=float)
Out[9]: array([ 1., nan])

In [10]: np.array([1., None], dtype=np.dtype('f4'))
Out[10]: array([ 1., nan], dtype=float32)
```
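And once a null has decayed to NaN, it is no longer a maskable slot: a single NaN poisons every reduction it touches, which is one concrete reason per-element nulls are awkward for kernels like matmul. A small NumPy sketch (my illustration, not from the RFC):

```python
import numpy as np

# One NaN in the left operand...
a = np.array([[1.0, np.nan],
              [3.0, 4.0]])
b = np.ones((2, 2))

# ...contaminates every output element whose dot product touches it.
c = a @ b  # first row becomes all-NaN; second row is unaffected
```

Unlike a validity bitmap, there is no way to recover "the valid part" of the first row after the fact.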
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
> Physical shape favors Arrow compatibility and simpler stride math. Logical shape favors NumPy/PyTorch compatibility and is arguably more intuitive for our users since Vortex has a logical type system.
FWIW, I think torch/numpy integration matters more for tensors than Arrow compatibility. There's no linear algebra library that natively works on Arrow arrays.
I agree, and the conversion will be cheap regardless
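The "shape + permutation can mean two different things" ambiguity from the strides discussion can be made concrete with a small sketch (a hypothetical helper, not Vortex's or Arrow's actual API): given a logical shape, row-major strides are the running products of the trailing dimensions, and permuting the shape is not the same operation as permuting the strides.

```python
# Row-major (C-order) strides, in elements, for a given shape.
def row_major_strides(shape):
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

logical_shape = (2, 3, 4)
strides = row_major_strides(logical_shape)  # [12, 4, 1]

# The same permutation applied two different ways:
perm = (2, 0, 1)
permuted_shape = tuple(logical_shape[p] for p in perm)    # a new layout
permuted_strides = tuple(strides[p] for p in perm)        # a transposed view
```

"Shape permuted, strides recomputed" describes a physically relaid-out buffer, while "shape and strides permuted together" describes a zero-copy transpose of the original buffer; the two describe different memory, which is why the RFC has to pin down which one "shape + permutation" means.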
Some revisions from #24
This also moves the RFC into the `accepted` directory. I'll just keep this named `tensor` since future RFCs can be called variable or sparse tensors. The only change that was not directly because of the comments on the last PR was a change to the strides section, because some of the description was incorrect.