Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
**danking** left a comment:
Seems great overall. I'm excited for tensors to land in Vortex.
> ## Summary
>
> We would like to add a Tensor type to Vortex as an extension over `FixedSizeList`. This RFC proposes …
Perhaps worth explicitly calling this FixedShapeTensor, since it's not unreasonable to also want variable-shape tensors (but of fixed dimension). For example, in genetics, we often want to take the ~100M rows of genetic variants, collapse them into ~30K genes, and, for each gene, construct a matrix of genotypes and run a regression. Those matrices always have the same dimensionality (2), but their shape varies: the sample axis is always the same, N_SAMPLES, but the genetic variant axis depends on the size of the gene, which varies from a few hundred base pairs (SRY) to 30,000 base pairs (TITIN).
In the future, I can imagine we'll have both `FixedSizeTensor<f32, (a, b, c)>` and `Tensor<f32, 3>` (names tbd).
Agreed, I think we basically want to replicate both of Arrow's fixed and variable size tensors.
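The distinction being discussed could be sketched roughly like this in Rust (type and field names here are placeholders for illustration, not anything in Vortex or the RFC): a fixed-shape tensor fixes one shape for every element in the array, while a variable-shape tensor fixes only the rank, e.g. via a const generic.

```rust
/// Placeholder sketch: every tensor in the array shares one shape.
struct FixedShapeTensor<T> {
    shape: Vec<usize>, // e.g. (a, b, c), identical across the array
    data: Vec<T>,
}

/// Placeholder sketch: rank is fixed at NDIM, but each tensor's shape may vary.
struct VariableShapeTensor<T, const NDIM: usize> {
    shape: [usize; NDIM], // e.g. [N_SAMPLES, gene_length]
    data: Vec<T>,
}

/// Number of elements implied by a shape.
fn num_elements(shape: &[usize]) -> usize {
    shape.iter().product()
}

fn main() {
    let fixed = FixedShapeTensor::<f32> { shape: vec![2, 3, 4], data: vec![0.0; 24] };
    assert_eq!(num_elements(&fixed.shape), fixed.data.len());

    // Two rank-2 tensors whose gene axis differs in length (per the genetics example).
    let sry = VariableShapeTensor::<f32, 2> { shape: [4, 300], data: vec![0.0; 1_200] };
    let titin = VariableShapeTensor::<f32, 2> { shape: [4, 30_000], data: vec![0.0; 120_000] };
    assert_eq!(sry.shape.len(), titin.shape.len()); // same rank, different shapes
}
```

This mirrors Arrow's split between its FixedShapeTensor and VariableShapeTensor canonical extension types: the former stores the shape once in metadata, the latter stores a shape per element.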
> ### Element Type
>
> We restrict tensor element types to `Primitive` and `Decimal`. Tensors are fundamentally about dense …
Why Decimal? That seems bizarre to me. Are there any fast implementations of matmul for arrays of Decimals?
Honestly, support for fast matmul of fixed-point types was also pretty garbage last time I looked. Does anyone need fixed-point matrices?
Yeah if we're going to restrict it, let's just say Primitive for now.
> #### Tensors in Vortex
>
> In the current version of Vortex, there are two ways to represent fixed-shape tensors using the `FixedSizeList` `DType`, and neither seems satisfactory.
Am I allowed to implement a `SparseTensorArray` whose dtype is Tensor but whose layout is not a `FixedSizeList` of the right size?
> ### Validity
>
> We define two layers of nullability for tensors: the tensor itself may be null (within a tensor array), and individual elements within a tensor may be null. However, we do not support nulling out …
Why allow the elements to be null?

IMO, the main reason to use a Tensor type is so that you can define operations like matmul, and I worry that we can't efficiently implement matmul on a nullable element type like `f32?`.
FWIW: I feel pretty strongly that we shouldn't support nullable elements of a tensor.
It's always something we can relax later, so I'm in favor of restricting this now.
> … integers, ellipses, boolean arrays, etc.). It supports operations like canonicalization, shape inference, and re-indexing onto array chunks. We will want to implement tensor compute expressions in Vortex that are similar to the operations ndindex provides — for example, computing the result shape of a slice or translating a logical index into a physical offset.
Also worth noting xarray. That was where I first encountered the idea of named dimensions. It also has a notion of "coordinates" which are "marginal" arrays. For example, you might have a matrix of temperature values on the surface of the earth. The rows and columns of that matrix could have coordinate values that indicate the latitudes and longitudes associated with the rows and columns.
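The "translating a logical index into a physical offset" operation the RFC mentions can be sketched in a few lines of Rust (function names are illustrative, not Vortex APIs), assuming row-major contiguous storage:

```rust
/// Row-major strides for a shape: stride[i] is the product of all dims after i.
fn row_major_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1; shape.len()];
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    strides
}

/// Translate a logical index like [i, j, k] into a flat physical offset.
fn physical_offset(index: &[usize], strides: &[usize]) -> usize {
    index.iter().zip(strides).map(|(i, s)| i * s).sum()
}

fn main() {
    let shape = [2, 3, 4];
    let strides = row_major_strides(&shape);
    assert_eq!(strides, vec![12, 4, 1]);
    // Element [1, 2, 3] of a 2x3x4 tensor lives at offset 1*12 + 2*4 + 3*1 = 23.
    assert_eq!(physical_offset(&[1, 2, 3], &strides), 23);
}
```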
> ### Academic work
>
> - **TACO (Tensor Algebra Compiler)** separates the tensor storage format from the tensor program. …
Taco is really great work! I guess I think of it more as a system for generating fast matmul kernels given the physical layout of two arrays.
Yeah, could be interesting to implement a tensor array that uses these sparse layouts though
> … array), and individual elements within a tensor may be null. However, we do not support nulling out entire sub-dimensions of a tensor (e.g., marking a whole row or slice as null).
>
> The validity bitmap is flat (one bit per element) and follows the same contiguous layout as the …
I'm not sure what this sentence is saying? It sounds like tensors store additional validity on top of FSL. But actually we're just saying a tensor uses FSL as its storage type?
```rust
/// Optional names for each dimension. Each name corresponds to a dimension in the `shape`.
///
/// If names exist, there must be an equal number of names to dimensions.
dim_names: Option<Vec<String>>,
```
`Vec<Option<String>>`? Not sure...
We want to do it this way since this is what Arrow has, and I also personally do not want to deal with some dimensions being named and others not.
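A small sketch of the invariant under discussion (struct and function names here are illustrative, not the RFC's): with `Option<Vec<String>>`, naming is all-or-nothing, so "some dimensions named, others not" is simply unrepresentable, and the only thing left to validate is the count.

```rust
/// Illustrative metadata sketch (not the RFC's actual struct).
struct TensorMeta {
    shape: Vec<usize>,
    dim_names: Option<Vec<String>>,
}

/// Either no dimension is named, or there is exactly one name per dimension.
fn names_are_valid(meta: &TensorMeta) -> bool {
    match &meta.dim_names {
        None => true, // fully unnamed is fine
        Some(names) => names.len() == meta.shape.len(),
    }
}

fn main() {
    let named = TensorMeta {
        shape: vec![2, 3],
        dim_names: Some(vec!["row".to_string(), "col".to_string()]),
    };
    assert!(names_are_valid(&named));

    let unnamed = TensorMeta { shape: vec![2, 3], dim_names: None };
    assert!(names_are_valid(&unnamed));

    // A mismatched count is the only invalid case left to check.
    let mismatched = TensorMeta { shape: vec![2, 3], dim_names: Some(vec!["row".to_string()]) };
    assert!(!names_are_valid(&mismatched));
}
```

By contrast, `Vec<Option<String>>` would force every consumer to handle partially named dimensions.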
[Rendered](https://github.com/vortex-data/rfcs/blob/ct/tensor-revise/accepted/0024-tensor.md)

Some revisions from #24. This also moves the RFC into the `accepted` directory. I'll keep this named `tensor`, since future RFCs can be called variable or sparse tensors. The only change that was not directly because of comments on the last PR was a change to the strides section, because some of its description was incorrect.

> We would like to add a Tensor type to Vortex as an extension over `FixedSizeList`. This RFC proposes the design of a fixed-shape tensor with contiguous backing memory.

Edit: We merged this early, so we opened a follow-up PR: #25