Support categorical data with many categories #29

- [x] Refactor `training.py` to use `Table` objects.
- [ ] Refactor `serving.py` to use `Table` objects.
  - [ ] Support `Table` objects for multi-row inputs.
  - [ ] Support `Table` objects for single-row inputs. (Is this efficient?)
- [ ] Support sparse `categorical` data in `Table` objects.
- [ ] Generalize `format.py` to optionally create sparse `categorical` data.
- [ ] Support sparse `categorical` data in training.
- [ ] Support sparse `categorical` data in serving.

## Why?

Categorical features with many categories are useful for modeling random effects, e.g. the zip code of a voter or the clinician ID of a medical diagnosis.

Also, supporting two feature types (initially two very similar ones) will make it much easier to support a third type, such as real/normal data in #22.

## How?

The main bottleneck in TreeCat is the space and time used to process internal ragged data:
- categoricals are represented as one-hot vectors in a ragged array, so each cell takes `O(#cats)` space
- these one-hot vectors are processed cell-by-cell in both training and serving
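
To make the cost concrete, here is a rough sketch of the two encodings for a single row. It is illustrative only: `ragged_offsets`, `one_hot_row`, and `sparse_row` are hypothetical helpers, not TreeCat functions, and using `-1` to mark an unobserved cell is an assumption.

```python
# Hypothetical illustration; these names are not part of TreeCat's API.


def ragged_offsets(num_cats):
    """Return the begin/end offsets of each feature in a flat ragged row."""
    offsets = [0]
    for n in num_cats:
        offsets.append(offsets[-1] + n)
    return offsets


def one_hot_row(values, num_cats):
    """Encode a row as concatenated one-hot vectors: sum(num_cats) cells."""
    offsets = ragged_offsets(num_cats)
    row = [0] * offsets[-1]
    for f, value in enumerate(values):
        if value is not None:  # None marks an unobserved cell.
            row[offsets[f] + value] = 1
    return row


def sparse_row(values):
    """Encode a row as one index per feature: len(values) cells.

    Assumption: -1 marks an unobserved cell.
    """
    return [-1 if value is None else value for value in values]


num_cats = [3, 1000, 2]  # The middle feature could be, e.g., a zip code.
values = [2, 417, None]  # The last feature is unobserved.
dense = one_hot_row(values, num_cats)
sparse = sparse_row(values)
print(len(dense), len(sparse))
```

The dense row grows with the total number of categories, while the sparse row grows only with the number of features, which is the saving this issue is after.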

To reduce these costs, the internal format can be split from one datatype (multinomial) into two (multinomial and categorical), where each categorical cell is restricted to zero or one observation. Some plumbing already exists to pass a `feature_types` vector to the trainer, and to represent internal data as a `Table` object with heterogeneous data.
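
One way such a split might look is sketched below. This is not the existing `Table` class: `SketchTable`, `make_table`, the type-name strings, and the `-1` sentinel are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical type tags; the real `feature_types` encoding may differ.
MULTINOMIAL = 'multinomial'
CATEGORICAL = 'categorical'

# Hypothetical stand-in for a heterogeneous table, one column per feature.
SketchTable = namedtuple('SketchTable', ['feature_types', 'columns'])


def make_table(feature_types, num_cats, rows):
    """Build per-feature columns whose layout depends on the feature type.

    Multinomial cells hold a list of observed category indices and are stored
    as dense count vectors. Categorical cells hold a single index or None and
    are stored as one integer (assumption: -1 means unobserved).
    """
    columns = []
    for f, ftype in enumerate(feature_types):
        column = []
        for row in rows:
            cell = row[f]
            if ftype == CATEGORICAL:
                column.append(-1 if cell is None else cell)
            elif ftype == MULTINOMIAL:
                counts = [0] * num_cats[f]
                for value in cell:
                    counts[value] += 1
                column.append(counts)
            else:
                raise ValueError('unknown feature type: {}'.format(ftype))
        columns.append(column)
    return SketchTable(tuple(feature_types), columns)


rows = [
    ([0, 0, 2], 417),  # Three multinomial observations, one categorical.
    ([], None),        # Nothing observed in either feature.
]
table = make_table([MULTINOMIAL, CATEGORICAL], [3, 1000], rows)
print(table.columns[0], table.columns[1])
```

Here the 1000-category feature costs one integer per row instead of a 1000-cell one-hot vector, while the multinomial feature keeps its count representation.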
