The sequifier preprocess command transforms raw tabular data (CSV or Parquet) into the specific sequence format required for training causal transformer models. It handles windowing, data splitting (train/validation/test), categorical encoding, and numerical standardization.
```bash
sequifier preprocess --config-path configs/preprocess.yaml
```

The configuration is defined in a YAML file (e.g., `preprocess.yaml`). Below are the available fields, their requirements, and their functions.
| Field | Type | Mandatory | Default | Description |
|---|---|---|---|---|
| `project_root` | str | Yes | - | The root directory of your Sequifier project. Usually `.` |
| `data_path` | str | Yes | - | Path to the raw input file or folder. |
| `read_format` | str | No | `csv` | Format of input data (`csv`, `parquet`). |
| `write_format` | str | No | `parquet` | Format of output data (`csv`, `parquet`, `pt`). |
| `merge_output` | bool | No | `true` | Whether to merge split files into single files or keep them sharded. |
| `continue_preprocessing` | bool | No | `false` | If `true`, resumes a job that was interrupted (requires folder input). |
**Important constraint on `write_format`:**

- If `write_format` is `pt` (PyTorch tensors), `merge_output` must be `false`. This sharded format is required for distributed training on large datasets.
- If `write_format` is `csv` or `parquet`, `merge_output` must be `true`.
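To make the constraint concrete, here is a minimal, illustrative `preprocess.yaml` for the sharded `pt` path. The input path and split values are assumptions for the example, not defaults:

```yaml
project_root: .
data_path: data/raw_events.parquet   # hypothetical input file
read_format: parquet
write_format: pt                     # sharded PyTorch tensor output
merge_output: false                  # mandatory when write_format is pt
seq_length: 48
split_ratios: [0.8, 0.1, 0.1]
```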
| Field | Type | Mandatory | Default | Description |
|---|---|---|---|---|
| `selected_columns` | list[str] | No | null | A specific list of columns to process. If null, all columns (except metadata) are processed. |
| `max_rows` | int | No | null | Limits processing to the first N rows. Useful for rapid debugging. |
| `metadata_config_path` | Optional[str] | No | null | Use a preexisting metadata config for tokenizing discrete columns and standardizing real-valued columns. |
| `use_precomputed_maps` | list[str] | No | null | If not null, enforces the use of precomputed ID maps for the variables in the list. |
| Field | Type | Mandatory | Default | Description |
|---|---|---|---|---|
| `seq_length` | int | Yes | - | The length of the context window (history) fed into the model. |
| `split_ratios` | list[float] | Yes | - | Proportions for data splits (e.g., `[0.8, 0.1, 0.1]` for train/val/test). Must sum to 1.0. |
| `stride_by_split` | list[int] | No | `[seq_length]*N` | The step size used to slide the window for each split. Corresponds to `split_ratios`. |
| `subsequence_start_mode` | str | No | `distribute` | Strategy for selecting start indices (`distribute` or `exact`). |
| Field | Type | Mandatory | Default | Description |
|---|---|---|---|---|
| `seed` | int | No | 1010 | Random seed for reproducibility. |
| `n_cores` | int | No | max cores | Number of CPU cores to use for parallel processing. |
| `batches_per_file` | int | No | 1024 | Only used when `write_format: pt`. Controls how many sequences are packed into one `.pt` file. |
| `process_by_file` | bool | No | `true` | Memory optimization. If `true`, processes one input file at a time. |
- Choose `parquet` (default) if your dataset is small to medium (fits in RAM) and you want to inspect the preprocessed data easily with standard tools like Pandas or Polars. This produces one file per split (e.g., `data-split0.parquet`).
- Choose `pt` if your dataset is massive (larger than RAM) or you intend to use distributed training (multi-GPU). This format saves the data as many small PyTorch tensor files, which allows the `SequifierDatasetFromFolderLazy` dataset to load data on demand without clogging memory.
The stride (`stride_by_split`) controls data augmentation and redundancy.
- **Stride = `seq_length` (non-overlapping):** The model sees every data point exactly once as a target. Training is faster, but the model might miss patterns that cross the window boundary.
- **Stride = 1 (maximum overlap):** Maximizes data volume; the model sees every possible sequence. This yields the highest accuracy but significantly increases the size of the preprocessed data and the training time.
- **Hybrid approach:** It is common practice to set a large stride for the training and validation splits (indices 0 and 1) to reduce the on-disk size of the dataset, and a stride of 1 for the test split so the model is evaluated on every point in the test set. This assumes the test split proportion is small.
- Example: `stride_by_split: [24, 24, 1]` (assuming `seq_length: 48`).
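The trade-off can be quantified with a small sketch (not Sequifier's internal implementation): the number of windows a stride produces from a sequence of a given total length is

```python
def n_windows(total_length: int, seq_length: int, stride: int) -> int:
    """Number of sliding windows of size seq_length over a sequence,
    stepping by stride; windows that would overrun the end are dropped."""
    if total_length < seq_length:
        return 0
    return (total_length - seq_length) // stride + 1

# For a sequence of 1000 steps and seq_length 48:
print(n_windows(1000, 48, 48))  # non-overlapping: 20 windows
print(n_windows(1000, 48, 1))   # stride 1: 953 windows
```

so moving from stride = `seq_length` to stride = 1 multiplies the preprocessed volume by roughly a factor of `seq_length`.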
- `distribute` (default): The algorithm adjusts the start indices slightly to minimize the overlap of the final subsequence with the previous one, ensuring the data covers the full sequence length as evenly as possible. Recommended for most use cases.
- `exact`: Strictly enforces the stride. If the sequence length minus the window size isn't perfectly divisible by the stride, this will raise an error. Use this only if mathematical precision of the sliding window is strictly required by your downstream application or evaluation code.
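The difference between the two modes can be illustrated with a sketch. This is an assumption about the behavior based on the description above, not Sequifier's actual code: `exact` steps strictly by the stride and fails when the usable range doesn't divide evenly, while `distribute` spreads the start indices so the final window ends exactly at the end of the sequence:

```python
import math

def start_indices(total_length: int, seq_length: int,
                  stride: int, mode: str) -> list[int]:
    """Sketch of start-index selection for sliding windows."""
    last_start = total_length - seq_length
    if mode == "exact":
        if last_start % stride != 0:
            raise ValueError("stride does not divide the usable range evenly")
        return list(range(0, last_start + 1, stride))
    # "distribute": keep roughly the same number of windows, but spread
    # the start indices evenly so the final window ends at the sequence end
    n = math.ceil(last_start / stride) + 1
    if n == 1:
        return [0]
    return [round(i * last_start / (n - 1)) for i in range(n)]

print(start_indices(100, 48, 30, "distribute"))  # [0, 26, 52]
# start_indices(100, 48, 30, "exact") would raise, since 52 % 30 != 0
```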
By default, Sequifier dynamically builds ID maps from the data found in the input file. However, in production systems, you often need a fixed vocabulary to ensure that ID "105" always maps to "Item_X", regardless of the daily training batch.
To use a static vocabulary:
1. Create a folder `configs/id_maps/` in your project root.
2. Add JSON files named `{COLUMN_NAME}.json`.
3. The format must be a dictionary mapping values to integers starting at 2.

Reserved indices:

- `0`: reserved for unknown (padding/missing).
- `1`: reserved for other (unseen values not in your map).
- `2+`: your data.
**Example `configs/id_maps/itemId.json`:**

```json
{
  "apple": 2,
  "banana": 3,
  "cherry": 4
}
```
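A quick way to check that a hand-written map obeys these constraints is a small validation helper. This is a sketch, not part of Sequifier:

```python
def validate_id_map(id_map: dict) -> dict:
    """Check that an id_map uses unique integer IDs >= 2 (0 and 1 are reserved)."""
    ids = list(id_map.values())
    if not all(isinstance(i, int) and i >= 2 for i in ids):
        raise ValueError("IDs 0 and 1 are reserved; all IDs must be integers >= 2")
    if len(set(ids)) != len(ids):
        raise ValueError("IDs must be unique")
    return id_map

print(validate_id_map({"apple": 2, "banana": 3, "cherry": 4}))
```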
-----
## Outputs
After running `preprocess`, the following are generated:
1. **Data Files:** Located in `data/`. Depending on your configuration, these will be `[NAME]-split0.parquet` (Training), `[NAME]-split1.parquet` (Validation), etc., or folders containing `.pt` files.
2. **Metadata Config:** Located in `configs/metadata_configs/[NAME].json`.
* **Crucial:** This file contains the integer mappings for categorical variables (`id_maps`) and normalization stats for real variables (`selected_columns_statistics`).
* **Next Step:** You must link this file path in your `train.yaml` and `infer.yaml` under `metadata_config_path`.
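For example, the generated metadata config might be referenced like this in `train.yaml` (an illustrative excerpt; `[NAME].json` is the placeholder file name from above):

```yaml
# train.yaml (excerpt)
metadata_config_path: configs/metadata_configs/[NAME].json
```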