See also: https://semango.org/guide/ingestion
Semango currently supports row-oriented CSV and JSON ingestion. Parquet and SQLite are not implemented in the current codebase.
| Format | Extensions | Reader backend |
|---|---|---|
| CSV | `.csv`, `.tsv` | Go std-lib `encoding/csv` (delimiter configurable) |
| JSON array | `.json` | streaming `json.Decoder` |
| JSON Lines | `.jsonl` | `bufio.Scanner` |
Tip: other column separators (TSV, pipe-delimited, …) can be handled via a 4-line custom loader that wraps the CSV reader and plugs into the same helper.
- `internal/ingest/tabular/common.go` inspects the first few thousand rows and classifies every column as one of `text`, `categorical`, `numeric`, `datetime`, `binary`.
- For each physical row we build a joined text snippet: `name: Jane Doe \n comment: shipping address missing`. Only `text` and `categorical` columns participate; numeric and date fields are kept verbatim in Bleve keyword/typed fields so they can be filtered (`meta.col.age < 30`) or boosted (`ship_date > now()-30d`).
- The snippet is sent to the embedding provider and stored as the vector part of the chunk. Metadata holds every original value under `col.<name>` so hybrid search and filters "just work".
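
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in `common.go`: the thresholds, the single date layout, the newline separator and the names `classify` and `buildSnippet` are all invented for the example, and the `binary` kind is left out because the text does not define it.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

type ColumnKind string

const (
	KindText        ColumnKind = "text"
	KindCategorical ColumnKind = "categorical"
	KindNumeric     ColumnKind = "numeric"
	KindDatetime    ColumnKind = "datetime"
)

// classify guesses a column kind from sampled values.
// Assumed heuristics: all-parseable numbers -> numeric, all ISO dates ->
// datetime, few distinct values relative to the sample -> categorical.
func classify(samples []string) ColumnKind {
	distinct := map[string]struct{}{}
	seen, nums, dates := 0, 0, 0
	for _, s := range samples {
		s = strings.TrimSpace(s)
		if s == "" {
			continue // empty cells carry no type information
		}
		seen++
		distinct[strings.ToLower(s)] = struct{}{}
		if _, err := strconv.ParseFloat(s, 64); err == nil {
			nums++
		}
		if _, err := time.Parse("2006-01-02", s); err == nil {
			dates++
		}
	}
	switch {
	case seen == 0:
		return KindText
	case nums == seen:
		return KindNumeric
	case dates == seen:
		return KindDatetime
	case len(distinct)*5 <= seen: // at most 20% distinct values (illustrative)
		return KindCategorical
	default:
		return KindText
	}
}

// buildSnippet joins text and categorical columns as "name: value" lines
// and copies every original value into metadata under "col.<name>".
func buildSnippet(header []string, kinds []ColumnKind, row []string) (string, map[string]string) {
	var parts []string
	meta := make(map[string]string, len(row))
	for i, v := range row {
		if i >= len(header) || i >= len(kinds) {
			break // ignore cells beyond the known columns
		}
		meta["col."+header[i]] = v
		if (kinds[i] == KindText || kinds[i] == KindCategorical) && v != "" {
			parts = append(parts, header[i]+": "+v)
		}
	}
	return strings.Join(parts, "\n"), meta
}

func main() {
	header := []string{"name", "age", "comment"}
	columns := [][]string{
		{"Jane Doe", "John Roe", "Ann Lee"},
		{"29", "41", "35"},
		{"shipping address missing", "ok", "refund requested"},
	}
	kinds := make([]ColumnKind, len(columns))
	for i, c := range columns {
		kinds[i] = classify(c)
	}
	snippet, meta := buildSnippet(header, kinds, []string{"Jane Doe", "29", "shipping address missing"})
	fmt.Println(kinds)
	fmt.Println(snippet)
	fmt.Println("col.age =", meta["col.age"])
}
```

The snippet string is what would be handed to the embedding provider, while `meta` keeps the numeric `age` value available for filters even though it never reaches the embedded text.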
Key parameters live under the new `tabular:` config node:

    # semango.yml
    ...
    tabular:
      max_rows_embedded: 50000   # hard cap per file
      sampling: random           # random|stratified
      min_text_tokens: 5         # ignore rows with <N textual tokens
      delimiter: "\t"            # override to support TSV

When a file exceeds the cap, Semango samples rows (either random reservoir or simple stratified).
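
The random reservoir option can be illustrated with the classic Algorithm R, shown below as a sketch. The `Reservoir` type and its methods are hypothetical, the capacity plays the role of `max_rows_embedded`, and whether Semango uses this exact variant is an assumption. Stratified sampling is not shown.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Reservoir keeps a uniform random sample of at most k rows from a
// stream whose length is not known in advance (Algorithm R).
type Reservoir struct {
	k    int
	seen int
	rows [][]string
	rng  *rand.Rand
}

// NewReservoir creates a sampler with capacity k and a fixed seed so
// that runs are reproducible.
func NewReservoir(k int, seed int64) *Reservoir {
	return &Reservoir{k: k, rng: rand.New(rand.NewSource(seed))}
}

// Add offers one row to the sample.
func (r *Reservoir) Add(row []string) {
	r.seen++
	if len(r.rows) < r.k {
		r.rows = append(r.rows, row) // fill phase: keep the first k rows
		return
	}
	// Replace a random slot with probability k/seen.
	if j := r.rng.Intn(r.seen); j < r.k {
		r.rows[j] = row
	}
}

// Rows returns the current sample.
func (r *Reservoir) Rows() [][]string { return r.rows }

func main() {
	res := NewReservoir(5, 42) // 5 stands in for max_rows_embedded
	for i := 0; i < 100000; i++ {
		res.Add([]string{fmt.Sprintf("row-%d", i)})
	}
	fmt.Println("kept", len(res.Rows()), "of", res.seen, "rows")
}
```

Because each row is offered once and memory is bounded by `k`, this fits the streaming readers without a second pass over the file.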

    curl -s -XPOST /api/v1/search -d '{
      "query": "rows where payment_status = failed"
    }' | jq '.results[0]'

Hybrid RRF fuses BM25 hits on the keyword filter with semantic neighbours of "failed payments".
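
Reciprocal rank fusion itself is small enough to sketch. The version below scores each document as the sum of `1/(k + rank)` over the ranked lists it appears in; `k = 60` is the value commonly used in the literature, and both the constant and the `rrf` helper are assumptions about, not a copy of, Semango's implementation. The row IDs are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// rrf fuses ranked ID lists with reciprocal rank fusion:
// score(d) = sum over lists of 1 / (k + rank), with rank starting at 1.
func rrf(k float64, lists ...[]string) []string {
	score := map[string]float64{}
	for _, list := range lists {
		for i, id := range list {
			score[id] += 1.0 / (k + float64(i+1))
		}
	}
	ids := make([]string, 0, len(score))
	for id := range score {
		ids = append(ids, id)
	}
	// Highest fused score first; break ties by ID for stable output.
	sort.Slice(ids, func(a, b int) bool {
		if score[ids[a]] != score[ids[b]] {
			return score[ids[a]] > score[ids[b]]
		}
		return ids[a] < ids[b]
	})
	return ids
}

func main() {
	bm25 := []string{"row-17", "row-4", "row-9"}    // keyword hits
	vector := []string{"row-4", "row-23", "row-17"} // semantic neighbours
	fmt.Println(rrf(60, bm25, vector))
}
```

Rows that appear in both lists accumulate two terms, so they rise above rows that rank well in only one of them.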
| Pro | Con |
|---|---|
| Simple heuristics (no external schema needed) | May mis-classify exotic data types |
| Embeds only columns likely to matter | Very wide text columns (>5 KB) still produce chunky vectors |
| Scales to millions of rows via sampling | Per-row embedding may miss cross-row context (use summary vector) |
- Add column-level embeddings (one vector per unique column) for exploratory schema questions.
- Use IVF-PQ FAISS index automatically when row count ≫ 1 M to cut RAM.
- Hit an LLM to pick "best" 3-5 columns at query time and re-rank candidates.