Tabular Data Ingestion (CSV / JSON / JSONL)

See also: https://semango.org/guide/ingestion

Semango currently supports row-oriented ingestion of CSV, JSON array, and JSON Lines files. Parquet and SQLite are not implemented in the current codebase.

Supported formats

| Format | Extensions | Reader backend |
|---|---|---|
| CSV | .csv, .tsv | Go std-lib encoding/csv (delimiter configurable) |
| JSON array | .json | streaming json.Decoder |
| JSON Lines | .jsonl | bufio.Scanner |

Tip: other column separators (TSV, pipe-delimited, …) can be handled via a 4-line custom loader that wraps the CSV reader and plugs into the same helper.

How rows become searchable

  1. internal/ingest/tabular/common.go inspects the first few thousand rows and classifies every column as one of text, categorical, numeric, datetime, binary.

  2. For each physical row we build a joined text snippet:

    name: Jane Doe  \n  comment: shipping address missing
    

    Only text and categorical columns participate; numeric and datetime fields are kept verbatim in Bleve keyword/typed fields so they can be filtered (meta.col.age < 30) or boosted (ship_date > now()-30d).

  3. The snippet is sent to the embedding provider and stored as the vector part of the chunk. Metadata holds every original value under col.<name> so hybrid search and filters "just work".
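The first two steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in internal/ingest/tabular/common.go: the classify heuristics, the maxDistinct threshold, and the RFC 3339 date check are all hypothetical, and binary detection is left out for brevity.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// ColumnKind mirrors the classes named above; the heuristics are assumed.
type ColumnKind string

const (
	Text        ColumnKind = "text"
	Categorical ColumnKind = "categorical"
	Numeric     ColumnKind = "numeric"
	Datetime    ColumnKind = "datetime"
)

// classify guesses a column's kind from sampled values.
// maxDistinct is a hypothetical threshold separating categorical from text.
func classify(samples []string, maxDistinct int) ColumnKind {
	allNum, allTime := true, true
	distinct := map[string]struct{}{}
	for _, s := range samples {
		if _, err := strconv.ParseFloat(s, 64); err != nil {
			allNum = false
		}
		if _, err := time.Parse(time.RFC3339, s); err != nil {
			allTime = false
		}
		distinct[s] = struct{}{}
	}
	switch {
	case len(samples) == 0:
		return Text // nothing to infer from
	case allNum:
		return Numeric
	case allTime:
		return Datetime
	case len(distinct) <= maxDistinct:
		return Categorical
	default:
		return Text
	}
}

// buildSnippet joins "name: value" pairs for text and categorical columns
// only, in header order, skipping empty values.
func buildSnippet(header []string, kinds map[string]ColumnKind, row []string) string {
	var parts []string
	for i, col := range header {
		k := kinds[col]
		if (k != Text && k != Categorical) || i >= len(row) || row[i] == "" {
			continue
		}
		parts = append(parts, col+": "+row[i])
	}
	return strings.Join(parts, "\n")
}

func main() {
	header := []string{"name", "age", "comment"}
	kinds := map[string]ColumnKind{"name": Text, "age": Numeric, "comment": Text}
	row := []string{"Jane Doe", "29", "shipping address missing"}
	fmt.Println(buildSnippet(header, kinds, row))
}
```

In this sketch the age column is excluded from the snippet; per the description above, such values would instead be stored as typed metadata under col.<name>.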

Vector explosion control

Key parameters live under the new tabular: config node:

# semango.yml
...
tabular:
  max_rows_embedded: 50000   # hard cap per file
  sampling: random           # random|stratified
  min_text_tokens: 5         # ignore rows with <N textual tokens
  delimiter: "\t"            # override to support TSV

When a file exceeds max_rows_embedded, Semango samples rows down to the cap, using either random reservoir sampling or simple stratified sampling, as selected by the sampling key.

Query examples

# SEMANGO_URL is a placeholder for the base URL of your Semango server
curl -s -XPOST "$SEMANGO_URL/api/v1/search" -d '{
  "query": "rows where payment_status = failed"
}' | jq '.results[0]'

Hybrid search uses reciprocal rank fusion (RRF) to merge BM25 hits on the keyword filter with the semantic neighbours of "failed payments".

Trade-offs

| Pro | Con |
|---|---|
| Simple heuristics (no external schema needed) | May mis-classify exotic data types |
| Embeds only columns likely to matter | Very wide text columns (>5 KB) still produce chunky vectors |
| Scales to millions of rows via sampling | Per-row embedding may miss cross-row context (use summary vector) |

Extending further

  • Add column-level embeddings (one vector per unique column) for exploratory schema questions.
  • Switch to an IVF-PQ FAISS index automatically when the row count is ≫ 1 M, to cut RAM usage.
  • Ask an LLM to pick the "best" 3-5 columns at query time and re-rank candidates.