chore: initial version#6

Merged
Justar96 merged 17 commits into main from first-version
Nov 4, 2025

Conversation

@xaviviro xaviviro commented Nov 3, 2025

Initial Release: Python TOON Format Implementation v1.0.0

Description

This PR establishes the official Python implementation of the TOON (Token-Oriented Object Notation) format. TOON is a compact, human-readable serialization format designed for passing structured data to Large Language Models with 30-60% token reduction compared to JSON.

This release migrates the complete implementation from the pytoon repository, adds comprehensive CI/CD infrastructure, and establishes the package as python-toon on PyPI.

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Documentation update
  • Bug fix (non-breaking change that fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Related Issues

Initial release - no related issues.

Changes Made

Core Implementation (11 modules, ~1,922 lines)

  • Complete encoder implementation with support for objects, arrays, tabular format, and primitives
  • Full decoder with strict/lenient parsing modes
  • CLI tool for JSON ↔ TOON conversion
  • Type definitions and constants following TOON specification
  • Value normalization for Python-specific types (Decimal, datetime, etc.)
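
As a rough illustration of what value normalization for Python-specific types can look like, here is a minimal sketch. The function name `normalize` and the chosen target representations (Decimal to float, dates to ISO 8601 strings) are assumptions made for this example and may differ from what the package actually does.

```python
from datetime import date, datetime
from decimal import Decimal


def normalize(value):
    """Convert Python-specific values into JSON-like primitives (illustrative only)."""
    if isinstance(value, Decimal):
        # Assumption: Decimal becomes float; the package may pick another representation.
        return float(value)
    if isinstance(value, (datetime, date)):
        # Assumption: dates and datetimes become ISO 8601 strings.
        return value.isoformat()
    if isinstance(value, dict):
        return {str(key): normalize(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [normalize(item) for item in value]
    return value


record = {"price": Decimal("9.50"), "created": datetime(2025, 11, 3, 9, 30)}
print(normalize(record))
# {'price': 9.5, 'created': '2025-11-03T09:30:00'}
```

Normalizing before encoding keeps the encoder itself limited to the primitives the format defines (null, booleans, numbers, strings).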

Package Configuration

  • Package name: python-toon (PyPI)
  • Module name: toon_format (Python import)
  • Version: 1.0.0
  • Python support: 3.8-3.14 (including 3.14t free-threaded)
  • Build system: hatchling (modern, PEP 517 compliant)
  • Dependencies: Zero runtime dependencies

CI/CD Infrastructure

  • GitHub Actions workflow for testing across Python 3.8-3.12
  • Automated PyPI publishing via OIDC trusted publishing
  • TestPyPI workflow for pre-release validation
  • Ruff linting and formatting enforcement
  • Type checking with mypy
  • Coverage reporting with pytest-cov

Testing

  • 73 comprehensive tests covering:
    • Encoding: primitives, objects, arrays (tabular and mixed), delimiters, indentation
    • Decoding: basic structures, strict mode, delimiters, length markers, edge cases
    • Roundtrip: encode → decode → encode consistency
    • 100% test pass rate
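
The roundtrip property listed above (encode → decode → encode) can be expressed as a small helper. This is a sketch only: `check_roundtrip` is a hypothetical name, and `json.dumps` / `json.loads` stand in for the package's encode and decode functions so the example runs without the package installed.

```python
import json


def check_roundtrip(value, encode, decode):
    """Return True when encode -> decode -> encode reproduces the first encoding."""
    first = encode(value)
    second = encode(decode(first))
    return first == second


# json.dumps / json.loads are stand-ins for the package's encode / decode.
sample = {"users": [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]}
print(check_roundtrip(sample, json.dumps, json.loads))
# True
```

Comparing the two encodings rather than the decoded values catches cases where decoding succeeds but the encoder is not stable on its own output.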

Documentation

  • Comprehensive README.md with:
    • Installation instructions (pip and uv)
    • Quick start guide
    • Complete API reference
    • CLI usage examples
    • LLM integration best practices
    • Token efficiency comparisons
  • CONTRIBUTING.md with development workflow
  • PR template for future contributions
  • Issue templates for bug reports
  • examples.py with 7 runnable demonstrations

SPEC Compliance

Implementation Details:

  • ✅ YAML-style indentation for nested objects
  • ✅ CSV-style tabular format for uniform arrays
  • ✅ Inline format for primitive arrays
  • ✅ List format for mixed arrays
  • ✅ Length markers [N] for all arrays
  • ✅ Optional # prefix for length markers
  • ✅ Delimiter options: comma (default), tab, pipe
  • ✅ Quoting rules for strings (minimal, spec-compliant)
  • ✅ Escape sequences: \", \\, \n, \r, \t
  • ✅ Primitives: null, true, false, numbers, strings
  • ✅ Strict and lenient parsing modes
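
The five escape sequences listed above can be illustrated with a small escape/unescape pair. This is a sketch under assumptions: the names `escape` and `unescape` are hypothetical, and the decision of when a string needs quoting at all (the quoting rules) is not covered here.

```python
# Escape pairs taken from the list above: \", \\, \n, \r, \t.
ESCAPES = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r", "\t": "\\t"}
UNESCAPES = {"\\": "\\", '"': '"', "n": "\n", "r": "\r", "t": "\t"}


def escape(text):
    """Replace each special character with its backslash escape."""
    return "".join(ESCAPES.get(ch, ch) for ch in text)


def unescape(text):
    """Reverse escape(); reject any backslash sequence outside the five listed."""
    out = []
    chars = iter(text)
    for ch in chars:
        if ch != "\\":
            out.append(ch)
            continue
        nxt = next(chars, None)  # None means the text ended on a lone backslash
        if nxt not in UNESCAPES:
            raise ValueError(f"invalid escape: \\{nxt}")
        out.append(UNESCAPES[nxt])
    return "".join(out)


original = 'say "hi"\tthen\nleave'
assert unescape(escape(original)) == original
```

Processing the text one character at a time avoids the double-replacement problem that chained `str.replace` calls can run into with backslashes.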

Testing

  • All existing tests pass
  • Added new tests for changes
  • Tested on Python 3.8
  • Tested on Python 3.9
  • Tested on Python 3.10
  • Tested on Python 3.11
  • Tested on Python 3.12

Test Output

============================= test session starts ==============================
platform darwin -- Python 3.11.14, pytest-8.4.2, pluggy-1.6.0
collected 73 items

tests/test_decoder.py .................................            [ 45%]
tests/test_encoder.py ........................................      [100%]

============================== 73 passed in 0.03s ==============================

Test Coverage:

  • Encoder: 40 tests covering all encoding scenarios
  • Decoder: 33 tests covering parsing and validation
  • All edge cases, delimiters, and format options tested
  • 100% pass rate

Code Quality

  • Ran ruff check src/toon_format tests - no issues
  • Ran ruff format src/toon_format tests - code formatted
  • Ran mypy src/toon_format - informational only (24 type hints to improve in future)
  • All tests pass: pytest tests/ -v

Linter Output:

$ ruff check src/toon_format tests
All checks passed!

Checklist

  • My code follows the project's coding standards (PEP 8, line length 100)
  • I have added type hints to new code
  • I have added tests that prove my fix/feature works
  • New and existing tests pass locally
  • I have updated documentation (README.md if needed)
  • My changes do not introduce new dependencies
  • I have maintained Python 3.8+ compatibility
  • I have reviewed the TOON specification for relevant sections

Performance Impact

  • No performance impact
  • Performance improvement (describe below)
  • Potential performance regression (describe and justify below)

Performance Characteristics:

  • Encoder: Fast string building with minimal allocations
  • Decoder: Single-pass parsing with minimal backtracking
  • Zero runtime dependencies for optimal load times
  • Suitable for high-frequency encoding/decoding scenarios

Breaking Changes

  • No breaking changes
  • Breaking changes (describe migration path below)

This is the initial release, so no breaking changes apply.

Screenshots / Examples

Basic Usage

from toon_format import encode

# Simple object
data = {"name": "Alice", "age": 30}
print(encode(data))

Output:

name: Alice
age: 30

Tabular Array Example

from toon_format import encode

users = [
    {"id": 1, "name": "Alice", "age": 30},
    {"id": 2, "name": "Bob", "age": 25},
    {"id": 3, "name": "Charlie", "age": 35},
]
print(encode(users))

Output:

[3,]{id,name,age}:
  1,Alice,30
  2,Bob,25
  3,Charlie,35
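
To make the tabular layout concrete, the following sketch reproduces the output shown above for this input. It is not the package's encoder: `encode_table` is a hypothetical helper, the header shape is copied from the printed output, and quoting, booleans, and nested values are deliberately left out.

```python
def encode_table(rows, delimiter=","):
    """Encode a list of dicts that share the same keys as a tabular block (illustrative)."""
    if not rows:
        raise ValueError("this sketch only handles non-empty lists")
    fields = list(rows[0].keys())
    if any(list(row.keys()) != fields for row in rows):
        raise ValueError("rows are not uniform")
    # Header shape copied from the output above: length, delimiter, then field names.
    header = f"[{len(rows)}{delimiter}]{{{delimiter.join(fields)}}}:"
    lines = [header]
    for row in rows:
        # One indented line per row; values appear in header order.
        lines.append("  " + delimiter.join(str(row[name]) for name in fields))
    return "\n".join(lines)


users = [
    {"id": 1, "name": "Alice", "age": 30},
    {"id": 2, "name": "Bob", "age": 25},
    {"id": 3, "name": "Charlie", "age": 35},
]
print(encode_table(users))
```

Field names appear once in the header instead of once per row, which is where most of the size reduction for uniform arrays comes from.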

Token Efficiency

import json
from toon_format import encode

data = {
    "users": [
        {"id": 1, "name": "Alice", "age": 30, "active": True},
        {"id": 2, "name": "Bob", "age": 25, "active": True},
        {"id": 3, "name": "Charlie", "age": 35, "active": False},
    ]
}

json_str = json.dumps(data)
toon_str = encode(data)

print(f"JSON: {len(json_str)} characters")
print(f"TOON: {len(toon_str)} characters")
print(f"Reduction: {100 * (1 - len(toon_str) / len(json_str)):.1f}%")

Output:

JSON: 177 characters
TOON: 85 characters
Reduction: 52.0%

Additional Context

Package Details

Installation

# With pip
pip install python-toon

# With uv (recommended)
uv pip install python-toon

Development Setup

# Clone repository
git clone https://github.com/toon-format/toon-python.git
cd toon-python

# Install with uv
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run linters
ruff check src/toon_format tests
mypy src/toon_format

Key Features

  1. Token Efficiency: 30-60% reduction compared to JSON
  2. Human Readable: YAML-like syntax for objects, CSV-like for arrays
  3. Spec Compliant: 100% compatible with official TOON specification
  4. Type Safe: Full type hints throughout codebase
  5. Well Tested: 73 tests with 100% pass rate
  6. Zero Dependencies: No runtime dependencies
  7. Python 3.8+: Supports Python 3.8 through 3.14t (free-threaded)
  8. Fast: Single-pass parsing, minimal allocations
  9. Flexible: Multiple delimiters, indentation options, strict/lenient modes
  10. CLI Included: Command-line tool for JSON ↔ TOON conversion

Code Quality Notes

Mypy Type Checking: The project currently has 24 mypy type errors that are informational only. The CI is configured with continue-on-error: true for mypy checks, and the pyproject.toml has lenient mypy settings (disallow_untyped_defs = false, check_untyped_defs = false). These type hints can be improved incrementally in future releases without blocking the current functionality.

All runtime behavior is validated through 73 comprehensive tests with 100% pass rate.

Future Roadmap

  • Improve type hint coverage (address 24 mypy warnings)
  • Additional encoding options (custom formatters)
  • Performance optimizations for large datasets
  • Streaming encoder/decoder for very large files
  • Additional language implementations
  • Enhanced CLI features (pretty-printing, validation)

Checklist for Reviewers

  • Code changes are clear and well-documented
  • Tests adequately cover the changes
  • Documentation is updated
  • No security concerns
  • Follows TOON specification
  • Backward compatible (or breaking changes are justified and documented)

Review Focus Areas

  1. Spec Compliance: Verify encoding/decoding matches TOON spec exactly
  2. Edge Cases: Check handling of empty strings, special characters, nested structures
  3. Type Safety: Ensure type hints are accurate and complete
  4. Error Messages: Verify error messages are clear and helpful
  5. Documentation: Confirm examples work as shown
  6. CI/CD: Verify workflows are properly configured for PyPI deployment
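
For the strict-mode and error-message review points, a check along the following lines shows the kind of behavior a reviewer might look for. It is a sketch under assumptions: `parse_inline_array` and its regular expression are hypothetical, values are left as unquoted strings, and the package's real lenient mode may behave differently.

```python
import re

# Hypothetical pattern for one inline-array line such as "tags[3]: a,b,c" or "tags[#3]: a,b,c".
HEADER = re.compile(r"^(?P<key>[^\[\]:]+)\[#?(?P<length>\d+)\]:\s*(?P<body>.*)$")


def parse_inline_array(line, strict=True):
    """Split an inline primitive array; in strict mode the declared length must match."""
    match = HEADER.match(line)
    if match is None:
        raise ValueError(f"not an inline array line: {line!r}")
    body = match.group("body")
    items = body.split(",") if body else []
    declared = int(match.group("length"))
    if strict and len(items) != declared:
        # The message names both numbers so the mismatch is easy to locate.
        raise ValueError(f"declared {declared} items, found {len(items)}")
    return match.group("key"), items


print(parse_inline_array("tags[3]: admin,ops,dev"))
# ('tags', ['admin', 'ops', 'dev'])
print(parse_inline_array("tags[#2]: a,b,c", strict=False))
# ('tags', ['a', 'b', 'c'])
```

In this sketch, lenient mode simply skips the length check and returns whatever items are present.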

@xaviviro xaviviro requested review from a team and johannschopplich as code owners November 3, 2025 09:30
@xaviviro xaviviro closed this Nov 3, 2025
@johannschopplich
Contributor

@xaviviro Please keep a PR open instead of closing and opening again. Rather, iterate on your branch with separate commits. The pipeline should re-run at every commit.

@xaviviro xaviviro reopened this Nov 3, 2025
@xaviviro
Contributor Author

xaviviro commented Nov 3, 2025

Thanks @johannschopplich for letting me know I could iterate on the PR - I wasn't aware of that! All checks are now passing. ✅

@johannschopplich
Contributor

@xaviviro Sure! @toon-format/python-maintainers Please decide whether it's the right approach to do it all in one PR vs. separate PRs for setup/CI etc. Smaller PRs usually make it easier to review and move forward.

Maybe it's even better to make a poll which code base to pick as base and then incorporate what you all have worked on into this new repo. Please discuss beforehand; it's not about first come first served, but the best foundation for this package. Thank you. 🙏

@xaviviro
Contributor Author

xaviviro commented Nov 3, 2025

Perfect! I had some time now and created this as a starting point to get things moving. I'm completely open to whatever approach the team decides is best - whether it's using this as a base, starting fresh, or incorporating elements from different implementations.
I agree smaller PRs make more sense for review. Happy to break this down or adjust the approach based on what works best for everyone.
Looking forward to the first official release so I can archive my repo and point everyone to the canonical implementation. Thanks for coordinating this! 🙏

xaviviro and others added 2 commits November 3, 2025 20:51
Keep both reference repositories section and standard Python gitignore structure.

Co-authored-by: Justar96
@xaviviro
Contributor Author

xaviviro commented Nov 4, 2025

Dear team @toon-format/python-maintainers @johannschopplich ,

I hope this message finds you well. Since there hasn't been much activity on this PR, I'd like to provide some additional context that might help move things forward.

The implementation I'm proposing here is based on my python-toon package, which has been live on PyPI and has already accumulated over 5,000 downloads with zero reported issues. You can see the download statistics here:

https://pepy.tech/projects/python-toon?timeRange=threeMonths

This track record demonstrates the robustness and reliability of the codebase.

The implementation includes:

  • 73 comprehensive tests with 100% pass rate
  • Full CI/CD pipeline with GitHub Actions
  • Complete TOON spec compliance
  • Production-ready code quality (ruff, mypy)
  • Extensive documentation and examples

I'm open to any alternative approach you might prefer, but I think it's important we move forward with an official Python implementation. Let me know how you'd like to proceed!

Best regards,
Xavi

Justar96 and others added 5 commits November 4, 2025 18:08
## Code Organization
- Add Google-style headers to all 18 source files
  - Copyright (c) 2025 TOON Format Organization
  - SPDX-License-Identifier: MIT
  - Comprehensive module docstrings
- Format all source code with Ruff

## Test Suite Expansion
- Increase test coverage from 78% to 91% (792 tests)
- Add comprehensive test modules:
  - test_security.py: 24 tests for injection prevention and resource exhaustion
  - test_internationalization.py: 24 tests for Unicode/UTF-8 support
  - test_cli.py: 30 integration tests for command-line interface
  - test_scanner.py: 31 tests for scanner module (100% coverage)
  - test_string_utils.py: 42 tests for string utilities (100% coverage)
  - test_normalize_functions.py: 37 tests for normalization (95% coverage)
  - test_parsing_utils.py: Complete parsing utility coverage
- Add 306 official spec compliance tests via test_spec_fixtures.py
- Create test fixture infrastructure with JSON schema validation

## Files Changed
- Modified: All 18 source files in src/toon_format/
- Added: 8 new test modules
- Added: Test fixtures and schema
- Added: New utility module _parsing_utils.py
Features:
- Add benchmark dependency group with tiktoken>=0.4.0 to pyproject.toml
- Export count_tokens, estimate_savings, and compare_formats utilities
- Implement token counting using tiktoken with o200k_base encoding (gpt5/gpt5-mini)

Documentation Updates:
- Add Token Counting & Comparison section to main README with examples
- Update docs/README.md with new utility functions in API reference list
- Add roadmap section announcing planned comprehensive benchmarks
- Add complete Utility Functions section to docs/api.md covering:
  * count_tokens() - Token counting with tiktoken
  * estimate_savings() - JSON vs TOON comparison metrics
  * compare_formats() - Formatted comparison tables
- Add Token Efficiency examples with cost estimation patterns
- Update LLM integration guide with Measuring Token Savings section
- Include cost calculation examples and integration patterns
- Update model references from GPT-4 to gpt5 throughout docs
- Add benchmark disclaimer noting comprehensive benchmarks coming soon

Technical Details:
- Update tokenizer documentation from GPT-4o/GPT-4 to gpt5/gpt5-mini
- Fix TypedDict usage examples in docs/api.md (EncodeOptions uses dict syntax)
- Clarify DecodeOptions is a class while EncodeOptions is a TypedDict
- Add toon-spec/ submodule files (CHANGELOG.md and SPEC.md v1.3)
@Justar96
Contributor

Justar96 commented Nov 4, 2025

@johannschopplich @toon-format/python-maintainers
Hey everyone,
I've added a full test suite for compliance [https://github.com/toon-format/spec/tree/main/tests] with 91% coverage, and fixed some encode and decode issues to comply with the main spec.

Code Organization

  • Add Google-style headers to all 18 source files
    • Copyright (c) 2025 TOON Format Organization
    • SPDX-License-Identifier: MIT
    • Comprehensive module docstrings
  • Format all source code with Ruff

Test Suite Expansion

  • Increase test coverage from 78% to 91% (792 tests)
  • Add comprehensive test modules:
    • test_security.py: 24 tests for injection prevention and resource exhaustion
    • test_internationalization.py: 24 tests for Unicode/UTF-8 support
    • test_cli.py: 30 integration tests for command-line interface
    • test_scanner.py: 31 tests for scanner module (100% coverage)
    • test_string_utils.py: 42 tests for string utilities (100% coverage)
    • test_normalize_functions.py: 37 tests for normalization (95% coverage)
    • test_parsing_utils.py: Complete parsing utility coverage
  • Add 306 official spec compliance tests via test_spec_fixtures.py
  • Create test fixture infrastructure with JSON schema validation

Files Changed

  • Modified: All 18 source files in src/toon_format/
  • Added: 8 new test modules
  • Added: Test fixtures and schema
  • Added: New utility module _parsing_utils.py

@johannschopplich johannschopplich changed the title First version chore: initial version Nov 4, 2025
@johannschopplich
Contributor

@bpradana @davidpirogov Usually I don't want to interfere with code style and repo setup. However, to iterate quickly, please leave a review in the upcoming days for this PR. Otherwise, I'd like to merge in order to move forward quickly. Hope you understand that. 🙂 You can incorporate all the best practices with smaller, incremental PRs.

@Justar96 Lovely! When this PR is merged, feel free to open a new PR to incorporate these changes.

@bpradana bpradana left a comment

Overall, LGTM 🚀

@johannschopplich
Contributor

@bpradana You have to leave a review 🙂 – and approve this PR so we can get this merged:
[Screenshot, 2025-11-04 at 16:27:51]

@xaviviro Your honor to merge afterwards! Thank you for your work. 👏

@bpradana bpradana left a comment

LGTM!

@bpradana

bpradana commented Nov 4, 2025

@johannschopplich my bad, I didn't see the approve button; turns out it's in the review section 😓

@Justar96 Justar96 merged commit 43fd07b into main Nov 4, 2025
6 checks passed
@johannschopplich
Contributor

@bpradana No worries at all! It's kinda complicated on GitHub anyway. I have set up this repo to require 2 reviewers per PR. If that's too strict for the team, I can always lower that.

@Justar96
Contributor

Justar96 commented Nov 4, 2025

I think it's good.

@bpradana

bpradana commented Nov 4, 2025

@johannschopplich 2 reviewers for now is totally fine, especially in the early stages of development. we can dial it back, if it starts feeling overkill 😁

@xaviviro xaviviro deleted the first-version branch November 4, 2025 18:19
@Justar96
Contributor

Justar96 commented Nov 5, 2025

We still need to publish to PyPI @johannschopplich @toon-format/python-maintainers

@davidpirogov
Contributor

Let’s just hold off for a few days until we stabilize the code base and make sure that we are in compliance with spec and tests.

We still have a lot to migrate and test. We’ll be ready very soon!

@Justar96
Contributor

Justar96 commented Nov 5, 2025

I just want to point that out, since we have no proper plan or status.

@davidpirogov
Contributor

Yeah - fair point. Our plan is documented here: toon-format/toon#54

It's probably better to make an issue in this repo. We'll fix up the chaos once we get all the code and content migrated into this repo.

@Justar96
Contributor

Justar96 commented Nov 5, 2025

Shall we create a proper plan? We could collaborate in Notion or some other way.

@davidpirogov
Contributor

Yeah, we need to - let’s stick to GitHub discussions - keeps everything in one place

@johannschopplich
Contributor

@davidpirogov Thanks for explaining – I agree!

@Justar96 Usually, to prevent comments from being buried in closed threads (like this one), you can:

  • Open a discussion for open topics, like general roadmap
  • Or create issues for specific issues/ideas that can be worked on independently

Both help to keep track of what work still needs to be done. It also gives the team the opportunity to choose the person who works best with a task, e.g. CI integration for GitHub releases and auto-publishing to PyPI.

If you want, you can compare the current state of the repo to the goal of a v1 (using the language-agnostic tests for example) and open a roadmap discussion. Title idea: "Roadmap to v1"
