Merged
1 change: 1 addition & 0 deletions .github/workflows/ci.yml
@@ -118,6 +118,7 @@ jobs:
      env:
        GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
        GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        HUGGING_FACE_API: ${{ secrets.HUGGING_FACE_API }}
      run: pytest tests/integrations/ -v --cov=shekel --cov-report=xml --cov-append

- name: Upload integration coverage
16 changes: 16 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,22 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added
- **Google Gemini Provider Adapter** (`shekel/providers/gemini.py`) — Native support for the `google-genai` SDK
  - Patches `google.genai.models.Models.generate_content` (non-streaming) and `generate_content_stream` (streaming) as two separate methods
  - Token extraction from `response.usage_metadata.prompt_token_count` / `candidates_token_count`
  - Model name captured from the `model` kwarg before the call (not available in Gemini response objects)
  - New pricing entries: `gemini-2.0-flash`, `gemini-2.5-flash`, `gemini-2.5-pro`
  - Install via `pip install shekel[gemini]`
- **HuggingFace Provider Adapter** (`shekel/providers/huggingface.py`) — Support for `huggingface_hub.InferenceClient`
  - Patches `InferenceClient.chat_completion` (the underlying method for `.chat.completions.create`)
  - OpenAI-compatible token extraction (`usage.prompt_tokens` / `usage.completion_tokens`)
  - Graceful handling when models don't return usage in streaming responses
  - Install via `pip install shekel[huggingface]`
- **Integration tests** for both new adapters with real API calls (skip gracefully on quota errors)
- **Examples**: `examples/gemini_demo.py`, `examples/huggingface_demo.py`
- **Documentation**: `docs/integrations/gemini.md`, `docs/integrations/huggingface.md`

## [0.2.5] - 2026-03-11

### Added
11 changes: 11 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,11 @@
# CLAUDE.md — Development Guidelines

## Test File Naming

Tests must be organized **by domain**, not by implementation unit or coverage goal.

- **Good**: `test_openai_wrappers.py`, `test_gemini_wrappers.py`, `test_fallback.py`
- **Bad**: `test_patch_coverage.py`, `test_patching.py`, `test_coverage_for_x.py`

Name test files after the feature or domain being exercised, not after the module
being covered or the motivation for writing the tests.
18 changes: 18 additions & 0 deletions docs/installation.md
@@ -35,6 +35,22 @@ If you're using models from both providers:
pip install shekel[all]
```

### Google Gemini

For Google Gemini via the `google-genai` SDK:

```bash
pip install shekel[gemini]
```

### HuggingFace Inference API

For HuggingFace's `InferenceClient`:

```bash
pip install shekel[huggingface]
```

### LiteLLM (100+ Providers)

For access to OpenAI, Anthropic, Gemini, Cohere, Ollama, Azure, Bedrock, and 90+ more through a unified interface:
@@ -107,6 +123,8 @@ Shekel has zero required dependencies beyond the Python standard library. The Op
| `openai>=1.0.0` | Optional | Track OpenAI API costs |
| `anthropic>=0.7.0` | Optional | Track Anthropic API costs |
| `litellm>=1.0.0` | Optional | Track costs via LiteLLM (100+ providers) |
| `google-genai>=1.0.0` | Optional | Track Google Gemini costs (native SDK) |
| `huggingface-hub>=0.20.0` | Optional | Track HuggingFace Inference API costs |
| `tokencost>=0.1.0` | Optional | Support 400+ models |
| `click>=8.0.0` | Optional | CLI tools |

159 changes: 159 additions & 0 deletions docs/integrations/gemini.md
@@ -0,0 +1,159 @@
# Google Gemini Integration

Shekel tracks costs and enforces budgets for [Google Gemini](https://ai.google.dev/) via the official `google-genai` Python SDK.

## Installation

```bash
pip install shekel[gemini]
```

## Why a dedicated adapter?

Unlike OpenAI and Anthropic, Gemini uses its own SDK (`google-genai`) that makes direct API calls — it does **not** route through the OpenAI SDK. Without a dedicated adapter, `budget()` would be completely blind to Gemini spend.

Shekel's `GeminiAdapter` patches two methods at runtime:

- `google.genai.models.Models.generate_content` — non-streaming calls
- `google.genai.models.Models.generate_content_stream` — streaming calls

All other Shekel features (nested budgets, fallback models, `BudgetExceededError`) work identically.
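The patching approach can be sketched as follows. This is a minimal illustration using a stand-in `Models` class rather than the real `google-genai` SDK, and a plain list in place of Shekel's budget machinery; the adapter's actual internals may differ:

```python
import functools

# Stand-ins for google.genai internals: illustration only, not the real SDK.
class FakeUsage:
    prompt_token_count = 12
    candidates_token_count = 34

class FakeResponse:
    usage_metadata = FakeUsage()

class Models:
    def generate_content(self, *, model, contents):
        return FakeResponse()

recorded = []  # (model, input_tokens, output_tokens) seen by the tracker

def patch_generate_content(cls):
    """Wrap generate_content so every call reports its token usage."""
    original = cls.generate_content

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        response = original(self, *args, **kwargs)
        usage = response.usage_metadata
        # The model name comes from the kwarg: Gemini responses don't carry it.
        recorded.append((kwargs.get("model"),
                         usage.prompt_token_count,
                         usage.candidates_token_count))
        return response

    cls.generate_content = wrapper

patch_generate_content(Models)
Models().generate_content(model="gemini-2.0-flash", contents="hi")
```

The same wrapping applies to `generate_content_stream`, with the difference that usage must be read from the final streamed chunk rather than a single response object.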

## Basic Integration

```python
import google.genai as genai
from shekel import budget

client = genai.Client(api_key="your-gemini-key")

with budget(max_usd=1.00) as b:
response = client.models.generate_content(
model="gemini-2.0-flash-lite",
contents="Explain quantum computing in one sentence.",
)
print(response.candidates[0].content.parts[0].text)
print(f"Cost: ${b.spent:.6f}")
```

## Streaming

Gemini streaming uses a **separate method** (`generate_content_stream`) rather than a `stream=True` kwarg — Shekel patches both:

```python
with budget(max_usd=1.00) as b:
for chunk in client.models.generate_content_stream(
model="gemini-2.0-flash-lite",
contents="List three benefits of Python.",
):
if chunk.candidates:
print(chunk.candidates[0].content.parts[0].text, end="", flush=True)
print()
print(f"Cost: ${b.spent:.6f}")
```

## Nested Budgets

Track costs across multi-step Gemini workflows:

```python
with budget(max_usd=5.00, name="pipeline") as total:
with budget(max_usd=1.00, name="research") as research:
client.models.generate_content(
model="gemini-2.0-flash-lite",
contents="Summarise recent AI trends.",
)

with budget(max_usd=2.00, name="analysis") as analysis:
client.models.generate_content(
model="gemini-2.0-flash",
contents="Analyse the implications of those trends.",
)

print(f"Research: ${research.spent:.6f}")
print(f"Analysis: ${analysis.spent:.6f}")
print(f"Total: ${total.spent:.6f}")
print(total.tree())
```

## Fallback Models

Switch to a cheaper Gemini model when spend reaches a threshold:

```python
with budget(
max_usd=0.50,
fallback={"at_pct": 0.8, "model": "gemini-2.0-flash-lite"},
) as b:
# Starts with gemini-2.0-flash; auto-switches at 80% ($0.40)
response = client.models.generate_content(
model="gemini-2.0-flash",
contents="Write a detailed market analysis.",
)

if b.model_switched:
print(f"Switched to fallback at ${b.switched_at_usd:.4f}")
```

!!! note "Same-provider fallback only"
Fallback must be another Gemini model. Cross-provider fallback (e.g. Gemini → GPT-4o) is not supported.

## Budget Enforcement

Stop a runaway Gemini loop automatically:

```python
from shekel import BudgetExceededError

try:
with budget(max_usd=2.00) as b:
for _ in range(100): # Shekel stops this when budget runs out
client.models.generate_content(
model="gemini-2.0-flash-lite",
contents="Analyse this document.",
)
except BudgetExceededError as e:
print(f"Stopped at ${e.spent:.4f} — saved the rest of the budget.")
```

## Supported Models and Pricing

| Model | Input (per 1k tokens) | Output (per 1k tokens) |
|---|---|---|
| `gemini-2.5-pro` | $0.00125 | $0.01000 |
| `gemini-2.5-flash` | $0.000075 | $0.00030 |
| `gemini-2.0-flash` | $0.000075 | $0.00030 |
| `gemini-2.0-flash-lite` | $0.000075 | $0.00030 |
| `gemini-1.5-pro` | $0.00125 | $0.00500 |
| `gemini-1.5-flash` | $0.000075 | $0.00030 |

Shekel uses prefix matching, so `gemini-2.0-flash-001` and similar versioned names resolve automatically.
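A longest-prefix lookup along these lines would produce that behavior. The prices are copied from the table above, but the resolution logic is a sketch, not Shekel's actual implementation:

```python
# Subset of the pricing table: (input, output) USD per 1k tokens.
PRICES = {
    "gemini-2.5-pro": (0.00125, 0.01000),
    "gemini-2.0-flash": (0.000075, 0.00030),
    "gemini-2.0-flash-lite": (0.000075, 0.00030),
}

def resolve_price(model: str):
    """Exact match first, then longest known prefix, so versioned names
    like gemini-2.0-flash-001 resolve to gemini-2.0-flash."""
    if model in PRICES:
        return PRICES[model]
    for known in sorted(PRICES, key=len, reverse=True):
        if model.startswith(known):
            return PRICES[known]
    return None  # unknown model: caller must supply custom pricing

resolve_price("gemini-2.0-flash-001")  # resolves to gemini-2.0-flash pricing
```

Checking longer prefixes first matters: it keeps `gemini-2.0-flash-lite` from being swallowed by the shorter `gemini-2.0-flash` entry.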

## Custom Pricing

For models not in the pricing table, pass `price_per_1k_tokens`:

```python
with budget(
max_usd=1.00,
price_per_1k_tokens={"input": 0.0001, "output": 0.0003},
) as b:
client.models.generate_content(
model="gemini-3-flash-preview",
contents="Hello.",
)
```

## Tips for Gemini + Shekel

1. **Use `generate_content_stream` for long responses** — streaming lets you stop mid-generation if the budget is hit
2. **Wrap at the workflow level**, not per-call, for accurate total cost tracking
3. **Set `warn_at=0.8`** to log a warning before the budget cap triggers
4. **Gemini free tier has per-minute limits** — use exponential backoff for production workloads
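For tip 4, a jittered exponential backoff wrapper is a common pattern. The sketch below is generic and hypothetical: the exception type is a stand-in, since the real SDK raises its own rate-limit error class:

```python
import random
import time

def call_with_backoff(call, max_attempts=5, base_delay=1.0):
    """Retry a rate-limited call with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RuntimeError:  # stand-in for the SDK's rate-limit error type
            if attempt == max_attempts - 1:
                raise
            # Waits 1s, 2s, 4s, ... plus jitter to avoid synchronized retries.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.25))
```

Usage: `call_with_backoff(lambda: client.models.generate_content(model="gemini-2.0-flash-lite", contents="..."))`. Because the wrapped call still runs inside the active `budget()` context, retried requests are tracked like any other.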

## Next Steps

- [HuggingFace Integration](huggingface.md)
- [Nested Budgets](../usage/nested-budgets.md)
- [Fallback Models](../usage/fallback-models.md)
- [Extending Shekel](../extending.md)
144 changes: 144 additions & 0 deletions docs/integrations/huggingface.md
@@ -0,0 +1,144 @@
# HuggingFace Integration

Shekel tracks costs and enforces budgets for [HuggingFace Inference API](https://huggingface.co/docs/inference-providers/en/index) via the `huggingface-hub` Python SDK's `InferenceClient`.

## Installation

```bash
pip install shekel[huggingface]
```

## Why a dedicated adapter?

HuggingFace's `InferenceClient` uses its own HTTP layer — it does **not** call the OpenAI SDK under the hood. Without a dedicated adapter, `budget()` would be completely blind to HuggingFace spend.

Shekel's `HuggingFaceAdapter` patches `InferenceClient.chat_completion` at runtime. Since `client.chat.completions.create()` delegates to `chat_completion` internally, all calls through either interface are tracked automatically.
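The delegation is what makes a single patch sufficient. The sketch below uses a stand-in `InferenceClient` (not the real `huggingface_hub` class) to show why both call paths land on the one patched method:

```python
# Stand-in for huggingface_hub.InferenceClient: illustration only, not the real SDK.
class InferenceClient:
    def chat_completion(self, messages, model=None, **kwargs):
        return {"usage": {"prompt_tokens": 10, "completion_tokens": 5}}

    @property
    def chat(self):
        # The convenience namespace delegates to chat_completion internally.
        outer = self

        class _Completions:
            def create(self, **kwargs):
                return outer.chat_completion(**kwargs)

        class _Chat:
            completions = _Completions()

        return _Chat()

calls = []
_original = InferenceClient.chat_completion

def tracked(self, *args, **kwargs):
    response = _original(self, *args, **kwargs)
    calls.append(response["usage"])  # OpenAI-compatible usage fields
    return response

InferenceClient.chat_completion = tracked  # one patch covers both entry points

client = InferenceClient()
client.chat_completion(messages=[], model="m")          # direct call: tracked
client.chat.completions.create(messages=[], model="m")  # delegated call: also tracked
```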

## Important: Custom Pricing Required

!!! warning "No bundled HuggingFace pricing"
HuggingFace hosts thousands of models with varying pricing. Shekel has no standard pricing table for HuggingFace models.

**Always pass `price_per_1k_tokens` to `budget()`** so Shekel can calculate costs:

```python
with budget(max_usd=1.00, price_per_1k_tokens={"input": 0.001, "output": 0.001}):
...
```

If you omit this, `b.spent` will always be `0.0` even though tokens were consumed.

## Basic Integration

```python
from huggingface_hub import InferenceClient
from shekel import budget

client = InferenceClient(token="your-hf-token")

with budget(
max_usd=1.00,
price_per_1k_tokens={"input": 0.001, "output": 0.001},
) as b:
response = client.chat.completions.create(
model="meta-llama/Llama-3.2-1B-Instruct",
messages=[{"role": "user", "content": "Explain transformers in one sentence."}],
max_tokens=50,
)
print(response.choices[0].message.content)
print(f"Cost: ${b.spent:.6f}")
```

## Streaming

```python
with budget(
max_usd=1.00,
price_per_1k_tokens={"input": 0.001, "output": 0.001},
) as b:
full_text = ""
for chunk in client.chat.completions.create(
model="meta-llama/Llama-3.2-1B-Instruct",
messages=[{"role": "user", "content": "List three ML frameworks."}],
max_tokens=60,
stream=True,
):
delta = chunk.choices[0].delta.content
if delta:
full_text += delta
print(delta, end="", flush=True)
print()
print(f"Cost: ${b.spent:.6f}")
```

!!! note "Streaming usage availability"
Many HuggingFace-hosted models do not return `usage` data in streaming chunks. In that case, `b.spent` will be `0.0` for streaming calls even if tokens were consumed. Non-streaming calls generally do return usage data.
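An adapter handling this gracefully has to treat usage as optional per chunk. The helper below is a hypothetical sketch of that defensive accumulation, with chunks modelled as plain dicts rather than the SDK's stream-output objects:

```python
def usage_from_chunks(chunks):
    """Sum token usage across streaming chunks, tolerating models that omit it.

    Chunks are plain dicts here for illustration; the real SDK yields
    typed stream-output objects.
    """
    prompt = completion = 0
    seen = False
    for chunk in chunks:
        usage = chunk.get("usage")
        if usage:
            seen = True
            prompt += usage.get("prompt_tokens", 0)
            completion += usage.get("completion_tokens", 0)
    # None means "no usage reported": spend stays at 0.0 rather than guessing.
    return (prompt, completion) if seen else None
```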

## Nested Budgets

```python
with budget(
max_usd=5.00,
name="pipeline",
price_per_1k_tokens={"input": 0.001, "output": 0.001},
) as total:
with budget(
max_usd=1.00,
name="step-1",
price_per_1k_tokens={"input": 0.001, "output": 0.001},
) as step1:
client.chat.completions.create(
model="meta-llama/Llama-3.2-1B-Instruct",
messages=[{"role": "user", "content": "Summarise this document."}],
max_tokens=100,
)

print(f"Step 1: ${step1.spent:.6f}")
print(f"Total: ${total.spent:.6f}")
```

## Budget Enforcement

```python
from shekel import BudgetExceededError

try:
with budget(
max_usd=0.10,
price_per_1k_tokens={"input": 0.001, "output": 0.001},
) as b:
for _ in range(100): # Shekel stops this when budget runs out
client.chat.completions.create(
model="meta-llama/Llama-3.2-1B-Instruct",
messages=[{"role": "user", "content": "Analyse this."}],
max_tokens=50,
)
except BudgetExceededError as e:
print(f"Stopped at ${e.spent:.4f}")
```

## Free vs Paid Models

HuggingFace offers two tiers for inference:

| Tier | Description | Pricing |
|---|---|---|
| Free (Serverless) | Limited RPM, shared infrastructure | Free but rate-limited |
| PRO / Inference Endpoints | Dedicated infrastructure | Pay per token / per hour |

For most chat models, use `InferenceClient` with an `hf_*` token. Free-tier models may return 503 when overloaded — add retry logic for production use.

## Tips for HuggingFace + Shekel

1. **Always set `price_per_1k_tokens`** — there is no default pricing for HuggingFace models
2. **Use non-streaming calls for accurate cost tracking** — many models omit usage in streaming
3. **Check model availability** — not all models are available on HuggingFace's serverless API
4. **Handle 503 errors** — free-tier endpoints can be temporarily unavailable under load
5. **Use `max_tokens`** to cap response length and control costs

## Next Steps

- [Google Gemini Integration](gemini.md)
- [Nested Budgets](../usage/nested-budgets.md)
- [Budget Enforcement](../usage/budget-enforcement.md)
- [Extending Shekel](../extending.md)
8 changes: 7 additions & 1 deletion docs/models.md
@@ -28,8 +28,14 @@ These models have zero-dependency pricing built into shekel:

| Model | Input / 1K | Output / 1K | Use Case |
|-------|-----------|-------------|----------|
| **gemini-1.5-flash** | $0.0000750 | $0.000300 | Fastest, cheapest |
| **gemini-2.5-pro** | $0.00125 | $0.01000 | Most capable Gemini |
| **gemini-2.5-flash** | $0.0000750 | $0.000300 | Fast, cost-efficient |
| **gemini-2.0-flash** | $0.0000750 | $0.000300 | Latest flash model |
| **gemini-1.5-pro** | $0.00125 | $0.00500 | Balanced quality/cost |
| **gemini-1.5-flash** | $0.0000750 | $0.000300 | Fastest, cheapest |

!!! note "Native Gemini SDK support"
To track costs when calling Gemini via the `google-genai` SDK directly (not through LiteLLM), install `shekel[gemini]`. See [Google Gemini Integration](integrations/gemini.md).

## Version Resolution
