Merged
1 change: 1 addition & 0 deletions .github/workflows/ci.yml
@@ -118,6 +118,7 @@ jobs:
      env:
        GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
        GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        HUGGING_FACE_API: ${{ secrets.HUGGING_FACE_API }}
      run: pytest tests/integrations/ -v --cov=shekel --cov-report=xml --cov-append

- name: Upload integration coverage
16 changes: 16 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,22 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added
- **Google Gemini Provider Adapter** (`shekel/providers/gemini.py`) — Native support for the `google-genai` SDK
  - Patches `google.genai.models.Models.generate_content` (non-streaming) and `generate_content_stream` (streaming) as two separate methods
  - Token extraction from `response.usage_metadata.prompt_token_count` / `candidates_token_count`
  - Model name captured from the `model` kwarg before the call (not available in Gemini response objects)
  - New pricing entries: `gemini-2.0-flash`, `gemini-2.5-flash`, `gemini-2.5-pro`
  - Install via `pip install shekel[gemini]`
- **HuggingFace Provider Adapter** (`shekel/providers/huggingface.py`) — Support for `huggingface_hub.InferenceClient`
  - Patches `InferenceClient.chat_completion` (the underlying method for `.chat.completions.create`)
  - OpenAI-compatible token extraction (`usage.prompt_tokens` / `usage.completion_tokens`)
  - Graceful handling when models don't return usage in streaming responses
  - Install via `pip install shekel[huggingface]`
- **Integration tests** for both new adapters with real API calls (skip gracefully on quota errors)
- **Examples**: `examples/gemini_demo.py`, `examples/huggingface_demo.py`
- **Documentation**: `docs/integrations/gemini.md`, `docs/integrations/huggingface.md`

## [0.2.5] - 2026-03-11

### Added
11 changes: 11 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,11 @@
# CLAUDE.md — Development Guidelines

## Test File Naming

Tests must be organized **by domain**, not by implementation unit or coverage goal.

- **Good**: `test_openai_wrappers.py`, `test_gemini_wrappers.py`, `test_fallback.py`
- **Bad**: `test_patch_coverage.py`, `test_patching.py`, `test_coverage_for_x.py`

Name test files after the feature or domain being exercised, not after the module
being covered or the motivation for writing the tests.
18 changes: 18 additions & 0 deletions docs/installation.md
@@ -35,6 +35,22 @@ If you're using models from both providers:
pip install shekel[all]
```

### Google Gemini

For Google Gemini via the `google-genai` SDK:

```bash
pip install shekel[gemini]
```

### HuggingFace Inference API

For HuggingFace's `InferenceClient`:

```bash
pip install shekel[huggingface]
```

### LiteLLM (100+ Providers)

For access to OpenAI, Anthropic, Gemini, Cohere, Ollama, Azure, Bedrock, and 90+ more through a unified interface:
@@ -107,6 +123,8 @@ Shekel has zero required dependencies beyond the Python standard library. The Op
| `openai>=1.0.0` | Optional | Track OpenAI API costs |
| `anthropic>=0.7.0` | Optional | Track Anthropic API costs |
| `litellm>=1.0.0` | Optional | Track costs via LiteLLM (100+ providers) |
| `google-genai>=1.0.0` | Optional | Track Google Gemini costs (native SDK) |
| `huggingface-hub>=0.20.0` | Optional | Track HuggingFace Inference API costs |
| `tokencost>=0.1.0` | Optional | Support 400+ models |
| `click>=8.0.0` | Optional | CLI tools |

159 changes: 159 additions & 0 deletions docs/integrations/gemini.md
@@ -0,0 +1,159 @@
# Google Gemini Integration

Shekel tracks costs and enforces budgets for [Google Gemini](https://ai.google.dev/) via the official `google-genai` Python SDK.

## Installation

```bash
pip install shekel[gemini]
```

## Why a dedicated adapter?

Unlike OpenAI and Anthropic, Gemini uses its own SDK (`google-genai`) that makes direct API calls — it does **not** route through the OpenAI SDK. Without a dedicated adapter, `budget()` would be completely blind to Gemini spend.

Shekel's `GeminiAdapter` patches two methods at runtime:

- `google.genai.models.Models.generate_content` — non-streaming calls
- `google.genai.models.Models.generate_content_stream` — streaming calls

All other Shekel features (nested budgets, fallback models, `BudgetExceededError`) work identically.
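The patching approach can be sketched as follows. This is a minimal illustration using a stand-in `Models` class rather than the real `google-genai` SDK, and a plain list in place of Shekel's budget machinery; the adapter's actual internals may differ:

```python
import functools

# Stand-ins for google.genai internals: illustration only, not the real SDK.
class FakeUsage:
    prompt_token_count = 12
    candidates_token_count = 34

class FakeResponse:
    usage_metadata = FakeUsage()

class Models:
    def generate_content(self, *, model, contents):
        return FakeResponse()

recorded = []  # (model, input_tokens, output_tokens) seen by the tracker

def patch_generate_content(cls):
    """Wrap generate_content so every call reports its token usage."""
    original = cls.generate_content

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        response = original(self, *args, **kwargs)
        usage = response.usage_metadata
        # The model name comes from the kwarg: Gemini responses don't carry it.
        recorded.append((kwargs.get("model"),
                         usage.prompt_token_count,
                         usage.candidates_token_count))
        return response

    cls.generate_content = wrapper

patch_generate_content(Models)
Models().generate_content(model="gemini-2.0-flash", contents="hi")
```

The same wrapping applies to `generate_content_stream`, with the difference that usage must be read from the final streamed chunk rather than a single response object.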

## Basic Integration

```python
import google.genai as genai
from shekel import budget

client = genai.Client(api_key="your-gemini-key")

with budget(max_usd=1.00) as b:
response = client.models.generate_content(
model="gemini-2.0-flash-lite",
contents="Explain quantum computing in one sentence.",
)
print(response.candidates[0].content.parts[0].text)
print(f"Cost: ${b.spent:.6f}")
```

## Streaming

Gemini streaming uses a **separate method** (`generate_content_stream`) rather than a `stream=True` kwarg — Shekel patches both:

```python
with budget(max_usd=1.00) as b:
for chunk in client.models.generate_content_stream(
model="gemini-2.0-flash-lite",
contents="List three benefits of Python.",
):
if chunk.candidates:
print(chunk.candidates[0].content.parts[0].text, end="", flush=True)
print()
print(f"Cost: ${b.spent:.6f}")
```

## Nested Budgets

Track costs across multi-step Gemini workflows:

```python
with budget(max_usd=5.00, name="pipeline") as total:
with budget(max_usd=1.00, name="research") as research:
client.models.generate_content(
model="gemini-2.0-flash-lite",
contents="Summarise recent AI trends.",
)

with budget(max_usd=2.00, name="analysis") as analysis:
client.models.generate_content(
model="gemini-2.0-flash",
contents="Analyse the implications of those trends.",
)

print(f"Research: ${research.spent:.6f}")
print(f"Analysis: ${analysis.spent:.6f}")
print(f"Total: ${total.spent:.6f}")
print(total.tree())
```

## Fallback Models

Switch to a cheaper Gemini model when spend reaches a threshold:

```python
with budget(
max_usd=0.50,
fallback={"at_pct": 0.8, "model": "gemini-2.0-flash-lite"},
) as b:
# Starts with gemini-2.0-flash; auto-switches at 80% ($0.40)
response = client.models.generate_content(
model="gemini-2.0-flash",
contents="Write a detailed market analysis.",
)

if b.model_switched:
print(f"Switched to fallback at ${b.switched_at_usd:.4f}")
```

!!! note "Same-provider fallback only"
Fallback must be another Gemini model. Cross-provider fallback (e.g. Gemini → GPT-4o) is not supported.

## Budget Enforcement

Stop a runaway Gemini loop automatically:

```python
from shekel import BudgetExceededError

try:
with budget(max_usd=2.00) as b:
for _ in range(100): # Shekel stops this when budget runs out
client.models.generate_content(
model="gemini-2.0-flash-lite",
contents="Analyse this document.",
)
except BudgetExceededError as e:
print(f"Stopped at ${e.spent:.4f} — saved the rest of the budget.")
```

## Supported Models and Pricing

| Model | Input (per 1k tokens) | Output (per 1k tokens) |
|---|---|---|
| `gemini-2.5-pro` | $0.00125 | $0.01000 |
| `gemini-2.5-flash` | $0.000075 | $0.00030 |
| `gemini-2.0-flash` | $0.000075 | $0.00030 |
| `gemini-2.0-flash-lite` | $0.000075 | $0.00030 |
| `gemini-1.5-pro` | $0.00125 | $0.00500 |
| `gemini-1.5-flash` | $0.000075 | $0.00030 |

Shekel uses prefix matching, so `gemini-2.0-flash-001` and similar versioned names resolve automatically.
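A longest-prefix lookup along these lines would produce that behavior. The prices are copied from the table above, but the resolution logic is a sketch, not Shekel's actual implementation:

```python
# Subset of the pricing table: (input, output) USD per 1k tokens.
PRICES = {
    "gemini-2.5-pro": (0.00125, 0.01000),
    "gemini-2.0-flash": (0.000075, 0.00030),
    "gemini-2.0-flash-lite": (0.000075, 0.00030),
}

def resolve_price(model: str):
    """Exact match first, then longest known prefix, so versioned names
    like gemini-2.0-flash-001 resolve to gemini-2.0-flash."""
    if model in PRICES:
        return PRICES[model]
    for known in sorted(PRICES, key=len, reverse=True):
        if model.startswith(known):
            return PRICES[known]
    return None  # unknown model: caller must supply custom pricing

resolve_price("gemini-2.0-flash-001")  # resolves to gemini-2.0-flash pricing
```

Checking longer prefixes first matters: it keeps `gemini-2.0-flash-lite` from being swallowed by the shorter `gemini-2.0-flash` entry.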

## Custom Pricing

For models not in the pricing table, pass `price_per_1k_tokens`:

```python
with budget(
max_usd=1.00,
price_per_1k_tokens={"input": 0.0001, "output": 0.0003},
) as b:
client.models.generate_content(
model="gemini-3-flash-preview",
contents="Hello.",
)
```

## Tips for Gemini + Shekel

1. **Use `generate_content_stream` for long responses** — streaming lets you stop mid-generation if the budget is hit
2. **Wrap at the workflow level**, not per-call, for accurate total cost tracking
3. **Set `warn_at=0.8`** to log a warning before the budget cap triggers
4. **Gemini free tier has per-minute limits** — use exponential backoff for production workloads
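For tip 4, a jittered exponential backoff wrapper is a common pattern. The sketch below is generic and hypothetical: the exception type is a stand-in, since the real SDK raises its own rate-limit error class:

```python
import random
import time

def call_with_backoff(call, max_attempts=5, base_delay=1.0):
    """Retry a rate-limited call with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RuntimeError:  # stand-in for the SDK's rate-limit error type
            if attempt == max_attempts - 1:
                raise
            # Waits 1s, 2s, 4s, ... plus jitter to avoid synchronized retries.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.25))
```

Usage: `call_with_backoff(lambda: client.models.generate_content(model="gemini-2.0-flash-lite", contents="..."))`. Because the wrapped call still runs inside the active `budget()` context, retried requests are tracked like any other.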

## Next Steps

- [HuggingFace Integration](huggingface.md)
- [Nested Budgets](../usage/nested-budgets.md)
- [Fallback Models](../usage/fallback-models.md)
- [Extending Shekel](../extending.md)
144 changes: 144 additions & 0 deletions docs/integrations/huggingface.md
@@ -0,0 +1,144 @@
# HuggingFace Integration

Shekel tracks costs and enforces budgets for [HuggingFace Inference API](https://huggingface.co/docs/inference-providers/en/index) via the `huggingface-hub` Python SDK's `InferenceClient`.

## Installation

```bash
pip install shekel[huggingface]
```

## Why a dedicated adapter?

HuggingFace's `InferenceClient` uses its own HTTP layer — it does **not** call the OpenAI SDK under the hood. Without a dedicated adapter, `budget()` would be completely blind to HuggingFace spend.

Shekel's `HuggingFaceAdapter` patches `InferenceClient.chat_completion` at runtime. Since `client.chat.completions.create()` delegates to `chat_completion` internally, all calls through either interface are tracked automatically.
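The delegation is what makes a single patch sufficient. The sketch below uses a stand-in `InferenceClient` (not the real `huggingface_hub` class) to show why both call paths land on the one patched method:

```python
# Stand-in for huggingface_hub.InferenceClient: illustration only, not the real SDK.
class InferenceClient:
    def chat_completion(self, messages, model=None, **kwargs):
        return {"usage": {"prompt_tokens": 10, "completion_tokens": 5}}

    @property
    def chat(self):
        # The convenience namespace delegates to chat_completion internally.
        outer = self

        class _Completions:
            def create(self, **kwargs):
                return outer.chat_completion(**kwargs)

        class _Chat:
            completions = _Completions()

        return _Chat()

calls = []
_original = InferenceClient.chat_completion

def tracked(self, *args, **kwargs):
    response = _original(self, *args, **kwargs)
    calls.append(response["usage"])  # OpenAI-compatible usage fields
    return response

InferenceClient.chat_completion = tracked  # one patch covers both entry points

client = InferenceClient()
client.chat_completion(messages=[], model="m")          # direct call: tracked
client.chat.completions.create(messages=[], model="m")  # delegated call: also tracked
```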

## Important: Custom Pricing Required

!!! warning "No bundled HuggingFace pricing"
HuggingFace hosts thousands of models with varying pricing. Shekel has no standard pricing table for HuggingFace models.

**Always pass `price_per_1k_tokens` to `budget()`** so Shekel can calculate costs:

```python
with budget(max_usd=1.00, price_per_1k_tokens={"input": 0.001, "output": 0.001}):
...
```

If you omit this, `b.spent` will always be `0.0` even though tokens were consumed.

## Basic Integration

```python
from huggingface_hub import InferenceClient
from shekel import budget

client = InferenceClient(token="your-hf-token")

with budget(
max_usd=1.00,
price_per_1k_tokens={"input": 0.001, "output": 0.001},
) as b:
response = client.chat.completions.create(
model="meta-llama/Llama-3.2-1B-Instruct",
messages=[{"role": "user", "content": "Explain transformers in one sentence."}],
max_tokens=50,
)
print(response.choices[0].message.content)
print(f"Cost: ${b.spent:.6f}")
```

## Streaming

```python
with budget(
max_usd=1.00,
price_per_1k_tokens={"input": 0.001, "output": 0.001},
) as b:
full_text = ""
for chunk in client.chat.completions.create(
model="meta-llama/Llama-3.2-1B-Instruct",
messages=[{"role": "user", "content": "List three ML frameworks."}],
max_tokens=60,
stream=True,
):
delta = chunk.choices[0].delta.content
if delta:
full_text += delta
print(delta, end="", flush=True)
print()
print(f"Cost: ${b.spent:.6f}")
```

!!! note "Streaming usage availability"
Many HuggingFace-hosted models do not return `usage` data in streaming chunks. In that case, `b.spent` will be `0.0` for streaming calls even if tokens were consumed. Non-streaming calls generally do return usage data.
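An adapter handling this gracefully has to treat usage as optional per chunk. The helper below is a hypothetical sketch of that defensive accumulation, with chunks modelled as plain dicts rather than the SDK's stream-output objects:

```python
def usage_from_chunks(chunks):
    """Sum token usage across streaming chunks, tolerating models that omit it.

    Chunks are plain dicts here for illustration; the real SDK yields
    typed stream-output objects.
    """
    prompt = completion = 0
    seen = False
    for chunk in chunks:
        usage = chunk.get("usage")
        if usage:
            seen = True
            prompt += usage.get("prompt_tokens", 0)
            completion += usage.get("completion_tokens", 0)
    # None means "no usage reported": spend stays at 0.0 rather than guessing.
    return (prompt, completion) if seen else None
```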

## Nested Budgets

```python
with budget(
max_usd=5.00,
name="pipeline",
price_per_1k_tokens={"input": 0.001, "output": 0.001},
) as total:
with budget(
max_usd=1.00,
name="step-1",
price_per_1k_tokens={"input": 0.001, "output": 0.001},
) as step1:
client.chat.completions.create(
model="meta-llama/Llama-3.2-1B-Instruct",
messages=[{"role": "user", "content": "Summarise this document."}],
max_tokens=100,
)

print(f"Step 1: ${step1.spent:.6f}")
print(f"Total: ${total.spent:.6f}")
```

## Budget Enforcement

```python
from shekel import BudgetExceededError

try:
with budget(
max_usd=0.10,
price_per_1k_tokens={"input": 0.001, "output": 0.001},
) as b:
for _ in range(100): # Shekel stops this when budget runs out
client.chat.completions.create(
model="meta-llama/Llama-3.2-1B-Instruct",
messages=[{"role": "user", "content": "Analyse this."}],
max_tokens=50,
)
except BudgetExceededError as e:
print(f"Stopped at ${e.spent:.4f}")
```

## Free vs Paid Models

HuggingFace offers two tiers for inference:

| Tier | Description | Pricing |
|---|---|---|
| Free (Serverless) | Limited RPM, shared infrastructure | Free but rate-limited |
| PRO / Inference Endpoints | Dedicated infrastructure | Pay per token / per hour |

For most chat models, use `InferenceClient` with an `hf_*` token. Free-tier models may return 503 when overloaded — add retry logic for production use.

## Tips for HuggingFace + Shekel

1. **Always set `price_per_1k_tokens`** — there is no default pricing for HuggingFace models
2. **Use non-streaming calls for accurate cost tracking** — many models omit usage in streaming
3. **Check model availability** — not all models are available on HuggingFace's serverless API
4. **Handle 503 errors** — free-tier endpoints can be temporarily unavailable under load
5. **Use `max_tokens`** to cap response length and control costs

## Next Steps

- [Google Gemini Integration](gemini.md)
- [Nested Budgets](../usage/nested-budgets.md)
- [Budget Enforcement](../usage/budget-enforcement.md)
- [Extending Shekel](../extending.md)
8 changes: 7 additions & 1 deletion docs/models.md
@@ -28,8 +28,14 @@ These models have zero-dependency pricing built into shekel:

| Model | Input / 1K | Output / 1K | Use Case |
|-------|-----------|-------------|----------|
| **gemini-1.5-flash** | $0.0000750 | $0.000300 | Fastest, cheapest |
| **gemini-2.5-pro** | $0.00125 | $0.01000 | Most capable Gemini |
| **gemini-2.5-flash** | $0.0000750 | $0.000300 | Fast, cost-efficient |
| **gemini-2.0-flash** | $0.0000750 | $0.000300 | Latest flash model |
| **gemini-1.5-pro** | $0.00125 | $0.00500 | Balanced quality/cost |
| **gemini-1.5-flash** | $0.0000750 | $0.000300 | Fastest, cheapest |

!!! note "Native Gemini SDK support"
To track costs when calling Gemini via the `google-genai` SDK directly (not through LiteLLM), install `shekel[gemini]`. See [Google Gemini Integration](integrations/gemini.md).

## Version Resolution
