An HTTP proxy that sits between end users and LLM servers and bills users per token.
- Python >= 3.11
- Python packages listed as dependencies in `pyproject.toml`
```sh
python3 -m venv venv
source venv/bin/activate
sqlite3 db.sqlite < llmproxy/schema.sql
```
```sh
python3 -m llmproxy
```

See `llmproxy/config.toml` for an example configuration file. The program will update the list of configured backends from the config file on SIGHUP:

```sh
pkill -f -HUP 'python3? .*llmproxy'
# or in Docker
docker kill -s=SIGHUP CONTAINER
```

To run the tests:

```sh
python3 -m unittest
```

To build the Docker image, run the following command:

```sh
docker build -t ghcr.io/comtegra/llmproxy:master .
```

The following instructions assume that you already have a Docker Compose repository and file.
- Create an SQLite database according to `llmproxy/schema.sql`:

  ```sh
  sqlite3 db.sqlite < llmproxy/schema.sql
  ```

- Copy `llmproxy/config.toml` to your repository. A good relative path would be `secrets/llm-billing-proxy-config.toml`.
- Add appropriate entries to your compose file's `services` and `secrets` sections. There's a template `compose.yml` in this repository.
- Bring the new service up and verify it started correctly (e.g. `docker compose up llm-billing-proxy`).
- Create a MongoDB user with the following privileges (see section Database below for JS snippets):

  ```js
  {resource: {db: "cgc", collection: "api_keys"}, actions: ["find"]}
  {resource: {db: "billing", collection: "events_oneoff"}, actions: ["insert"]}
  ```

- Copy `llmproxy/config.toml` to your repository. A good relative path would be `secrets/llm-billing-proxy-config.toml`.
- In the config, edit `uri` in section `db` so that it includes credentials for the Mongo user (e.g. `mongodb://myuser:mypass@host:27017/?authSource=cgc`).
- Add appropriate entries to your compose file's `services` and `secrets` sections. There's a template `compose.yml` in this repository.
- Bring the new service up and verify it started correctly (e.g. `docker compose up llm-billing-proxy`).
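Based only on the key names mentioned in the steps above (`uri` in section `db`), the relevant configuration fragment might look like this; the exact layout and any other keys are assumptions, so consult `llmproxy/config.toml` for the authoritative example:

```toml
[db]
# Include the Mongo user's credentials in the connection URI.
uri = "mongodb://myuser:mypass@host:27017/?authSource=cgc"
```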
Sample data can be added to an SQLite database using `sample.sql`; ditto for MongoDB with `sample.js`. After loading sample data you may run a query like this:

```sh
curl -v -H 'Content-Type: application/json' -H 'Authorization: Bearer token2' \
    -d '{"messages": [{"role": "system", "content": "You are an assistant."}, {"role": "user", "content": "Write a limerick about python exceptions"}], "model": "llama31-70b", "stream": true}' \
    'http://localhost:8080/v1/chat/completions'
```

If using SQLite, see `llmproxy/schema.sql`. Otherwise read on.
This program uses MongoDB for authentication and completion logging. The schema is compatible with Comtegra GPU Core. The MongoDB user needs the following privileges:

```js
db.getSiblingDB("cgc").createRole({
  role: "apiKeysReader",
  privileges: [
    {
      resource: {db: "cgc", collection: "api_keys"},
      actions: ["find"],
    },
  ],
  roles: [],
});
db.getSiblingDB("billing").createRole({
  role: "completionBillingWriter",
  privileges: [
    {
      resource: {db: "billing", collection: "events_oneoff"},
      actions: ["insert"],
    },
  ],
  roles: [],
});
db.getSiblingDB("cgc").createUser({
  user: "llm-billing-proxy",
  pwd: "mypass",
  roles: [
    { role: "apiKeysReader", db: "cgc" },
    { role: "completionBillingWriter", db: "billing" },
  ],
});
```

Users authenticate via bearer tokens. Tokens are stored in the database as SHA-256 hashes. They can be generated by the following command:

```sh
python3 -c 'import hashlib, secrets; print("Token:", t:=secrets.token_urlsafe(64)); print("Hash:", hashlib.sha256(t.encode()).hexdigest())'
```

If using SQLite, see `llmproxy/schema.sql`. Otherwise read on.
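The token scheme above (store only the SHA-256 hash, compare it on each request) can be sketched like this; `make_token` and `verify` are hypothetical helper names, not the proxy's actual API:

```python
import hashlib
import secrets

def make_token() -> tuple[str, str]:
    """Generate a bearer token and the SHA-256 hash to store in the database."""
    token = secrets.token_urlsafe(64)
    return token, hashlib.sha256(token.encode()).hexdigest()

def verify(presented_token: str, stored_hash: str) -> bool:
    """Check a presented bearer token against the stored SHA-256 hash."""
    digest = hashlib.sha256(presented_token.encode()).hexdigest()
    # Constant-time comparison to avoid leaking hash prefixes via timing.
    return secrets.compare_digest(digest, stored_hash)
```

Storing only the hash means a database leak does not reveal usable tokens.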
When a user attempts to authenticate, the following query is performed against the `cgc.api_keys` collection:

```js
{
  "access_level": "LLM",
  "secret": TOKEN-HASH,
  "$or": [
    {"date_expiry": {"$gt": new Date()}},
    {"date_expiry": null},
  ],
}
```

`TOKEN-HASH` is replaced with the SHA-256 hash of the bearer token. The documents are required to have an additional field: `user_id`.
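In Python the same filter can be built as a plain dict, in the shape accepted by e.g. `pymongo`'s `find_one`; the helper name is hypothetical:

```python
import hashlib
from datetime import datetime, timezone

def api_key_query(bearer_token: str) -> dict:
    """Build the cgc.api_keys filter for a presented bearer token."""
    token_hash = hashlib.sha256(bearer_token.encode()).hexdigest()
    return {
        "access_level": "LLM",
        "secret": token_hash,
        # Accept keys that either expire in the future or never expire.
        "$or": [
            {"date_expiry": {"$gt": datetime.now(timezone.utc)}},
            {"date_expiry": None},
        ],
    }
```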
If using SQLite see llmproxy/schema.sql. Otherwise read on.
When a user performs a completion, two events (prompt and completion token counts) are inserted into the collection `billing.events_oneoff`. These documents have the following fields:

- `date_created` -- date and time when the request finished processing
- `user_id` -- `user_id` of the API key
- `api_key_id` -- id of the API key
- `product` -- a string in the following format: `MODEL/DEVICE/TYPE`, where
  - `MODEL` is the name of the backend
  - `DEVICE` is the name of the GPU where the model runs
  - `TYPE` is `prompt` or `completion`
- `quantity` -- token count
- `request_id` -- request ID to correlate prompt/completion counts
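As an illustration of the field layout above, the pair of documents for one request might be assembled like this. The helper name and sample values are hypothetical; only the field names and the `MODEL/DEVICE/TYPE` product format come from the description above.

```python
from datetime import datetime, timezone

def billing_events(user_id, api_key_id, request_id,
                   model, device, prompt_tokens, completion_tokens):
    """Build the two billing.events_oneoff documents for one completion."""
    now = datetime.now(timezone.utc)
    events = []
    for kind, count in (("prompt", prompt_tokens),
                        ("completion", completion_tokens)):
        events.append({
            "date_created": now,
            "user_id": user_id,
            "api_key_id": api_key_id,
            # Product string format: MODEL/DEVICE/TYPE
            "product": f"{model}/{device}/{kind}",
            "quantity": count,
            # Shared request ID correlates the prompt/completion counts.
            "request_id": request_id,
        })
    return events
```

Keeping the two counts as separate events lets prompt and completion tokens be priced independently per product string.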