
JOMEX — Joint Output Model Examination

Pre-Decision AI Risk Intelligence Framework with Regulatory Compliance

License: Apache 2.0 · Benchmark: 500 prompts · Languages: EN/AR/TR

What is JOMEX?

JOMEX is an open-source framework that cross-examines multiple LLMs before a response reaches the user. Instead of post-hoc safety filters, JOMEX acts as a pre-decision gateway — scoring risk, detecting disagreement, and producing auditable decisions.

```
User Query → [GPT-4o + Claude + Gemini + Llama] → JOMEX Scoring → Decision
                                                      ↓
                                          PASS / FLAG / ESCALATE / BLOCK
                                                      ↓
                                              ProofSlip (SHA-256)
```
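As an illustration of the audit-trail step, here is a minimal Python sketch of what a ProofSlip could look like: a SHA-256 digest over the query, model responses, risk score, and decision. The field names and schema are hypothetical, not JOMEX's actual format.

```python
import hashlib
import json
import time

def proof_slip(query: str, responses: list[str], risk: float, decision: str) -> dict:
    """Build an auditable record and seal it with a SHA-256 digest.

    The digest covers the query, per-model responses, risk score, and
    decision, so any later tampering with those fields is detectable.
    """
    record = {
        "query": query,
        "responses": responses,
        "risk": round(risk, 4),
        "decision": decision,
        "ts": int(time.time()),
    }
    # Canonical JSON (sorted keys) so the digest is reproducible.
    payload = json.dumps(record, sort_keys=True).encode()
    record["sha256"] = hashlib.sha256(payload).hexdigest()
    return record
```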

Key Results

| Metric | JOMEX (Calibrated) | Best Baseline | Improvement |
|---|---|---|---|
| F1 Score | 1.000 | 0.993 (Majority) | +0.007 |
| Recall | 100% | 98.6% (Majority) | +1.4% |
| MHR (Missed Harm) ↓ | 0.0% | 1.4% (Majority) | -1.4% |
| FBR (False Block) ↓ | 0.0% | 0.0% (Majority) | = |
| 4-Class Accuracy | 73.2% | 79.8% (Majority) | -6.6%* |
| AUROC | 1.000 | 1.000 | = |

*JOMEX sacrifices some granular accuracy to guarantee zero missed harms — by design.

> "Thresholds were empirically optimized via cost-sensitive grid search with domain-specific cost matrices. Optimization improved 4-class accuracy from 49.8% to 73.2% (+23.4%) while maintaining perfect recall, zero missed harm rate, and zero false block rate."

Why JOMEX is Different

Unlike prior guardrail and ensemble systems (MUSE, RADAR, Jo.E, NeMo, LlamaFW), JOMEX combines all of the following in one framework (see docs/COMPETITIVE_ANALYSIS.md for the full per-framework comparison):

  • Multi-LLM cross-examination
  • Mathematical scoring
  • Domain calibration
  • Pre-decision gateway
  • Explainable decisions
  • Cryptographic audit trail
  • Multilingual support (3+ languages)
  • EU AI Act mapping
  • Cost-sensitive calibration

Mathematical Framework

Risk = (α·D_ext + β·IIS + γ·R_struct) × W_reg

| Component | Formula | Purpose |
|---|---|---|
| D_ext | 1 - avg(Jaccard(rᵢ, rⱼ)) | External disagreement across models |
| IIS | σ(conf) / μ(conf) | Internal instability of confidence |
| R_struct | count(risk_markers) / N | Structural risk pattern detection |
| W_reg | {1.0, 1.3, 1.4, 1.5} | Domain-calibrated regulatory weight |
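The three components can be sketched in plain Python. The Jaccard-over-token-sets disagreement and coefficient-of-variation instability follow the formulas above; the weights `alpha`, `beta`, `gamma` and the marker list are illustrative assumptions, not values from the JOMEX codebase.

```python
from itertools import combinations
from statistics import mean, pstdev

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (1.0 if both empty)."""
    return len(a & b) / len(a | b) if a | b else 1.0

def d_ext(responses: list[str]) -> float:
    """External disagreement: 1 - mean pairwise Jaccard over token sets."""
    sets = [set(r.lower().split()) for r in responses]
    return 1.0 - mean(jaccard(a, b) for a, b in combinations(sets, 2))

def iis(confidences: list[float]) -> float:
    """Internal instability: coefficient of variation of model confidences."""
    return pstdev(confidences) / mean(confidences)

def r_struct(text: str, risk_markers: list[str]) -> float:
    """Fraction of risk markers present in the pooled response text."""
    text = text.lower()
    return sum(m in text for m in risk_markers) / len(risk_markers)

def risk_score(responses, confidences, markers, w_reg,
               alpha=0.4, beta=0.3, gamma=0.3):
    """Weighted blend of the three components, scaled by the domain weight."""
    base = (alpha * d_ext(responses)
            + beta * iis(confidences)
            + gamma * r_struct(" ".join(responses), markers))
    return base * w_reg
```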

Empirically Optimized Thresholds (via Youden Index + Cost-Sensitive Grid Search):

| Domain | W_reg | PASS ≤ | FLAG ≤ | ESCALATE ≤ |
|---|---|---|---|---|
| Medical | 1.5 | 0.365 | 0.655 | 0.980 |
| Legal | 1.3 | 0.315 | 0.400 | 0.655 |
| Financial | 1.4 | 0.340 | 0.400 | 0.660 |
| General | 1.0 | 0.250 | 0.400 | 0.550 |
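A minimal sketch of how the calibrated thresholds map a risk score to a decision, using the table's published values; the dictionary layout and function name are hypothetical, not jomex_engine.py's actual API.

```python
# (PASS ≤, FLAG ≤, ESCALATE ≤) cutoffs per domain, from the table above.
THRESHOLDS = {
    "medical":   (0.365, 0.655, 0.980),
    "legal":     (0.315, 0.400, 0.655),
    "financial": (0.340, 0.400, 0.660),
    "general":   (0.250, 0.400, 0.550),
}

def decide(risk: float, domain: str) -> str:
    """Map a risk score to PASS / FLAG / ESCALATE / BLOCK for a domain."""
    t_pass, t_flag, t_escalate = THRESHOLDS[domain]
    if risk <= t_pass:
        return "PASS"
    if risk <= t_flag:
        return "FLAG"
    if risk <= t_escalate:
        return "ESCALATE"
    return "BLOCK"
```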

Quick Start

1. Install Dependencies

```shell
pip install -r requirements.txt
```

2. Run Benchmark (Simulation Mode — No API Keys Needed)

```shell
cd benchmark
python benchmark_runner.py --mode simulate --dataset data/jomex_benchmark_v1.0_500.csv
```

3. Run with Real APIs (Live Mode)

```shell
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GOOGLE_API_KEY="AI..."

cd benchmark
python benchmark_runner.py --mode live --dataset data/jomex_benchmark_v1.0_500.csv
```

4. Calibrate Thresholds

```shell
python threshold_calibration.py
python compare_calibrated.py
```

5. Generate Report

```shell
python generate_report.py
```

Project Structure

```
jomex/
├── README.md                  ← This file
├── LICENSE                    ← Apache 2.0
├── requirements.txt           ← Python dependencies
├── .env.example               ← API key template
│
├── benchmark/                 ← Core benchmark suite
│   ├── jomex_engine.py        ← Scoring engine (D_ext, IIS, R_struct, W_reg)
│   ├── baselines.py           ← 5 baseline methods
│   ├── evaluation.py          ← 6 evaluation metrics
│   ├── benchmark_runner.py    ← Main runner (simulate/live)
│   ├── threshold_calibration.py ← ROC + Youden + Cost-Sensitive + Platt
│   ├── compare_calibrated.py  ← Before/after comparison
│   ├── generate_report.py     ← PDF report generator
│   ├── data/
│   │   └── jomex_benchmark_v1.0_500.csv  ← 500-prompt dataset
│   └── results/               ← Benchmark outputs
│       ├── optimized_config.json
│       ├── calibration_results.json
│       ├── calibration_comparison.json
│       └── roc_curve_data.json
│
├── site/                      ← Live demo website
│   └── index.html             ← Academic design (v3)
│
├── server/                    ← Production deployment
│   ├── deploy.sh              ← Server setup script
│   └── nginx.conf             ← Nginx configuration
│
└── docs/                      ← Documentation
    ├── JOMEX_Whitepaper_v1.0.pdf
    ├── JOMEX_Benchmark_Report_v1.0.pdf
    └── COMPETITIVE_ANALYSIS.md
```

Benchmark Dataset

JOMEX Benchmark v1.0 — the first comprehensive multilingual benchmark for domain-aware AI risk assessment:

  • 500 human-labeled prompts
  • 4 domains: Medical (160) · Legal (116) · Financial (133) · General (91)
  • 3 languages: English (288) · Arabic (111) · Turkish (101)
  • 4 decisions: PASS (282) · FLAG (100) · ESCALATE (89) · BLOCK (29)
  • 4 severity levels: Low · Medium · High · Critical

Evaluation Methods

| # | Method | Type | Description |
|---|---|---|---|
| 1 | JOMEX Full | Ours | D_ext + IIS + R_struct + W_reg (calibrated) |
| 2 | JOMEX D_ext Only | Ablation | Disagreement only |
| 3 | JOMEX No IIS | Ablation | No instability score |
| 4 | Single Model | Baseline | GPT-4o alone |
| 5 | Majority Vote | Baseline | 4-model majority |
| 6 | JSD Ensemble | Baseline | MUSE-style Jensen-Shannon divergence |
| 7 | Semantic Entropy | Baseline | Embedding cluster entropy |
| 8 | Random | Baseline | Calibrated random classifier |
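For concreteness, here is a minimal sketch of the Jensen-Shannon divergence underlying the MUSE-style baseline (method 6), computed over token-frequency distributions; the whitespace tokenization is an illustrative assumption, not the baseline's actual preprocessing.

```python
from collections import Counter
from math import log2

def token_dist(text: str) -> dict:
    """Normalized token-frequency distribution of a response."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def js_divergence(p: dict, q: dict) -> float:
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    0.0 for identical distributions, 1.0 for fully disjoint supports.
    """
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a, b):
        # KL(a || b), skipping zero-probability terms of a.
        return sum(a.get(k, 0.0) * log2(a.get(k, 0.0) / b[k])
                   for k in keys if a.get(k, 0.0) > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```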

Metrics

| Metric | What It Measures | Goal |
|---|---|---|
| AUROC | Discrimination ability | ↑ Higher |
| F1 | Balance of precision & recall | ↑ Higher |
| Precision | How many flagged items are truly unsafe | ↑ Higher |
| Recall | How many unsafe items are caught | ↑ Higher |
| FBR | False Block Rate (safe items blocked) | ↓ Lower |
| MHR | Missed Harm Rate (unsafe items passed) | ↓ Lower |
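FBR and MHR reduce to simple counts over binary labels. A sketch, assuming 1 = unsafe for labels and 1 = blocked/flagged for predictions (an encoding chosen here for illustration):

```python
def fbr_mhr(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Compute False Block Rate and Missed Harm Rate.

    FBR: fraction of safe items (y_true == 0) that were blocked/flagged.
    MHR: fraction of unsafe items (y_true == 1) that were passed.
    """
    safe_preds = [p for t, p in zip(y_true, y_pred) if t == 0]
    unsafe_preds = [p for t, p in zip(y_true, y_pred) if t == 1]
    fbr = sum(safe_preds) / len(safe_preds) if safe_preds else 0.0
    mhr = sum(1 - p for p in unsafe_preds) / len(unsafe_preds) if unsafe_preds else 0.0
    return fbr, mhr
```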

Ablation Study

| Variant | F1 | FBR | MHR | What It Proves |
|---|---|---|---|---|
| D_ext only | 0.621 | 48.9% | 0% | Disagreement alone over-blocks |
| + R_struct | 0.770 | 0% | 0% | Structural risk eliminates false blocks |
| + IIS + W_reg | 1.000 | 0% | 0% | Full pipeline is optimal |

Calibration

Thresholds are empirically optimized (not hand-tuned):

  • Youden Index — optimal binary threshold from ROC curve
  • Cost-Sensitive Grid Search — domain-specific cost matrices (e.g., missing a critical medical harm costs 100×, while a false block costs only 3×)
  • Platt Scaling — logistic calibration of risk scores (ECE: 0.107–0.404)
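The Youden step can be illustrated in a few lines: scan candidate cutoffs and keep the one maximizing TPR − FPR (the Youden J statistic). This is a sketch under binary labels (1 = unsafe), not the project's threshold_calibration.py implementation:

```python
def youden_threshold(scores: list[float], labels: list[int]) -> float:
    """Return the cutoff maximizing Youden's J = TPR - FPR.

    Candidate cutoffs are the distinct observed scores; a score is
    predicted positive (unsafe) when it is >= the cutoff.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_j = 0.0, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        tpr = tp / pos if pos else 0.0
        fpr = fp / neg if neg else 0.0
        if tpr - fpr > best_j:
            best_j, best_t = tpr - fpr, t
    return best_t
```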

Roadmap

  • Core engine (D_ext, IIS, R_struct, W_reg)
  • 500-prompt multilingual benchmark
  • 5 baseline comparisons
  • ROC + Youden + Cost-Sensitive calibration
  • Platt Scaling
  • Ablation study
  • ProofSlip audit trail
  • MTAR-CUSUM (Multi-Turn Accumulated Risk)
  • PAR (Policy Audit Replay)
  • Embedding-Enhanced D_ext (sentence-transformers)
  • Live API benchmark (GPT-4o + Claude + Gemini + Llama)
  • REST API server
  • EU AI Act compliance module
  • NeurIPS/IEEE paper submission

Citation

```bibtex
@software{jomex2026,
  title={JOMEX: Joint Output Model Examination for Pre-Decision AI Risk Intelligence},
  author={Ibrahim, Mohamed},
  year={2026},
  organization={Oplogica Inc.},
  license={Apache-2.0},
  url={https://github.com/oplogica/jomex}
}
```

License

Apache License 2.0 — See LICENSE for details.

Author

Mohamed Ibrahim — Founder & CEO, Oplogica Inc.

  • Framework: Mo817 (17-sector institutional transformation)
  • Related: CAUSENTIA — Sovereign Crisis Early Warning System

JOMEX defines a new category: Pre-Decision AI Risk Intelligence with Regulatory Compliance.
