Conversation


@mturk24 mturk24 commented Nov 7, 2025

Adds compilation checks that raise a warning when:

  • mode=binary, but the criteria is not a Yes/No question
  • mode=continuous, but the criteria does not clearly specify what is good vs. bad or desirable vs. undesirable
  • mode=continuous, but the criteria already specifies its own numeric scoring scheme

Also added support for mode='auto' compilation, which automatically determines whether the mode should be binary or continuous. This classifier raises a warning if the criteria looks appropriate for neither.
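
A minimal sketch of the kind of heuristics involved (reusing the compile_eval name from the tests below; the regexes and warning text are illustrative, not the merged implementation):

# Illustrative sketch only -- the merged implementation may use different heuristics and messages.
import re
import warnings

_YES_NO_STARTERS = re.compile(
    r"^(does|do|did|is|are|was|were|can|could|should|would|has|have|had|will)\b",
    re.IGNORECASE,
)
_GOOD_BAD_MARKERS = re.compile(
    r"\b(good|bad|desirable|undesirable|helpful|unhelpful|relevant)\b", re.IGNORECASE
)
_NUMERIC_SCHEME = re.compile(
    r"\b(scale\s+(of|from)|[0-9]+\s*(to|-)\s*[0-9]+|out\s+of\s+[0-9]+)\b", re.IGNORECASE
)

def compile_eval(name: str, criteria: str, mode: str = "auto", **identifiers: str) -> tuple[str, list[str]]:
    """Resolve 'auto' mode and collect compilation warnings (identifiers unused in this sketch)."""
    issues: list[str] = []
    is_yes_no = bool(_YES_NO_STARTERS.match(criteria.strip()))
    has_good_bad = bool(_GOOD_BAD_MARKERS.search(criteria))

    if mode == "auto":
        # Auto-determine the mode, then warn if the criteria fits neither pattern.
        mode = "binary" if is_yes_no else "continuous"
        if not is_yes_no and not has_good_bad:
            issues.append(f"Eval '{name}': criteria is neither a Yes/No question nor a clear good/bad specification.")
    elif mode == "binary":
        if not is_yes_no:
            issues.append(f"Eval '{name}': mode is 'binary' but criteria is not a Yes/No question.")
    elif mode == "continuous":
        if is_yes_no:
            issues.append(f"Eval '{name}': mode is 'continuous' but criteria is a Yes/No question.")
        if not has_good_bad:
            issues.append(f"Eval '{name}': mode is 'continuous' but criteria does not clearly specify good vs. bad.")
        if _NUMERIC_SCHEME.search(criteria):
            issues.append(f"Eval '{name}': criteria already defines its own numeric scoring scheme.")

    for msg in issues:
        warnings.warn(msg, UserWarning)
    return mode, issues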

After these changes, running the pre-commit hooks produces the following errors in tests/test_tlm_rag.py. It seems we should update those tests, which now appear incorrect:

format...................................................................Failed
- hook id: format
- files were modified by this hook

cmd [1] | ruff check --config '/Users/mturk/Library/Application Support/hatch/env/.internal/hatch-static-analysis/.config/smoa--PI/pyproject.toml' --fix .
Found 2 errors (2 fixed, 0 remaining).
cmd [2] | ruff format --config '/Users/mturk/Library/Application Support/hatch/env/.internal/hatch-static-analysis/.config/smoa--PI/pyproject.toml' .
43 files left unchanged

type-check...............................................................Failed
- hook id: type-check
- exit code: 1

tests/test_tlm_rag.py:1221: error: Non-overlapping container check (element type: "str", container item type: "EvalMetric")  [comparison-overlap]
tests/test_tlm_rag.py:1222: error: Unused "type: ignore" comment  [unused-ignore]
tests/test_tlm_rag.py:1222: error: Invalid index type "str" for "dict[EvalMetric, TrustworthyRAGScore]"; expected type "EvalMetric"  [index]
tests/test_tlm_rag.py:1222: note: Error code "index" not covered by "type: ignore" comment
Found 3 errors in 1 file (checked 43 source files)
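
(For context, the mypy failures indicate those assertions key the score dict with plain strings while it is now typed as dict[EvalMetric, TrustworthyRAGScore]. A hedged sketch of the kind of update, where the import path, metric member, and variable names are all assumptions and the real lines 1221-1222 may differ:)

from cleanlab_tlm.utils.rag import EvalMetric  # assumed import path; may differ

# Hypothetical before/after -- the actual assertions in tests/test_tlm_rag.py may differ.
# "scores" refers to the dict returned in that test.
# Before (fails type-check: the dict is keyed by EvalMetric, not str):
#     assert "context_sufficiency" in scores
#     value = scores["context_sufficiency"]  # type: ignore
# After (index with the enum member and drop the now-unused "type: ignore"):
assert EvalMetric.CONTEXT_SUFFICIENCY in scores
value = scores[EvalMetric.CONTEXT_SUFFICIENCY]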


mturk24 commented Nov 7, 2025

Here are some tests that were run using this new logic:

# ==================== TESTS ====================

print("=" * 80)
print("TESTING EVAL COMPILATION CHECKS")
print("=" * 80)

# Test 1: Binary mode with Yes/No question (should pass)
print("\n--- Test 1: Binary mode with Yes/No question ---")
mode1, warnings1 = compile_eval(
    name="context_sufficiency",
    criteria="Does the Document contain 100% of the information needed to answer the Question?",
    mode="binary",
    query_identifier="Question",
    context_identifier="Document"
)
print(f"Result: mode='{mode1}', warnings={len(warnings1)}")

# Test 2: Binary mode with non-Yes/No criteria (should warn)
print("\n--- Test 2: Binary mode with non-Yes/No criteria ---")
mode2, warnings2 = compile_eval(
    name="response_quality",
    criteria="Assess the quality of the Response based on accuracy and completeness.",
    mode="binary",
    response_identifier="Response"
)
print(f"Result: mode='{mode2}', warnings={len(warnings2)}")

# Test 3: Continuous mode without good/bad specification (should warn)
print("\n--- Test 3: Continuous mode without good/bad specification ---")
mode3, warnings3 = compile_eval(
    name="response_length",
    criteria="Evaluate the length of the Response.",
    mode="continuous",
    response_identifier="Response"
)
print(f"Result: mode='{mode3}', warnings={len(warnings3)}")

# Test 4: Continuous mode with numeric scoring scheme (should warn)
print("\n--- Test 4: Continuous mode with numeric scoring scheme ---")
mode4, warnings4 = compile_eval(
    name="relevance",
    criteria="Rate the relevance of the Response to the Query on a scale from 1 to 5, where 1 is not relevant and 5 is highly relevant.",
    mode="continuous",
    query_identifier="Query",
    response_identifier="Response"
)
print(f"Result: mode='{mode4}', warnings={len(warnings4)}")

# Test 5: Continuous mode with proper good/bad specification (should pass)
print("\n--- Test 5: Continuous mode with proper good/bad specification ---")
mode5, warnings5 = compile_eval(
    name="response_groundedness",
    criteria="Assess whether the Response is grounded in the Context. A good Response has all claims supported by the Context. A bad Response makes unsupported claims or introduces information not in the Context.",
    mode="continuous",
    context_identifier="Context",
    response_identifier="Response"
)
print(f"Result: mode='{mode5}', warnings={len(warnings5)}")

# Test 6: Auto mode with Yes/No question (should become binary)
print("\n--- Test 6: Auto mode with Yes/No question ---")
mode6, warnings6 = compile_eval(
    name="has_company_mention",
    criteria="Does the Response mention ACME Inc.?",
    mode="auto",
    response_identifier="Response"
)
print(f"Result: mode='{mode6}', warnings={len(warnings6)}")

# Test 7: Auto mode with good/bad specification (should become continuous)
print("\n--- Test 7: Auto mode with good/bad specification ---")
mode7, warnings7 = compile_eval(
    name="helpfulness",
    criteria="Determine if the Response is helpful. A helpful Response attempts to answer the question. An unhelpful Response avoids answering or deflects.",
    mode="auto",
    response_identifier="Response"
)
print(f"Result: mode='{mode7}', warnings={len(warnings7)}")

# Test 8: Auto mode with unclear criteria (should warn)
print("\n--- Test 8: Auto mode with unclear criteria ---")
mode8, warnings8 = compile_eval(
    name="unclear_eval",
    criteria="Evaluate the Response.",
    mode="auto",
    response_identifier="Response"
)
print(f"Result: mode='{mode8}', warnings={len(warnings8)}")

# Test 9: Continuous mode with Yes/No question (should warn)
print("\n--- Test 9: Continuous mode with Yes/No question ---")
mode9, warnings9 = compile_eval(
    name="is_polite",
    criteria="Is the Response polite and professional?",
    mode="continuous",
    response_identifier="Response"
)
print(f"Result: mode='{mode9}', warnings={len(warnings9)}")

print("\n" + "=" * 80)
print("TESTS COMPLETE")
print("=" * 80)

and here are the results of the tests, which all pass:

================================================================================
TESTING EVAL COMPILATION CHECKS
================================================================================

--- Test 1: Binary mode with Yes/No question ---
Result: mode='binary', warnings=0

--- Test 2: Binary mode with non-Yes/No criteria ---
/var/folders/9h/mkx7lrxd0xl_61j26qb0jlh00000gn/T/ipykernel_5510/664788559.py:128: UserWarning: Eval 'response_quality': Mode is set to 'binary' but criteria does not appear to be a Yes/No question. Consider rephrasing as a Yes/No question or changing mode to 'continuous'.
  warnings.warn(warning_msg, UserWarning)
Result: mode='binary', warnings=1

--- Test 3: Continuous mode without good/bad specification ---
/var/folders/9h/mkx7lrxd0xl_61j26qb0jlh00000gn/T/ipykernel_5510/664788559.py:150: UserWarning: Eval 'response_length': Mode is set to 'continuous' but criteria does not clearly specify what is good/desirable versus bad/undesirable. This may lead to inconsistent or unclear scoring.
  warnings.warn(warning_msg, UserWarning)
Result: mode='continuous', warnings=1

--- Test 4: Continuous mode with numeric scoring scheme ---
/var/folders/9h/mkx7lrxd0xl_61j26qb0jlh00000gn/T/ipykernel_5510/664788559.py:162: UserWarning: Eval 'relevance': Mode is set to 'continuous' but criteria already specifies a numeric scoring scheme. TrustworthyRAG will normalize scores to 0-1 range, which may conflict with your specified scoring scheme. Consider removing the numeric scoring scheme from the criteria.
  warnings.warn(warning_msg, UserWarning)
Result: mode='continuous', warnings=1

--- Test 5: Continuous mode with proper good/bad specification ---
Result: mode='continuous', warnings=0

--- Test 6: Auto mode with Yes/No question ---
Auto-determined mode: 'binary' for eval 'has_company_mention'
Result: mode='binary', warnings=0

--- Test 7: Auto mode with good/bad specification ---
Auto-determined mode: 'continuous' for eval 'helpfulness'
Result: mode='continuous', warnings=0

--- Test 8: Auto mode with unclear criteria ---
Auto-determined mode: 'continuous' for eval 'unclear_eval'
/var/folders/9h/mkx7lrxd0xl_61j26qb0jlh00000gn/T/ipykernel_5510/664788559.py:116: UserWarning: Eval 'unclear_eval': Criteria does not appear to be a Yes/No question and does not clearly specify what is good/bad or desirable/undesirable. This may result in poor evaluation quality.
  warnings.warn(warning_msg, UserWarning)
Result: mode='continuous', warnings=1

--- Test 9: Continuous mode with Yes/No question ---
/var/folders/9h/mkx7lrxd0xl_61j26qb0jlh00000gn/T/ipykernel_5510/664788559.py:139: UserWarning: Eval 'is_polite': Mode is set to 'continuous' but criteria appears to be a Yes/No question. Consider changing mode to 'binary' for more appropriate scoring.
  warnings.warn(warning_msg, UserWarning)
/var/folders/9h/mkx7lrxd0xl_61j26qb0jlh00000gn/T/ipykernel_5510/664788559.py:150: UserWarning: Eval 'is_polite': Mode is set to 'continuous' but criteria does not clearly specify what is good/desirable versus bad/undesirable. This may lead to inconsistent or unclear scoring.
  warnings.warn(warning_msg, UserWarning)
Result: mode='continuous', warnings=2

================================================================================
TESTS COMPLETE
================================================================================

@mturk24 mturk24 requested a review from aditya1503 November 7, 2025 21:46
Comment on lines 999 to 1001
from cleanlab_tlm.tlm import TLM

tlm = TLM(quality_preset="base")

@jwmueller jwmueller Nov 7, 2025


this seems potentially off; remember, I told you this would be the number one challenge.

You need to properly invoke TLM (using the right API key) from inside the TLM package. If you are unsure, then ask for help from Hui Wen.
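
(A minimal sketch of one way to wire that up, assuming the package's usual convention of an explicit api_key argument or the CLEANLAB_TLM_API_KEY environment variable; the right internal plumbing may differ, per the comment above:)

import os
from cleanlab_tlm.tlm import TLM

# Sketch only: resolve the API key explicitly instead of relying on implicit state.
# CLEANLAB_TLM_API_KEY and the api_key argument are assumed conventions here.
tlm = TLM(quality_preset="base", api_key=os.environ.get("CLEANLAB_TLM_API_KEY"))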

For example, specifying `response_identifier` as "AI Answer" means your `criteria` should refer to the response as "AI Answer".
Leave this value as None (the default) if this Eval doesn't consider the response.
- mode (str, optional): What type of evaluation these `criteria` correspond to, either "continuous" (default) or "binary".
+ mode (str, optional): What type of evaluation these `criteria` correspond to, either "continuous" (default), "binary", or "auto".

default should become 'auto' with this PR


before that change is pushed though (maybe consider saving it for next PR if you're not 100% confident in it), you should:

verify that all TRAG default Evals work fine with 'auto' (i.e., it correctly classifies their mode as continuous and does not raise warnings for any of them)
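
A hedged sketch of that verification, assuming get_default_evals from cleanlab_tlm.utils.rag and that each Eval exposes its criteria and identifiers as attributes (attribute names may differ):

import warnings
from cleanlab_tlm.utils.rag import get_default_evals

# Sketch: every default TrustworthyRAG Eval should auto-resolve to 'continuous'
# with no compilation warnings. compile_eval is the helper added in this PR.
for ev in get_default_evals():
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        mode, _ = compile_eval(
            name=ev.name,
            criteria=ev.criteria,
            mode="auto",
            query_identifier=ev.query_identifier,
            context_identifier=ev.context_identifier,
            response_identifier=ev.response_identifier,
        )
    assert mode == "continuous", f"{ev.name!r} auto-resolved to {mode!r}"
    assert not caught, f"{ev.name!r} raised warnings: {[str(w.message) for w in caught]}"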


@jwmueller jwmueller left a comment


had some changes to recommend, you can get Aditya's review after making those

@jwmueller jwmueller removed the request for review from aditya1503 November 7, 2025 22:43
aditya1503 and others added 5 commits November 8, 2025 07:53
Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>
@jwmueller

@aditya1503 Please take over the rest of this PR. Ask Elias for final review once it's ready

@aditya1503 aditya1503 merged commit c45e7cb into add_binary Nov 24, 2025
@aditya1503 aditya1503 deleted the add-eval-compilation-checks branch November 24, 2025 20:59