Conversation


@mturk24 mturk24 commented Nov 7, 2025

Adds compilation checks that raise a warning when:

  • mode=binary, but the criteria is not a Yes/No question
  • mode=continuous, but the criteria does not clearly specify what is good vs. bad or desirable vs. undesirable
  • mode=continuous, but the criteria already specifies its own numeric scoring scheme

Also added support for mode='auto' compilation, which automatically determines whether the mode should be binary or continuous. This classifier raises a warning if the criteria looks appropriate for neither.
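
A minimal sketch of the kind of heuristics involved (reusing the compile_eval name from the tests below; the regexes and warning text are illustrative, not the merged implementation):

# Illustrative sketch only -- the merged implementation may use different heuristics and messages.
import re
import warnings

_YES_NO_STARTERS = re.compile(
    r"^(does|do|did|is|are|was|were|can|could|should|would|has|have|had|will)\b",
    re.IGNORECASE,
)
_GOOD_BAD_MARKERS = re.compile(
    r"\b(good|bad|desirable|undesirable|helpful|unhelpful|relevant)\b", re.IGNORECASE
)
_NUMERIC_SCHEME = re.compile(
    r"\b(scale\s+(of|from)|[0-9]+\s*(to|-)\s*[0-9]+|out\s+of\s+[0-9]+)\b", re.IGNORECASE
)

def compile_eval(name: str, criteria: str, mode: str = "auto", **identifiers: str) -> tuple[str, list[str]]:
    """Resolve 'auto' mode and collect compilation warnings (identifiers unused in this sketch)."""
    issues: list[str] = []
    is_yes_no = bool(_YES_NO_STARTERS.match(criteria.strip()))
    has_good_bad = bool(_GOOD_BAD_MARKERS.search(criteria))

    if mode == "auto":
        # Auto-determine the mode, then warn if the criteria fits neither pattern.
        mode = "binary" if is_yes_no else "continuous"
        if not is_yes_no and not has_good_bad:
            issues.append(f"Eval '{name}': criteria is neither a Yes/No question nor a clear good/bad specification.")
    elif mode == "binary":
        if not is_yes_no:
            issues.append(f"Eval '{name}': mode is 'binary' but criteria is not a Yes/No question.")
    elif mode == "continuous":
        if is_yes_no:
            issues.append(f"Eval '{name}': mode is 'continuous' but criteria is a Yes/No question.")
        if not has_good_bad:
            issues.append(f"Eval '{name}': mode is 'continuous' but criteria does not clearly specify good vs. bad.")
        if _NUMERIC_SCHEME.search(criteria):
            issues.append(f"Eval '{name}': criteria already defines its own numeric scoring scheme.")

    for msg in issues:
        warnings.warn(msg, UserWarning)
    return mode, issues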

After these changes, running the pre-commit hooks produces the following errors in tests/test_tlm_rag.py. It seems we should update those tests, which now appear incorrect:

format...................................................................Failed
- hook id: format
- files were modified by this hook

cmd [1] | ruff check --config '/Users/mturk/Library/Application Support/hatch/env/.internal/hatch-static-analysis/.config/smoa--PI/pyproject.toml' --fix .
Found 2 errors (2 fixed, 0 remaining).
cmd [2] | ruff format --config '/Users/mturk/Library/Application Support/hatch/env/.internal/hatch-static-analysis/.config/smoa--PI/pyproject.toml' .
43 files left unchanged

type-check...............................................................Failed
- hook id: type-check
- exit code: 1

tests/test_tlm_rag.py:1221: error: Non-overlapping container check (element type: "str", container item type: "EvalMetric")  [comparison-overlap]
tests/test_tlm_rag.py:1222: error: Unused "type: ignore" comment  [unused-ignore]
tests/test_tlm_rag.py:1222: error: Invalid index type "str" for "dict[EvalMetric, TrustworthyRAGScore]"; expected type "EvalMetric"  [index]
tests/test_tlm_rag.py:1222: note: Error code "index" not covered by "type: ignore" comment
Found 3 errors in 1 file (checked 43 source files)
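
(For context, the mypy failures indicate those assertions key the score dict with plain strings while it is now typed as dict[EvalMetric, TrustworthyRAGScore]. A hedged sketch of the kind of update, where the import path, metric member, and variable names are all assumptions and the real lines 1221-1222 may differ:)

from cleanlab_tlm.utils.rag import EvalMetric  # assumed import path; may differ

# Hypothetical before/after -- the actual assertions in tests/test_tlm_rag.py may differ.
# "scores" refers to the dict returned in that test.
# Before (fails type-check: the dict is keyed by EvalMetric, not str):
#     assert "context_sufficiency" in scores
#     value = scores["context_sufficiency"]  # type: ignore
# After (index with the enum member and drop the now-unused "type: ignore"):
assert EvalMetric.CONTEXT_SUFFICIENCY in scores
value = scores[EvalMetric.CONTEXT_SUFFICIENCY]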


mturk24 commented Nov 7, 2025

Here are some tests that were run using this new logic:

# ==================== TESTS ====================

print("=" * 80)
print("TESTING EVAL COMPILATION CHECKS")
print("=" * 80)

# Test 1: Binary mode with Yes/No question (should pass)
print("\n--- Test 1: Binary mode with Yes/No question ---")
mode1, warnings1 = compile_eval(
    name="context_sufficiency",
    criteria="Does the Document contain 100% of the information needed to answer the Question?",
    mode="binary",
    query_identifier="Question",
    context_identifier="Document"
)
print(f"Result: mode='{mode1}', warnings={len(warnings1)}")

# Test 2: Binary mode with non-Yes/No criteria (should warn)
print("\n--- Test 2: Binary mode with non-Yes/No criteria ---")
mode2, warnings2 = compile_eval(
    name="response_quality",
    criteria="Assess the quality of the Response based on accuracy and completeness.",
    mode="binary",
    response_identifier="Response"
)
print(f"Result: mode='{mode2}', warnings={len(warnings2)}")

# Test 3: Continuous mode without good/bad specification (should warn)
print("\n--- Test 3: Continuous mode without good/bad specification ---")
mode3, warnings3 = compile_eval(
    name="response_length",
    criteria="Evaluate the length of the Response.",
    mode="continuous",
    response_identifier="Response"
)
print(f"Result: mode='{mode3}', warnings={len(warnings3)}")

# Test 4: Continuous mode with numeric scoring scheme (should warn)
print("\n--- Test 4: Continuous mode with numeric scoring scheme ---")
mode4, warnings4 = compile_eval(
    name="relevance",
    criteria="Rate the relevance of the Response to the Query on a scale from 1 to 5, where 1 is not relevant and 5 is highly relevant.",
    mode="continuous",
    query_identifier="Query",
    response_identifier="Response"
)
print(f"Result: mode='{mode4}', warnings={len(warnings4)}")

# Test 5: Continuous mode with proper good/bad specification (should pass)
print("\n--- Test 5: Continuous mode with proper good/bad specification ---")
mode5, warnings5 = compile_eval(
    name="response_groundedness",
    criteria="Assess whether the Response is grounded in the Context. A good Response has all claims supported by the Context. A bad Response makes unsupported claims or introduces information not in the Context.",
    mode="continuous",
    context_identifier="Context",
    response_identifier="Response"
)
print(f"Result: mode='{mode5}', warnings={len(warnings5)}")

# Test 6: Auto mode with Yes/No question (should become binary)
print("\n--- Test 6: Auto mode with Yes/No question ---")
mode6, warnings6 = compile_eval(
    name="has_company_mention",
    criteria="Does the Response mention ACME Inc.?",
    mode="auto",
    response_identifier="Response"
)
print(f"Result: mode='{mode6}', warnings={len(warnings6)}")

# Test 7: Auto mode with good/bad specification (should become continuous)
print("\n--- Test 7: Auto mode with good/bad specification ---")
mode7, warnings7 = compile_eval(
    name="helpfulness",
    criteria="Determine if the Response is helpful. A helpful Response attempts to answer the question. An unhelpful Response avoids answering or deflects.",
    mode="auto",
    response_identifier="Response"
)
print(f"Result: mode='{mode7}', warnings={len(warnings7)}")

# Test 8: Auto mode with unclear criteria (should warn)
print("\n--- Test 8: Auto mode with unclear criteria ---")
mode8, warnings8 = compile_eval(
    name="unclear_eval",
    criteria="Evaluate the Response.",
    mode="auto",
    response_identifier="Response"
)
print(f"Result: mode='{mode8}', warnings={len(warnings8)}")

# Test 9: Continuous mode with Yes/No question (should warn)
print("\n--- Test 9: Continuous mode with Yes/No question ---")
mode9, warnings9 = compile_eval(
    name="is_polite",
    criteria="Is the Response polite and professional?",
    mode="continuous",
    response_identifier="Response"
)
print(f"Result: mode='{mode9}', warnings={len(warnings9)}")

print("\n" + "=" * 80)
print("TESTS COMPLETE")
print("=" * 80)

and here are the results of the tests, which all pass:

================================================================================
TESTING EVAL COMPILATION CHECKS
================================================================================

--- Test 1: Binary mode with Yes/No question ---
Result: mode='binary', warnings=0

--- Test 2: Binary mode with non-Yes/No criteria ---
/var/folders/9h/mkx7lrxd0xl_61j26qb0jlh00000gn/T/ipykernel_5510/664788559.py:128: UserWarning: Eval 'response_quality': Mode is set to 'binary' but criteria does not appear to be a Yes/No question. Consider rephrasing as a Yes/No question or changing mode to 'continuous'.
  warnings.warn(warning_msg, UserWarning)
Result: mode='binary', warnings=1

--- Test 3: Continuous mode without good/bad specification ---
/var/folders/9h/mkx7lrxd0xl_61j26qb0jlh00000gn/T/ipykernel_5510/664788559.py:150: UserWarning: Eval 'response_length': Mode is set to 'continuous' but criteria does not clearly specify what is good/desirable versus bad/undesirable. This may lead to inconsistent or unclear scoring.
  warnings.warn(warning_msg, UserWarning)
Result: mode='continuous', warnings=1

--- Test 4: Continuous mode with numeric scoring scheme ---
/var/folders/9h/mkx7lrxd0xl_61j26qb0jlh00000gn/T/ipykernel_5510/664788559.py:162: UserWarning: Eval 'relevance': Mode is set to 'continuous' but criteria already specifies a numeric scoring scheme. TrustworthyRAG will normalize scores to 0-1 range, which may conflict with your specified scoring scheme. Consider removing the numeric scoring scheme from the criteria.
  warnings.warn(warning_msg, UserWarning)
Result: mode='continuous', warnings=1

--- Test 5: Continuous mode with proper good/bad specification ---
Result: mode='continuous', warnings=0

--- Test 6: Auto mode with Yes/No question ---
Auto-determined mode: 'binary' for eval 'has_company_mention'
Result: mode='binary', warnings=0

--- Test 7: Auto mode with good/bad specification ---
Auto-determined mode: 'continuous' for eval 'helpfulness'
Result: mode='continuous', warnings=0

--- Test 8: Auto mode with unclear criteria ---
Auto-determined mode: 'continuous' for eval 'unclear_eval'
/var/folders/9h/mkx7lrxd0xl_61j26qb0jlh00000gn/T/ipykernel_5510/664788559.py:116: UserWarning: Eval 'unclear_eval': Criteria does not appear to be a Yes/No question and does not clearly specify what is good/bad or desirable/undesirable. This may result in poor evaluation quality.
  warnings.warn(warning_msg, UserWarning)
Result: mode='continuous', warnings=1

--- Test 9: Continuous mode with Yes/No question ---
/var/folders/9h/mkx7lrxd0xl_61j26qb0jlh00000gn/T/ipykernel_5510/664788559.py:139: UserWarning: Eval 'is_polite': Mode is set to 'continuous' but criteria appears to be a Yes/No question. Consider changing mode to 'binary' for more appropriate scoring.
  warnings.warn(warning_msg, UserWarning)
/var/folders/9h/mkx7lrxd0xl_61j26qb0jlh00000gn/T/ipykernel_5510/664788559.py:150: UserWarning: Eval 'is_polite': Mode is set to 'continuous' but criteria does not clearly specify what is good/desirable versus bad/undesirable. This may lead to inconsistent or unclear scoring.
  warnings.warn(warning_msg, UserWarning)
Result: mode='continuous', warnings=2

================================================================================
TESTS COMPLETE
================================================================================

@mturk24 mturk24 requested a review from aditya1503 November 7, 2025 21:46
Comment on lines 999 to 1001
from cleanlab_tlm.tlm import TLM

tlm = TLM(quality_preset="base")

@jwmueller jwmueller Nov 7, 2025


this seems potentially off; remember, I told you this would be the number one challenge.

You need to properly invoke TLM (using the right API key) from inside the TLM package. If you are unsure, then ask for help from Hui Wen.
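
(A minimal sketch of one way to wire that up, assuming the package's usual convention of an explicit api_key argument or the CLEANLAB_TLM_API_KEY environment variable; the right internal plumbing may differ, per the comment above:)

import os
from cleanlab_tlm.tlm import TLM

# Sketch only: resolve the API key explicitly instead of relying on implicit state.
# CLEANLAB_TLM_API_KEY and the api_key argument are assumed conventions here.
tlm = TLM(quality_preset="base", api_key=os.environ.get("CLEANLAB_TLM_API_KEY"))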

For example, specifying `response_identifier` as "AI Answer" means your `criteria` should refer to the response as "AI Answer".
Leave this value as None (the default) if this Eval doesn't consider the response.
- mode (str, optional): What type of evaluation these `criteria` correspond to, either "continuous" (default) or "binary".
+ mode (str, optional): What type of evaluation these `criteria` correspond to, either "continuous" (default), "binary", or "auto".

default should become 'auto' with this PR


before that change is pushed though (maybe consider saving it for next PR if you're not 100% confident in it), you should:

verify that all TRAG default Evals work fine with 'auto' (i.e., it correctly classifies their mode as continuous and does not raise warnings for any of them)
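
A hedged sketch of that verification, assuming get_default_evals from cleanlab_tlm.utils.rag and that each Eval exposes its criteria and identifiers as attributes (attribute names may differ):

import warnings
from cleanlab_tlm.utils.rag import get_default_evals

# Sketch: every default TrustworthyRAG Eval should auto-resolve to 'continuous'
# with no compilation warnings. compile_eval is the helper added in this PR.
for ev in get_default_evals():
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        mode, _ = compile_eval(
            name=ev.name,
            criteria=ev.criteria,
            mode="auto",
            query_identifier=ev.query_identifier,
            context_identifier=ev.context_identifier,
            response_identifier=ev.response_identifier,
        )
    assert mode == "continuous", f"{ev.name!r} auto-resolved to {mode!r}"
    assert not caught, f"{ev.name!r} raised warnings: {[str(w.message) for w in caught]}"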


@jwmueller jwmueller left a comment


had some changes to recommend, you can get Aditya's review after making those

@jwmueller jwmueller removed the request for review from aditya1503 November 7, 2025 22:43
aditya1503 and others added 5 commits November 8, 2025 07:53
Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>
@jwmueller

@aditya1503 Please take over the rest of this PR. Ask Elias for final review once it's ready

@aditya1503 aditya1503 merged commit c45e7cb into add_binary Nov 24, 2025
@aditya1503 aditya1503 deleted the add-eval-compilation-checks branch November 24, 2025 20:59