Added support for evals compilation checks and auto mode #131
Conversation
Here are some tests that were run using this new logic:

# ==================== TESTS ====================
print("=" * 80)
print("TESTING EVAL COMPILATION CHECKS")
print("=" * 80)

# Test 1: Binary mode with Yes/No question (should pass)
print("\n--- Test 1: Binary mode with Yes/No question ---")
mode1, warnings1 = compile_eval(
    name="context_sufficiency",
    criteria="Does the Document contain 100% of the information needed to answer the Question?",
    mode="binary",
    query_identifier="Question",
    context_identifier="Document"
)
print(f"Result: mode='{mode1}', warnings={len(warnings1)}")

# Test 2: Binary mode with non-Yes/No criteria (should warn)
print("\n--- Test 2: Binary mode with non-Yes/No criteria ---")
mode2, warnings2 = compile_eval(
    name="response_quality",
    criteria="Assess the quality of the Response based on accuracy and completeness.",
    mode="binary",
    response_identifier="Response"
)
print(f"Result: mode='{mode2}', warnings={len(warnings2)}")

# Test 3: Continuous mode without good/bad specification (should warn)
print("\n--- Test 3: Continuous mode without good/bad specification ---")
mode3, warnings3 = compile_eval(
    name="response_length",
    criteria="Evaluate the length of the Response.",
    mode="continuous",
    response_identifier="Response"
)
print(f"Result: mode='{mode3}', warnings={len(warnings3)}")

# Test 4: Continuous mode with numeric scoring scheme (should warn)
print("\n--- Test 4: Continuous mode with numeric scoring scheme ---")
mode4, warnings4 = compile_eval(
    name="relevance",
    criteria="Rate the relevance of the Response to the Query on a scale from 1 to 5, where 1 is not relevant and 5 is highly relevant.",
    mode="continuous",
    query_identifier="Query",
    response_identifier="Response"
)
print(f"Result: mode='{mode4}', warnings={len(warnings4)}")

# Test 5: Continuous mode with proper good/bad specification (should pass)
print("\n--- Test 5: Continuous mode with proper good/bad specification ---")
mode5, warnings5 = compile_eval(
    name="response_groundedness",
    criteria="Assess whether the Response is grounded in the Context. A good Response has all claims supported by the Context. A bad Response makes unsupported claims or introduces information not in the Context.",
    mode="continuous",
    context_identifier="Context",
    response_identifier="Response"
)
print(f"Result: mode='{mode5}', warnings={len(warnings5)}")

# Test 6: Auto mode with Yes/No question (should become binary)
print("\n--- Test 6: Auto mode with Yes/No question ---")
mode6, warnings6 = compile_eval(
    name="has_company_mention",
    criteria="Does the Response mention ACME Inc.?",
    mode="auto",
    response_identifier="Response"
)
print(f"Result: mode='{mode6}', warnings={len(warnings6)}")

# Test 7: Auto mode with good/bad specification (should become continuous)
print("\n--- Test 7: Auto mode with good/bad specification ---")
mode7, warnings7 = compile_eval(
    name="helpfulness",
    criteria="Determine if the Response is helpful. A helpful Response attempts to answer the question. An unhelpful Response avoids answering or deflects.",
    mode="auto",
    response_identifier="Response"
)
print(f"Result: mode='{mode7}', warnings={len(warnings7)}")

# Test 8: Auto mode with unclear criteria (should warn)
print("\n--- Test 8: Auto mode with unclear criteria ---")
mode8, warnings8 = compile_eval(
    name="unclear_eval",
    criteria="Evaluate the Response.",
    mode="auto",
    response_identifier="Response"
)
print(f"Result: mode='{mode8}', warnings={len(warnings8)}")

# Test 9: Continuous mode with Yes/No question (should warn)
print("\n--- Test 9: Continuous mode with Yes/No question ---")
mode9, warnings9 = compile_eval(
    name="is_polite",
    criteria="Is the Response polite and professional?",
    mode="continuous",
    response_identifier="Response"
)
print(f"Result: mode='{mode9}', warnings={len(warnings9)}")

print("\n" + "=" * 80)
print("TESTS COMPLETE")
print("=" * 80)

And here are the results of the tests, which all pass:

================================================================================
TESTING EVAL COMPILATION CHECKS
================================================================================
--- Test 1: Binary mode with Yes/No question ---
Result: mode='binary', warnings=0
--- Test 2: Binary mode with non-Yes/No criteria ---
/var/folders/9h/mkx7lrxd0xl_61j26qb0jlh00000gn/T/ipykernel_5510/664788559.py:128: UserWarning: Eval 'response_quality': Mode is set to 'binary' but criteria does not appear to be a Yes/No question. Consider rephrasing as a Yes/No question or changing mode to 'continuous'.
warnings.warn(warning_msg, UserWarning)
Result: mode='binary', warnings=1
--- Test 3: Continuous mode without good/bad specification ---
/var/folders/9h/mkx7lrxd0xl_61j26qb0jlh00000gn/T/ipykernel_5510/664788559.py:150: UserWarning: Eval 'response_length': Mode is set to 'continuous' but criteria does not clearly specify what is good/desirable versus bad/undesirable. This may lead to inconsistent or unclear scoring.
warnings.warn(warning_msg, UserWarning)
Result: mode='continuous', warnings=1
--- Test 4: Continuous mode with numeric scoring scheme ---
/var/folders/9h/mkx7lrxd0xl_61j26qb0jlh00000gn/T/ipykernel_5510/664788559.py:162: UserWarning: Eval 'relevance': Mode is set to 'continuous' but criteria already specifies a numeric scoring scheme. TrustworthyRAG will normalize scores to 0-1 range, which may conflict with your specified scoring scheme. Consider removing the numeric scoring scheme from the criteria.
warnings.warn(warning_msg, UserWarning)
Result: mode='continuous', warnings=1
--- Test 5: Continuous mode with proper good/bad specification ---
Result: mode='continuous', warnings=0
--- Test 6: Auto mode with Yes/No question ---
✓ Auto-determined mode: 'binary' for eval 'has_company_mention'
Result: mode='binary', warnings=0
--- Test 7: Auto mode with good/bad specification ---
✓ Auto-determined mode: 'continuous' for eval 'helpfulness'
Result: mode='continuous', warnings=0
--- Test 8: Auto mode with unclear criteria ---
✓ Auto-determined mode: 'continuous' for eval 'unclear_eval'
/var/folders/9h/mkx7lrxd0xl_61j26qb0jlh00000gn/T/ipykernel_5510/664788559.py:116: UserWarning: Eval 'unclear_eval': Criteria does not appear to be a Yes/No question and does not clearly specify what is good/bad or desirable/undesirable. This may result in poor evaluation quality.
warnings.warn(warning_msg, UserWarning)
Result: mode='continuous', warnings=1
--- Test 9: Continuous mode with Yes/No question ---
/var/folders/9h/mkx7lrxd0xl_61j26qb0jlh00000gn/T/ipykernel_5510/664788559.py:139: UserWarning: Eval 'is_polite': Mode is set to 'continuous' but criteria appears to be a Yes/No question. Consider changing mode to 'binary' for more appropriate scoring.
warnings.warn(warning_msg, UserWarning)
/var/folders/9h/mkx7lrxd0xl_61j26qb0jlh00000gn/T/ipykernel_5510/664788559.py:150: UserWarning: Eval 'is_polite': Mode is set to 'continuous' but criteria does not clearly specify what is good/desirable versus bad/undesirable. This may lead to inconsistent or unclear scoring.
warnings.warn(warning_msg, UserWarning)
Result: mode='continuous', warnings=2
================================================================================
TESTS COMPLETE
================================================================================
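These checks can be approximated with simple keyword heuristics. The sketch below is an illustrative reconstruction of the explicit-mode checks only, consistent with the warnings shown above; the function and pattern names are hypothetical, not the code in this PR.

```python
import re
import warnings

# Hypothetical patterns approximating the checks exercised above.
_YES_NO_QUESTION = re.compile(
    r"^\s*(does|is|are|do|did|has|have|was|were|can|could|should|will)\b.*\?\s*$",
    re.IGNORECASE | re.DOTALL,
)
_GOOD_BAD_SPEC = re.compile(r"\b(good|bad|desirable|undesirable|helpful|unhelpful)\b", re.IGNORECASE)
_NUMERIC_SCALE = re.compile(r"\bscale\b|\b\d\s*(?:to|-)\s*\d\b", re.IGNORECASE)


def check_explicit_mode(name: str, criteria: str, mode: str) -> list[str]:
    """Return (and emit) warning messages for an eval whose mode was set explicitly."""
    msgs: list[str] = []
    if mode == "binary":
        if not _YES_NO_QUESTION.match(criteria):
            msgs.append(f"Eval '{name}': mode is 'binary' but criteria is not a Yes/No question.")
    elif mode == "continuous":
        if _YES_NO_QUESTION.match(criteria):
            msgs.append(f"Eval '{name}': mode is 'continuous' but criteria is a Yes/No question.")
        if _NUMERIC_SCALE.search(criteria):
            msgs.append(f"Eval '{name}': criteria already specifies its own numeric scoring scheme.")
        elif not _GOOD_BAD_SPEC.search(criteria):
            msgs.append(f"Eval '{name}': criteria does not specify what is good versus bad.")
    for msg in msgs:
        warnings.warn(msg, UserWarning)
    return msgs
```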
src/cleanlab_tlm/utils/rag.py (outdated diff)
from cleanlab_tlm.tlm import TLM

tlm = TLM(quality_preset="base")
This seems potentially off; remember, I told you this would be the number one challenge.
You need to properly invoke TLM (using the right API key) from inside the TLM package. If you are unsure, ask Hui Wen for help.
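One possible pattern is sketched below; it assumes the TLM constructor accepts an api_key argument and otherwise falls back to the CLEANLAB_TLM_API_KEY environment variable (the helper name _get_compile_tlm is hypothetical, and the right approach should be confirmed with Hui Wen):

```python
import os

from cleanlab_tlm.tlm import TLM


def _get_compile_tlm() -> TLM:
    """Lazily construct the TLM instance used for eval compilation checks."""
    # Assumption: TLM() accepts api_key= and otherwise reads the
    # CLEANLAB_TLM_API_KEY environment variable; verify against the package.
    return TLM(quality_preset="base", api_key=os.environ.get("CLEANLAB_TLM_API_KEY"))
```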
 For example, specifying `response_identifier` as "AI Answer" means your `criteria` should refer to the response as "AI Answer".
 Leave this value as None (the default) if this Eval doesn't consider the response.
-mode (str, optional): What type of evaluation these `criteria` correspond to, either "continuous" (default) or "binary".
+mode (str, optional): What type of evaluation these `criteria` correspond to, either "continuous" (default), "binary", or "auto".
default should become 'auto' with this PR
Before that change is pushed, though (maybe consider saving it for the next PR if you're not 100% confident in it), you should verify that all TrustworthyRAG default Evals work fine with 'auto' (i.e., it correctly classifies their mode as continuous and does not raise warnings for any of them).
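That verification might look something like the sketch below. Assumptions: get_default_evals() from cleanlab_tlm.utils.rag exposes each Eval's name and criteria attributes, the compile_eval helper from this PR is importable from the same module, and its two-value return follows the tests above.

```python
from cleanlab_tlm.utils.rag import get_default_evals

# compile_eval is the helper added in this PR; adjust the import path if it differs.
from cleanlab_tlm.utils.rag import compile_eval

for default_eval in get_default_evals():
    # Identifier kwargs (query/context/response) are omitted here for brevity;
    # pass them to match each default Eval if compile_eval requires them.
    mode, eval_warnings = compile_eval(
        name=default_eval.name,
        criteria=default_eval.criteria,
        mode="auto",
    )
    assert mode == "continuous", f"{default_eval.name} resolved to {mode!r}"
    assert not eval_warnings, f"{default_eval.name} produced warnings: {eval_warnings}"
```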
jwmueller left a comment:
Had some changes to recommend; you can get Aditya's review after making those.
Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>
…ia logic and compile mode function
@aditya1503 Please take over the rest of this PR. Ask Elias for final review once it's ready.
Adding logic with compilation checks that raise a warning when:
- mode is "binary" but the criteria does not appear to be a Yes/No question
- mode is "continuous" but the criteria appears to be a Yes/No question
- mode is "continuous" but the criteria does not clearly specify what is good/desirable versus bad/undesirable
- mode is "continuous" but the criteria already specifies its own numeric scoring scheme (which would conflict with TrustworthyRAG normalizing scores to the 0-1 range)
Also added support for mode='auto' compilation, which automatically determines whether the mode should be binary or continuous. This classifier raises a warning if the criteria looks appropriate for neither.
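For reference, a rough sketch of how the mode='auto' determination could behave, matching the test results above (illustrative only; the helper name and regexes are hypothetical, not this PR's implementation):

```python
import re
import warnings

# Hypothetical heuristics; the real classifier in this PR may differ.
_YES_NO_QUESTION = re.compile(
    r"^\s*(does|is|are|do|did|has|have|was|were|can|could|should|will)\b.*\?\s*$",
    re.IGNORECASE | re.DOTALL,
)
_GOOD_BAD_SPEC = re.compile(r"\b(good|bad|desirable|undesirable|helpful|unhelpful)\b", re.IGNORECASE)


def resolve_auto_mode(name: str, criteria: str) -> str:
    """Resolve mode='auto' to 'binary' or 'continuous' for the given criteria."""
    if _YES_NO_QUESTION.match(criteria):
        return "binary"  # criteria reads like a Yes/No question
    if not _GOOD_BAD_SPEC.search(criteria):
        # Neither a Yes/No question nor a clear good/bad specification.
        warnings.warn(
            f"Eval '{name}': criteria is neither a Yes/No question nor a clear "
            "good/bad specification; defaulting to 'continuous'.",
            UserWarning,
        )
    return "continuous"
```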
After adding these changes, running the pre-commit hooks gives the following errors in tests/test_tlm_rag.py, which to me suggests we should update the tests in that script (they seem incorrect now):