Open
Labels
bug (Something isn't working), enhancement (New feature or request), prompt-engineering
Description
There are several recurring issues across different questions. They ALL need to be addressed via a careful audit of EVERY question:
| Problem | Example or elaboration | Consequence | Proposed solution |
|---|---|---|---|
| Question text is too long and/or complex | Some questions include extraneous detail (outside the scope of the concept being tested) that makes them harder to parse without improving signal | The question ends up testing reading comprehension instead of the intended conceptual content | Audit long questions and simplify/reword them to focus on the intended concept. Keep questions short, simple, and direct |
| Distractors vary along non-critical dimensions | For example, there's a question about the Terracotta Army (correct answer). Several distractors list "Terracotta Army" followed by extraneous text that goes beyond the scope of the question (e.g., "from the funerary temple ..." vs "from the burial complex ..." vs "from the mausoleum of ..."). | The test ends up focusing on those minor details instead of the core concept. | Reword questions and responses so that the "answers" and "distractors" are very short (roughly 1-3 words) |
| Answers can be determined from context without actually having expertise in the target area | Question: "What hardstone material, mined and carved in China since the Neolithic...". “jade” appears in 3 options, so the answer must involve jade. “gemstone” appears in 3 options, so it must involve gemstone. “virtue and purity” appears in 3 options, so it must involve that too. B is the only option that contains all of those, so the answer must be B. This logic works on roughly 3/4 of the questions | This reduces the utility and signal the questions provide about knowledge, since correct responses end up reflecting pattern-matching ability more than expertise. | Carefully audit all questions to determine whether the content of EITHER the question or the response options provides enough information on its own to make the answer easy to guess without expertise in the tested area |
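
The "guess from context" problem in the last row can be screened for automatically before any manual or model-based audit. The following is a minimal sketch, not part of the existing codebase: `guess_by_overlap` and `is_guessable` are hypothetical names, the stopword list is an arbitrary assumption, and the example options are invented to mimic the jade question (the real options are not reproduced here). It picks the option that shares the most words with the other options and flags the question if that pick is the correct answer.

```python
import re
from collections import Counter

# Assumption: a tiny, ad hoc stopword list; a real audit would likely need a fuller one.
STOPWORDS = {"a", "an", "the", "of", "and", "or", "in", "to", "for", "with", "from"}


def tokens(text):
    """Lowercased set of non-stopword words in an option."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def guess_by_overlap(options):
    """Return the key of the option sharing the most words with other options.

    Returns None when no words are shared or when the top score is tied.
    """
    token_sets = {key: tokens(text) for key, text in options.items()}
    # For each word, count how many options contain it.
    counts = Counter(w for s in token_sets.values() for w in s)
    # Score an option by the popularity of its words that recur in 2+ options.
    scores = {
        key: sum(counts[w] for w in s if counts[w] >= 2)
        for key, s in token_sets.items()
    }
    best = max(scores.values())
    winners = [key for key, score in scores.items() if score == best]
    return winners[0] if best > 0 and len(winners) == 1 else None


def is_guessable(options, correct_key):
    """True if the overlap heuristic alone lands on the correct answer."""
    return guess_by_overlap(options) == correct_key


# Hypothetical options, invented for illustration only.
long_options = {
    "A": "Jade, a gemstone prized for durability",
    "B": "Jade, a gemstone symbolizing virtue and purity",
    "C": "Bronze, a metal symbolizing virtue and purity",
    "D": "Jade, a mineral used for tools",
}
short_options = {"A": "Jade", "B": "Bronze", "C": "Amber", "D": "Ivory"}

print(is_guessable(long_options, "B"))   # flagged: overlap alone finds B
print(is_guessable(short_options, "A"))  # not flagged: no shared words
```

A heuristic like this only catches word-overlap cues, so it would complement rather than replace the careful audit proposed in the table. It also illustrates why very short answers and distractors help: with one-word options there is nothing to pattern match on.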
Suggested approach:
- Create a skill to audit and improve questions for a given domain (follow general approach of generate-questions skill)
- For each question in the given domain, audit carefully for the above issues and return a re-worded question + responses
- Do this across multiple passes:
  - Pass 1: flag which issues in the table are present and re-word.
  - Pass 2: re-audit for all issues in the table. Continue alternating between auditing and fixing until the question passes all audits.
  - Then update the question.
- After all questions have been updated, we will need to re-embed all questions.
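
The alternating audit/fix passes described above could be structured roughly as follows. This is a sketch under stated assumptions: `Question`, `audit_until_clean`, `toy_audit`, and `toy_fix` are all hypothetical names, and the toy audit and fix functions are simple stand-ins for whatever the audit skill would actually do (presumably model calls). The example option text is invented.

```python
from dataclasses import dataclass


@dataclass
class Question:
    text: str
    options: dict  # option key -> option text
    correct: str   # key of the correct option


def audit_until_clean(question, audit, fix, max_passes=5):
    """Alternate audit and fix until the audit reports no issues.

    `audit(question)` returns a list of issue labels (empty when clean);
    `fix(question, issues)` returns a re-worded Question.
    """
    for _ in range(max_passes):
        issues = audit(question)
        if not issues:
            return question
        question = fix(question, issues)
    # Final re-audit after the last allowed fix.
    if audit(question):
        raise RuntimeError(f"question still failing after {max_passes} passes")
    return question


def toy_audit(question):
    """Stand-in audit: flag any option longer than 3 words."""
    issues = []
    if any(len(text.split()) > 3 for text in question.options.values()):
        issues.append("long_options")
    return issues


def toy_fix(question, issues):
    """Stand-in fix: keep only the text before the first comma."""
    if "long_options" in issues:
        shortened = {k: v.split(",")[0].strip() for k, v in question.options.items()}
        return Question(question.text, shortened, question.correct)
    return question


# Hypothetical question, invented for illustration only.
q = Question(
    text="Which find consists of life-size clay soldiers?",
    options={
        "A": "Terracotta Army, from the burial complex near Xi'an",
        "B": "Rosetta Stone",
    },
    correct="A",
)
cleaned = audit_until_clean(q, toy_audit, toy_fix)
print(cleaned.options)
```

Capping the number of passes keeps a question that never passes from looping forever. Writing the result back and re-embedding would happen only after this loop returns, matching the order of the steps in the list.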