|
485 | 485 | "cell_type": "markdown", |
486 | 486 | "metadata": {}, |
487 | 487 | "source": [ |
488 | | - "The results are amazing and much more consistent and detailed than those obtained in January. Thanks to LLM training and `PydanticAI`, the output of the LLM is constrained to follow the schema, and can be readily consumed by a program. The model is also clever enough to find the relevant information on the website, even without us describing were to look. I noticed some issues with detecting the `issue`, which the model first attributed wrongly, but instructions to the model in natural language fixed that.\n", |
| 488 | + "The results are amazing and much more consistent and detailed than those obtained in January. Thanks to LLM training and `PydanticAI`, the output of the LLM is constrained to follow the schema, and can be readily consumed by a program. The model is also clever enough to find the relevant information on the website, even without us describing were to look. I noticed issues with detecting the `issue` and `reports`, but instructions to the model in natural language fixed those.\n", |
489 | 489 | "\n", |
490 | 490 | "`PydanticAI` enforces a good prompt structure and good practices, with its split of `system_prompt` and `instructions` and embedding information about the output format into the JSON schema.\n", |
491 | 491 | "\n", |
492 | | - "For a proper validation of the accuracy, we would have to run validation tests against a ground through, of course, so take these results with a caution.\n" |
| 492 | + "For a proper validation of the accuracy, we would have to run validation tests against a ground truth.\n" |
493 | 493 | ] |
494 | 494 | }, |
495 | 495 | { |
|
498 | 498 | "source": [ |
499 | 499 | "#### Bonus: Testing other models\n", |
500 | 500 | "\n", |
501 | | - "I tested some other models, which performed worse on this task. . It was producing the output in its thinking box, instead of the response for the user.\n", |
| 501 | + "I tested this with some other models, all of which performed worse on this task that the chosen one. Some models work better or at all with the `NativeOutput` mode of `PydanticAI`.\n", |
502 | 502 | "\n", |
503 | | - "| Model | Quant | Issue |\n", |
504 | | - "| -------------------------- | ----- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |\n", |
505 | | - "| Qwen-2.5-coder-7b-instruct | Q8_0 | Works very well. Reduced performance with NativeOutput,<br> where it starts to mess up the `issue` field. |\n", |
506 | | - "| gpt-oss-20b | mxfp4 | Couldn't adhere to the format, it failed to produce<br>valid URLs for the `eprint` field. |\n", |
507 | | - "| Gemma-3-12b-it | Q4_0 | Doesn't work with Pydantic, it's complaining that user<br>and model messages must alternate. |\n", |
508 | | - "| Qwen3-4B-Thinking-2507 | Q6_K | Fails to return the result via tool call. With NativeOutput,<br>it doesn't adhere to the format, failed to produce valid<br>URLs for the `eprint` field. |\n", |
| 503 | + "- Qwen-2.5-coder-7b-instruct: Q8_0\n", |
| 504 | + "  - Works very well out of the box; some small issues were fixed with prompting. Reduced performance with `PydanticAI`'s `NativeOutput` mode.\n",
| 505 | + "- gpt-oss-20b: mxfp4\n", |
| 506 | + " - Couldn't adhere to the format, it failed to produce valid URLs for the `eprint` field.\n", |
| 507 | + "- Gemma-3-12b-it: Q4_0\n", |
| 508 | + "  - Doesn't work with `PydanticAI`: it complains that user and model messages must alternate.\n",
| 509 | + "- Qwen3-4B-Thinking-2507: Q6_K\n", |
| 510 | + "  - Fails to return the result via tool call. It works with `NativeOutput`, but fails to produce valid URLs for the `eprint` field.\n",
509 | 511 | "\n", |
510 | | - "Some models work better or at all with NativeOutput, but Qwen-2.5-coder-7b works best with the default tool calling. Some issues like those with the `eprint` field, could probably be improved with more specific prompts that address the specific errors that the model makes.\n" |
| 512 | + "Some issues that other models have, like those with the `eprint` field, could probably be improved with more specific prompts that address the specific errors that the model makes. \n", |
| 513 | + "\n", |
| 514 | + "Seeing how sensitive the performance is to prompting, even when the task is to produce well-structured output, **gives me pause about trusting public benchmarks**." |
511 | 515 | ] |
| 516 | + }, |
| 517 | + { |
| 518 | + "cell_type": "markdown", |
| 519 | + "metadata": {}, |
| 520 | + "source": [] |
512 | 521 | } |
513 | 522 | ], |
514 | 523 | "metadata": { |
|