Commit a26ccb6

committed ("u")
1 parent 4e3e052 commit a26ccb6

File tree

1 file changed: +19, -10 lines


posts/parsing_webpages_with_llm_revisited.ipynb

Lines changed: 19 additions & 10 deletions
@@ -485,11 +485,11 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The results are amazing and much more consistent and detailed than those obtained in January. Thanks to LLM training and `PydanticAI`, the output of the LLM is constrained to follow the schema, and can be readily consumed by a program. The model is also clever enough to find the relevant information on the website, even without us describing were to look. I noticed some issues with detecting the `issue`, which the model first attributed wrongly, but instructions to the model in natural language fixed that.\n",
+"The results are amazing and much more consistent and detailed than those obtained in January. Thanks to LLM training and `PydanticAI`, the output of the LLM is constrained to follow the schema and can be readily consumed by a program. The model is also clever enough to find the relevant information on the website, even without us describing where to look. I noticed issues with detecting the `issue` and `reports`, but instructions to the model in natural language fixed those.\n",
 "\n",
 "`PydanticAI` enforces a good prompt structure and good practices, with its split of `system_prompt` and `instructions` and embedding information about the output format into the JSON schema.\n",
 "\n",
-"For a proper validation of the accuracy, we would have to run validation tests against a ground through, of course, so take these results with a caution.\n"
+"For a proper validation of the accuracy, we would have to run validation tests against a ground truth.\n"
 ]
 },
 {
@@ -498,17 +498,26 @@
 "source": [
 "#### Bonus: Testing other models\n",
 "\n",
-"I tested some other models, which performed worse on this task. . It was producing the output in its thinking box, instead of the response for the user.\n",
+"I tested this with some other models, all of which performed worse on this task than the chosen one. Some models only work, or work better, with the `NativeOutput` mode of `PydanticAI`.\n",
 "\n",
-"| Model | Quant | Issue |\n",
-"| -------------------------- | ----- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |\n",
-"| Qwen-2.5-coder-7b-instruct | Q8_0 | Works very well. Reduced performance with NativeOutput,<br> where it starts to mess up the `issue` field. |\n",
-"| gpt-oss-20b | mxfp4 | Couldn't adhere to the format, it failed to produce<br>valid URLs for the `eprint` field. |\n",
-"| Gemma-3-12b-it | Q4_0 | Doesn't work with Pydantic, it's complaining that user<br>and model messages must alternate. |\n",
-"| Qwen3-4B-Thinking-2507 | Q6_K | Fails to return the result via tool call. With NativeOutput,<br>it doesn't adhere to the format, failed to produce valid<br>URLs for the `eprint` field. |\n",
+"- Qwen-2.5-coder-7b-instruct: Q8_0\n",
+"  - Works very well out of the box; some small issues were fixed with prompting. Reduced performance with PydanticAI's `NativeOutput` mode.\n",
+"- gpt-oss-20b: mxfp4\n",
+"  - Couldn't adhere to the format; it failed to produce valid URLs for the `eprint` field.\n",
+"- Gemma-3-12b-it: Q4_0\n",
+"  - Doesn't work with PydanticAI, complaining that user and model messages must alternate.\n",
+"- Qwen3-4B-Thinking-2507: Q6_K\n",
+"  - Fails to return the result via tool call. It works with `NativeOutput`, but fails to produce valid URLs for the `eprint` field.\n",
 "\n",
-"Some models work better or at all with NativeOutput, but Qwen-2.5-coder-7b works best with the default tool calling. Some issues like those with the `eprint` field, could probably be improved with more specific prompts that address the specific errors that the model makes.\n"
+"Some issues that other models have, like those with the `eprint` field, could probably be fixed with prompts that address the specific errors each model makes.\n",
+"\n",
+"Seeing how sensitive the performance is to prompting, even when the task is to produce well-structured output, **gives me pause about trusting public benchmarks**."
 ]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": []
 }
 ],
 "metadata": {
