|
485 | 485 | "cell_type": "markdown", |
486 | 486 | "metadata": {}, |
487 | 487 | "source": [ |
488 | | - "The results are amazing and much more consistent and detailed than those obtained in January. Thanks to LLM training and `PydanticAI`, the output of the LLM is constrained to follow the schema, and can be readily consumed by a program. The model is also clever enough to find the relevant information on the website, even without us describing were to look. I noticed some issues with detecting the `issue`, which the model first attributed wrongly, but instructions to the model in natural language fixed that.\n", |
| 488 | + "The results are amazing and much more consistent and detailed than those obtained in January. Thanks to LLM training and `PydanticAI`, the output of the LLM is constrained to follow the schema, and can be readily consumed by a program. The model is also clever enough to find the relevant information on the website, even without us describing were to look. I noticed issues with detecting the `issue` and `reports`, but instructions to the model in natural language fixed those.\n", |
489 | 489 | "\n", |
490 | 490 | "`PydanticAI` enforces a good prompt structure and good practices, with its split of `system_prompt` and `instructions` and embedding information about the output format into the JSON schema.\n", |
491 | 491 | "\n", |
492 | | - "For a proper validation of the accuracy, we would have to run validation tests against a ground through, of course, so take these results with a caution.\n" |
| 492 | + "For a proper validation of the accuracy, we would have to run validation tests against a ground truth.\n" |
493 | 493 | ] |
494 | 494 | }, |
495 | 495 | { |
|
498 | 498 | "source": [ |
499 | 499 | "#### Bonus: Testing other models\n", |
500 | 500 | "\n", |
501 | | - "I tested some other models, which performed worse on this task. . It was producing the output in its thinking box, instead of the response for the user.\n", |
| 501 | + "I tested this with some other models, all of which performed worse on this task that the chosen one. Some models work better or at all with the `NativeOutput` mode of `PydanticAI`.\n", |
502 | 502 | "\n", |
503 | | - "| Model | Quant | Issue |\n", |
504 | | - "| -------------------------- | ----- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |\n", |
505 | | - "| Qwen-2.5-coder-7b-instruct | Q8_0 | Works very well. Reduced performance with NativeOutput,<br> where it starts to mess up the `issue` field. |\n", |
506 | | - "| gpt-oss-20b | mxfp4 | Couldn't adhere to the format, it failed to produce<br>valid URLs for the `eprint` field. |\n", |
507 | | - "| Gemma-3-12b-it | Q4_0 | Doesn't work with Pydantic, it's complaining that user<br>and model messages must alternate. |\n", |
508 | | - "| Qwen3-4B-Thinking-2507 | Q6_K | Fails to return the result via tool call. With NativeOutput,<br>it doesn't adhere to the format, failed to produce valid<br>URLs for the `eprint` field. |\n", |
| 503 | + "- Qwen-2.5-coder-7b-instruct: Q8_0\n", |
| 504 | + "  - Works very well out of the box; some small issues were fixed with prompting. Reduced performance with `PydanticAI`'s `NativeOutput` mode.\n",
| 505 | + "- gpt-oss-20b: mxfp4\n", |
| 506 | + " - Couldn't adhere to the format, it failed to produce valid URLs for the `eprint` field.\n", |
| 507 | + "- Gemma-3-12b-it: Q4_0\n", |
| 508 | + "  - Doesn't work with `PydanticAI`: it complains that user and model messages must alternate.\n",
| 509 | + "- Qwen3-4B-Thinking-2507: Q6_K\n", |
| 510 | + "  - Fails to return the result via tool call. It works with `NativeOutput`, but fails to produce valid URLs for the `eprint` field.\n",
509 | 511 | "\n", |
510 | | - "Some models work better or at all with NativeOutput, but Qwen-2.5-coder-7b works best with the default tool calling. Some issues like those with the `eprint` field, could probably be improved with more specific prompts that address the specific errors that the model makes.\n" |
| 512 | + "Some issues that other models have, like those with the `eprint` field, could probably be improved with more specific prompts that address the specific errors that the model makes. \n", |
| 513 | + "\n", |
| 514 | + "Seeing how sensitive the performance is to prompting, even when the task is to produce well-structured output, **gives me pause about trusting public benchmarks**." |
511 | 515 | ] |
| 516 | + }, |
| 517 | + { |
| 518 | + "cell_type": "markdown", |
| 519 | + "metadata": {}, |
| 520 | + "source": [] |
512 | 521 | } |
513 | 522 | ], |
514 | 523 | "metadata": { |
|