
Commit d82723c

Rework web scraping example.
1 parent 12bf6ec commit d82723c

File tree

1 file changed (+83 -76 lines changed)

notebooks/2b_retrieval1.ipynb

Lines changed: 83 additions & 76 deletions
@@ -47,7 +47,7 @@
 "1. Extract the data from the pages.\n",
 "1. Clean and save the resulting data.\n",
 "\n",
-"Let's walk through an example of getting press releases from the [Microsoft website](https://news.microsoft.com/category/press-releases/).\n",
+"Let's walk through an example of getting press releases from the [Alphabet website](https://abc.xyz/investor/news/2024/).\n",
 "\n",
 "I often prefer to work out of order as follows:\n",
 "\n",
@@ -91,14 +91,14 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"_AGENT = \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0\"\n",
-"\n",
-"pr_url_1 = (\n",
-" \"https://news.microsoft.com/2018/10/04/\"\n",
-" \"redline-communications-and-microsoft-announce-\"\n",
-" \"partnership-to-lower-the-cost-of-tv-white-space-solutions/\"\n",
+"AGENT = (\n",
+" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36\"\n",
+" \" (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.3\"\n",
 ")\n",
-"pr_req_1 = requests.get(pr_url_1, headers={\"User-Agent\": _AGENT})"
+"\n",
+"pr_url_1 = \"https://abc.xyz/2024-1010/\"\n",
+"\n",
+"pr_req_1 = requests.get(pr_url_1, headers={\"User-Agent\": AGENT})"
 ]
 },
 {
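
The rewritten cell fetches a single press release with a browser-like `User-Agent` header, which some sites require before serving full pages. A standalone sketch of that request (the URL and header string are copied from the hunk above; the `timeout` and status check are added here purely for illustration):

```python
import requests

# Browser-like User-Agent string, as set in the reworked notebook cell.
AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    " (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.3"
)

pr_url_1 = "https://abc.xyz/2024-1010/"

# Send the header along with the request; some servers block the default client string.
pr_req_1 = requests.get(pr_url_1, headers={"User-Agent": AGENT}, timeout=30)
print(pr_req_1.status_code)  # 200 on success
```
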
@@ -111,6 +111,54 @@
 "pr_req_1.status_code"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Encoding\n",
+"\n",
+"This is a very deep topic that we only need to barely touch.\n",
+"In short, there are many standards for representing text as mappings of bytes (eight 0 or 1 values).\n",
+"Many of them have significant overlap (based on underlying standards that they are a superset of), such that they at least mostly work, but it's better if we're sure we're using the right encoding.\n",
+"\n",
+"In our example here, the server sends data in such a way that we would infer that the text is in the `ISO-8859-1` encoding, though it is actually in the `UTF-8` encoding.\n",
+"Fortunately, `requests` can tell us both what the encoding is and what it thinks it actually is, so we can build upon that."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"pr_req_1.encoding"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"pr_req_1.apparent_encoding"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"pr_req_1.encoding = pr_req_1.apparent_encoding"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Extracting content"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -208,36 +256,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"pr_soup_1.find(\"div\", {\"class\": \"entry-content m-blog-content\"}).find(\"h3\").text"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"pr_data_1[\"h3\"] = (\n",
-" pr_soup_1.find(\"div\", {\"class\": \"entry-content m-blog-content\"}).find(\"h3\").text\n",
-")"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"pr_data_1"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"pr_soup_1.find(\"div\", {\"class\": \"entry-content m-blog-content\"}).find_all(\"p\")"
+"pr_soup_1.find(\"div\", {\"class\": \"RichTextArticleBody RichTextBody\"}).find_all(\"p\")"
 ]
 },
 {
@@ -251,7 +270,7 @@
 " [\n",
 " i.text\n",
 " for i in pr_soup_1.find(\n",
-" \"div\", {\"class\": \"entry-content m-blog-content\"}\n",
+" \"div\", {\"class\": \"RichTextArticleBody RichTextBody\"}\n",
 " ).find_all(\"p\")\n",
 " ]\n",
 ")"
@@ -288,26 +307,19 @@
 "def get_data_from_soup(soup):\n",
 " data = {}\n",
 " for meta in _METAS:\n",
-" if soup.find(\"meta\", property=meta) is not None:\n",
+" try:\n",
 " prop = soup.find(\"meta\", property=meta)[\"property\"]\n",
-" if soup.find(\"meta\", property=meta) is not None:\n",
 " content = soup.find(\"meta\", property=meta)[\"content\"]\n",
-" if prop is not None and content is not None:\n",
-" data.update({prop: content})\n",
-" try:\n",
-" data[\"h3\"] = (\n",
-" soup.find(\"div\", {\"class\": \"entry-content m-blog-content\"})\n",
-" .find(\"h3\")\n",
-" .string\n",
-" )\n",
-" except AttributeError:\n",
-" data[\"h3\"] = \"\"\n",
+" except TypeError:\n",
+" prop = meta\n",
+" content = \"\"\n",
+" data.update({prop: content})\n",
 "\n",
 " data[\"body\"] = \"\\n\\n\".join(\n",
 " [\n",
 " i.text\n",
 " for i in soup.find(\n",
-" \"div\", {\"class\": \"entry-content m-blog-content\"}\n",
+" \"div\", {\"class\": \"RichTextArticleBody RichTextBody\"}\n",
 " ).find_all(\"p\")\n",
 " ]\n",
 " )\n",
@@ -340,8 +352,8 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"many_pr_url_1 = \"https://news.microsoft.com/category/press-releases/\"\n",
-"many_pr_page_1 = requests.get(many_pr_url_1, headers={\"User-Agent\": _AGENT}).text\n",
+"many_pr_url_1 = \"https://abc.xyz/investor/news/2024/\"\n",
+"many_pr_page_1 = requests.get(many_pr_url_1, headers={\"User-Agent\": AGENT}).text\n",
 "many_pr_soup_1 = BeautifulSoup(many_pr_page_1)"
 ]
 },
@@ -351,8 +363,8 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"# Almost, but note the ones at the bottom.\n",
-"many_pr_soup_1.find(\"section\", id=\"primary\").find_all(\"a\")"
+"# Here, we find the div containing the listings and then find the links within.\n",
+"many_pr_soup_1.find(\"div\", {\"class\": \"PageListW-items\"}).find_all(\"a\")"
 ]
 },
 {
@@ -361,10 +373,9 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"# Here, we further filter down to articles and then get their hrefs to\n",
-"# eliminate the navigation links at the bottom.\n",
-"articles = many_pr_soup_1.find(\"section\", id=\"primary\").find_all(\"article\")\n",
-"links = [i.find(\"a\")[\"href\"] for i in articles]\n",
+"# Then, for each of the anchor tags, we can extract the links themselves.\n",
+"articles = many_pr_soup_1.find(\"div\", {\"class\": \"PageListW-items\"}).find_all(\"a\")\n",
+"links = [i[\"href\"] for i in articles]\n",
 "links"
 ]
 },
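
The listing-page change follows the same shape: locate the container div by class, collect its anchor tags, and read each `href`. A toy sketch (the class name and the first URL come from the diff; the rest of the HTML is invented for illustration):

```python
from bs4 import BeautifulSoup

# Invented listing HTML; the real page uses the same container class.
html = """
<div class="PageListW-items">
  <a href="https://abc.xyz/2024-1010/">Release one</a>
  <a href="https://abc.xyz/another-release/">Release two</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Grab every anchor inside the listing container, then read its href attribute.
articles = soup.find("div", {"class": "PageListW-items"}).find_all("a")
links = [a["href"] for a in articles]
print(links)
```
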
@@ -392,15 +403,17 @@
 "source": [
 "# We need to turn links into soup objects a lot, so let's make a function.\n",
 "def link_to_soup(link):\n",
-" page = requests.get(link, headers={\"User-Agent\": _AGENT}).text\n",
+" page_request = requests.get(link, headers={\"User-Agent\": AGENT})\n",
+" page_request.encoding = page_request.apparent_encoding\n",
+" page = page_request.text\n",
 " soup = BeautifulSoup(page)\n",
 " return soup\n",
 "\n",
 "\n",
 "def get_links_from_link_page(link_page):\n",
 " soup = link_to_soup(link_page)\n",
-" articles = soup.find(\"section\", id=\"primary\").find_all(\"article\")\n",
-" links = [i.find(\"a\")[\"href\"] for i in articles]\n",
+" articles = soup.find(\"div\", {\"class\": \"PageListW-items\"})\n",
+" links = [i[\"href\"] for i in articles]\n",
 " return links\n",
 "\n",
 "\n",
@@ -419,8 +432,8 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"msft_prs = pd.DataFrame(get_data_from_links(many_pr_links_1))\n",
-"msft_prs.head()"
+"alphabet_prs = pd.DataFrame(get_data_from_links(many_pr_links_1))\n",
+"alphabet_prs.head()"
 ]
 },
 {
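
The renamed cell simply wraps the list of per-release dicts in a DataFrame. For reference, a toy version of that last step (the rows are invented placeholders standing in for `get_data_from_links(...)` output):

```python
import pandas as pd

# Invented rows standing in for the scraped per-release dicts.
rows = [
    {"og:title": "Release one", "body": "First paragraph..."},
    {"og:title": "Release two", "body": "Second paragraph..."},
]

alphabet_prs = pd.DataFrame(rows)
print(alphabet_prs.head())
```
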
@@ -429,18 +442,12 @@
 "source": [
 "# Further automation\n",
 "\n",
-"**Note**: for running time reasons, we're not going to make a multi-links-page version, but note that there's a next page link at the bottom of those pages that can be extracted to build that:\n",
-"\n",
-"```html\n",
-"<a href=\"/category/press-releases/page/2/?paged=3\" \n",
-" class=\"c-glyph x-hidden-focus\" \n",
-" aria-label=\"Go to next page\" ms.title=\"Next Page\">\n",
-"```\n",
+"**Note**: for running time reasons, we're not going to make a multi-links-page version, but note that there are year links on the left of the listing pages that can be extracted.\n",
 "\n",
-"However, we could also notice that the link pages have a number in the URL that is incremented by one for each page.\n",
-"We would have to look at a page to get the end number, but we could also simply use a loop to construct a URL for each of those numbers.\n",
+"However, we could also notice that the link pages have a year in the URL.\n",
+"We would have to look at a page to get the earliest year, but we could otherwise simply use a loop to construct a URL for each of those years.\n",
 "\n",
-"`https://news.microsoft.com/category/press-releases/page/2/`"
+"`https://abc.xyz/investor/news/2023/`"
 ]
 }
 ],
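
The closing note suggests looping over years to build the listing URLs. A minimal sketch of that idea (the year range here is an assumption; in practice you would check the site for the earliest available year):

```python
# Build one listing URL per year; the starting year is assumed, not checked.
year_urls = [
    f"https://abc.xyz/investor/news/{year}/"
    for year in range(2020, 2025)
]
print(year_urls)
```
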
@@ -460,7 +467,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.11.4"
+"version": "3.11.11"
 },
 "vscode": {
 "interpreter": {

0 commit comments