Closed
4 changes: 4 additions & 0 deletions .jules/sentinel.md
@@ -0,0 +1,4 @@
+## 2024-05-20 - Prevent XML External Entity (XXE) and Billion Laughs Vulnerabilities
+**Vulnerability:** Use of standard library `xml.etree.ElementTree` to parse untrusted XML/RSS feeds in `theverge.py` and `producthunt.py` scrapers.
+**Learning:** The built-in `xml.etree` module in Python is vulnerable to malicious XML payloads such as XML External Entities (XXE) and Billion Laughs attacks. Parsing feeds from external, untrusted sources without defensive measures creates severe security risks (DoS or data exfiltration).
+**Prevention:** Always use `defusedxml.ElementTree` instead of `xml.etree.ElementTree` when parsing any XML data from an untrusted source or network request. Ensure `defusedxml` is included in project dependencies.
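
The prevention note can be illustrated with a small sketch. It assumes a hypothetical payload that declares a single harmless internal entity; a Billion Laughs document nests many such declarations. The helper names and the payload are illustrative and are not part of this PR. The `defusedxml` import is guarded only so the sketch still runs where the package is not installed.

```python
import xml.etree.ElementTree as StdET

try:
    # Third-party package added by this PR; guarded so the sketch runs without it.
    import defusedxml.ElementTree as SafeET
    from defusedxml import EntitiesForbidden
except ImportError:
    SafeET = None

# Hypothetical payload: one internal entity that expands to "hello".
PAYLOAD = '<!DOCTYPE feed [<!ENTITY greeting "hello">]><feed>&greeting;</feed>'


def stdlib_expands_entity(xml_text):
    """Parse with the standard library, which expands internal entities."""
    return StdET.fromstring(xml_text).text


def defused_rejects_entity(xml_text):
    """Return True if defusedxml refuses a document that declares an entity."""
    try:
        SafeET.fromstring(xml_text)
    except EntitiesForbidden:
        return True
    return False


expanded = stdlib_expands_entity(PAYLOAD)
# None means defusedxml was not available, so only the stdlib half ran.
rejected = defused_rejects_entity(PAYLOAD) if SafeET is not None else None
```

With the default arguments, `defusedxml.ElementTree.fromstring` forbids entity declarations, so the same input that the standard library expands is rejected before any expansion happens.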
1 change: 1 addition & 0 deletions functions/requirements.txt
@@ -7,3 +7,4 @@ beautifulsoup4==4.*
 feedparser==6.*
 openai==1.*
 tzdata
+defusedxml==0.*

medium

For security-related dependencies like defusedxml, it's best practice to pin an exact version (e.g., defusedxml==0.7.1) rather than use a wildcard. An exact pin keeps builds deterministic and stops future releases from being pulled in automatically, which matters most for a library that is itself part of a security fix.

defusedxml==0.7.1

10 changes: 7 additions & 3 deletions functions/scrapers/producthunt.py
@@ -6,7 +6,7 @@
 import httpx
 from typing import List, Dict, Any
 from datetime import datetime
-import xml.etree.ElementTree as ET
+import defusedxml.ElementTree as ET
 from bs4 import BeautifulSoup


@@ -114,8 +114,12 @@ def fetch_producthunt(limit: int = 10) -> List[Dict[str, Any]]:
     for entry in entries[:limit]:
         title = entry.find('atom:title', atom_ns)
         link = entry.find('atom:link', atom_ns)
-        summary = entry.find('atom:summary', atom_ns) or entry.find('atom:content', atom_ns)
-        published = entry.find('atom:published', atom_ns) or entry.find('atom:updated', atom_ns)
+        summary = entry.find('atom:summary', atom_ns)
+        if summary is None:
+            summary = entry.find('atom:content', atom_ns)
+        published = entry.find('atom:published', atom_ns)
+        if published is None:
+            published = entry.find('atom:updated', atom_ns)

         if title is None or link is None:
             continue
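
A note on why the hunk above swaps `a or b` for explicit `is None` checks: the Python documentation cautions that the truth value of an `Element` has historically depended on whether it has child elements, so a found `<summary>` that contains only text could be skipped by `or`. The sketch below is one way to express the same fallback. The `find_first` helper, the sample entry, and the `ATOM_NS` mapping are assumptions for illustration (the PR's `atom_ns` definition is not shown in the diff).

```python
import xml.etree.ElementTree as ET

# Assumed namespace mapping; the PR's own atom_ns is not visible in the diff.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical Atom entry with both a summary and a content element.
ENTRY_XML = (
    '<entry xmlns="http://www.w3.org/2005/Atom">'
    "<summary>Short text</summary>"
    "<content>Long text</content>"
    "</entry>"
)


def find_first(entry, paths, ns):
    """Return the first element matched by any path, comparing against None."""
    for path in paths:
        found = entry.find(path, ns)
        if found is not None:
            return found
    return None


entry = ET.fromstring(ENTRY_XML)
summary = find_first(entry, ["atom:summary", "atom:content"], ATOM_NS)
published = find_first(entry, ["atom:published", "atom:updated"], ATOM_NS)

# The summary holds text but no child elements, which is the case where
# truth-testing the element is unreliable.
child_count = len(summary)
```

Here `summary` is the `<summary>` element even though it has no children, and `published` is `None` because the sample entry has neither date element.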
2 changes: 1 addition & 1 deletion functions/scrapers/theverge.py
@@ -6,7 +6,7 @@
 import httpx
 from typing import List, Dict, Any
 from datetime import datetime
-import xml.etree.ElementTree as ET
+import defusedxml.ElementTree as ET

 try:
     from ..resilience import retry_with_backoff