Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions .jules/sentinel.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
## 2025-02-27 - [Fix XXE Vulnerability in Scrapers]
**Vulnerability:** Found `xml.etree.ElementTree.fromstring` parsing untrusted XML data (RSS feeds) in `functions/scrapers/theverge.py` and `functions/scrapers/producthunt.py`. This is vulnerable to XML Entity Expansion (Billion Laughs) and XXE attacks.
**Learning:** Standard library XML parsers in Python are not secure by default. Untrusted input must not be passed to them.
**Prevention:** Always use `defusedxml.ElementTree.fromstring` as a drop-in replacement when parsing XML from external or untrusted sources.
1 change: 1 addition & 0 deletions functions/requirements.txt
Original file line number Diff line number Diff line change
Expand Up @@ -7,3 +7,4 @@ beautifulsoup4==4.*
feedparser==6.*
openai==1.*
tzdata
defusedxml

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

It's a good practice to pin dependency versions to ensure reproducible builds. Consider specifying a version for defusedxml to align with the versioning style of other packages in this file.

defusedxml==0.7.*

5 changes: 3 additions & 2 deletions functions/scrapers/producthunt.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@
import httpx
from typing import List, Dict, Any
from datetime import datetime
from defusedxml.ElementTree import fromstring
import xml.etree.ElementTree as ET
Comment on lines +9 to 10

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To make the code cleaner and use defusedxml as a complete drop-in replacement, it's better to replace both of these imports with a single import: from defusedxml import ElementTree as ET.

Then, on line 85, you should change fromstring(response.content) to ET.fromstring(response.content). This makes it clear that the secure version is being used throughout the module via the ET alias and you no longer need to import from the standard xml.etree.ElementTree (which is currently kept for ET.ParseError).

Suggested change
from defusedxml.ElementTree import fromstring
import xml.etree.ElementTree as ET
from defusedxml import ElementTree as ET

from bs4 import BeautifulSoup

Expand Down Expand Up @@ -80,8 +81,8 @@ def fetch_producthunt(limit: int = 10) -> List[Dict[str, Any]]:
print(f"Product Hunt RSS returned {response.status_code}")
return []

# Parse RSS feed
root = ET.fromstring(response.content)
# Parse RSS feed securely
root = fromstring(response.content)

products = []
items = root.findall('.//item')
Expand Down
6 changes: 3 additions & 3 deletions functions/scrapers/theverge.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@
import httpx
from typing import List, Dict, Any
from datetime import datetime
import xml.etree.ElementTree as ET
from defusedxml.ElementTree import fromstring

try:
from ..resilience import retry_with_backoff
Expand Down Expand Up @@ -39,8 +39,8 @@ def fetch_theverge(limit: int = 10) -> List[Dict[str, Any]]:
response = client.get(VERGE_RSS_URL, headers=headers)
response.raise_for_status()

# Parse XML/Atom feed
root = ET.fromstring(response.content)
# Parse XML/Atom feed securely
root = fromstring(response.content)

# The Verge uses Atom format
namespace = {'atom': 'http://www.w3.org/2005/Atom'}
Expand Down