Summary
Add a --safe flag to gmail thread get (aka gmail read) and gmail get that strips all HTML, removes URLs, and decodes HTML entities, leaving only plain text in the output. The flag is fully opt-in; without it, behavior is unchanged.
Motivation
This tool is increasingly used in automated pipelines and by LLM-based agents that consume --json output to read and reason about emails. This creates a number of risk categories:
- Phishing & malicious content: URLs in email bodies (both text/plain and stripped text/html) are displayed as-is. A user or automation that follows a link from an untrusted email is exposed to phishing, credential harvesting, or malware downloads.
- Prompt injection via email: When LLMs process raw email content from --json output, malicious emails can embed prompt injection payloads in HTML, URLs, or entity-encoded text. Sanitizing the content before it reaches the LLM reduces this attack surface.
- Tracking: Even in a CLI context, URLs copied from output can trigger open-tracking pixels and link-click tracking. Users reading emails in a "read-only" mindset may not expect tracking URLs to survive into the output.
Currently there is no way to request a "safe" read of an email.
Current behavior
- text/plain bodies are displayed verbatim, including all URLs
- text/html fallback uses regex-based tag stripping (<[^>]*>), which can be bypassed by malformed HTML
- HTML entities (e.g. an entity-encoded form of https://evil.com) are not decoded, allowing obfuscated URLs to pass through
- --json output includes the full raw Gmail API response with unsanitized HTML body data
- No mechanism exists to strip URLs or dangerous content
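To make the regex limitation concrete, here is a minimal sketch that applies the pattern quoted above using only the Go standard library. The `stripTags` helper and the sample inputs are illustrative assumptions, not code or data from the tool.

```go
package main

import (
	"fmt"
	"regexp"
)

// tagRe is the naive pattern quoted above: "<", any run of non-">" characters, ">".
var tagRe = regexp.MustCompile(`<[^>]*>`)

// stripTags (hypothetical helper) removes everything the naive pattern treats as a tag.
func stripTags(s string) string {
	return tagRe.ReplaceAllString(s, "")
}

func main() {
	// A ">" inside a quoted attribute ends the match early, so the rest of
	// the tag, including the URL, leaks into the output.
	fmt.Println(stripTags(`<a title="a>b" href="https://evil.example">click</a>`))
	// prints: b" href="https://evil.example">click

	// Script content is not a tag, so it survives the stripping.
	fmt.Println(stripTags(`<script>steal()</script>hello`))
	// prints: steal()hello

	// An unterminated tag never matches and is left in place.
	fmt.Println(stripTags(`hi <img src="https://evil.example/pixel"`))
	// prints: hi <img src="https://evil.example/pixel"
}
```

In all three cases a URL or script body reaches the output even though the input was "stripped".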
Expected behavior (with --safe)
- HTML is converted to text using a proper HTML parser (the golang.org/x/net/html tokenizer), not regex, correctly handling malformed tags, nested structures, and edge cases. This package is already an indirect dependency (imported for charset support), so it adds no new dependencies.
- All http:// and https:// URLs are replaced with [url removed]
- HTML entities are decoded before URL detection, catching obfuscated (entity-encoded) URLs like https://...
- <script> and <style> blocks are fully removed
- Headers (Subject, etc.) are also sanitized
- In JSON mode: a bodies map (keyed by message ID) provides sanitized text, and raw body data is cleared from the message payload to prevent downstream tools from accessing unsanitized content
- Without --safe: zero changes to existing behavior
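The ordering of entity decoding and URL replacement could look roughly like the sketch below. It covers only those two steps, using the standard library's `html.UnescapeString` and `regexp`; it does not include the tokenizer-based HTML-to-text conversion or the JSON handling. The `sanitizeText` name and the URL pattern are assumptions for illustration, not the actual implementation.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
)

// urlRe matches http:// and https:// URLs up to the next whitespace character.
// The exact pattern is an assumption; the proposal does not specify one.
var urlRe = regexp.MustCompile(`(?i)https?://\S+`)

// sanitizeText (hypothetical helper) decodes HTML entities first and only
// then replaces URLs, so an entity-encoded URL is already in plain form
// when the URL pattern runs.
func sanitizeText(s string) string {
	decoded := html.UnescapeString(s)
	return urlRe.ReplaceAllString(decoded, "[url removed]")
}

func main() {
	// "&#104;" is the numeric entity for "h" and "&#58;" for ":".
	fmt.Println(sanitizeText("Visit &#104;ttps&#58;//evil.example/login now"))
	// prints: Visit [url removed] now

	// Text without URLs is only entity-decoded.
	fmt.Println(sanitizeText("Tom &amp; Jerry"))
	// prints: Tom & Jerry
}
```

Running the URL pattern before decoding would miss the first input entirely, because the literal text `&#104;ttps&#58;//` does not start with `http`.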
Implementation
I've already implemented this and have it working with tests. Happy to open a PR if this proposal is approved.