Feature Request: --safe flag for gmail read/get to sanitize email content #220

@urasmutlu

Summary

Add a --safe flag to gmail thread get (aka gmail read) and gmail get that strips all HTML, removes URLs, and decodes HTML entities, leaving only plain text in the output. The flag is fully opt-in; without it, behavior is unchanged.

Motivation

This tool is increasingly used in automated pipelines and by LLM-based agents that consume --json output to read and reason about emails. This creates three categories of risk:

  1. Phishing & malicious content: URLs in email bodies (both text/plain and stripped text/html) are displayed as-is. A user or automation that follows a link from an untrusted email is exposed to phishing, credential harvesting, or malware downloads.

  2. Prompt injection via email: When LLMs process raw email content from --json output, malicious emails can embed prompt injection payloads in HTML, URLs, or entity-encoded text. Sanitizing the content before it reaches the LLM reduces this attack surface.

  3. Tracking: Even in a CLI context, URLs that survive into the output still carry tracking. Following a copied link registers a click with the sender's link tracker, and fetching an embedded image URL fires an open-tracking pixel. Users reading emails in a "read-only" mindset may not expect these URLs to appear in the output.

Currently there is no way to request a "safe" read of an email.

Current behavior

  • text/plain bodies are displayed verbatim, including all URLs
  • text/html fallback uses regex-based tag stripping (<[^>]*>), which can be bypassed by malformed HTML
  • HTML entities (e.g. &#104;ttps://evil.com) are not decoded, allowing obfuscated URLs to pass through
  • --json output includes the full raw Gmail API response with unsanitized HTML body data
  • No mechanism exists to strip URLs or dangerous content
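The gaps listed above can be reproduced with a few lines of Go. This is a minimal sketch, not the tool's actual code: `stripTags` is a hypothetical helper that only mimics the `<[^>]*>` regex described in this issue, and the sample bodies are made up for illustration.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// tagRe is the regex quoted in this issue. It is an assumption that the
// tool applies it roughly like this.
var tagRe = regexp.MustCompile(`<[^>]*>`)

// stripTags (hypothetical name) removes anything that looks like a
// complete tag and leaves everything else untouched.
func stripTags(s string) string {
	return tagRe.ReplaceAllString(s, "")
}

func main() {
	// 1. Entity-encoded URL: no literal "https://" is present until the
	//    entities are decoded, so a URL filter run on this text misses it.
	encoded := stripTags(`<p>Reset: &#104;ttps://evil.com/login</p>`)
	fmt.Println(strings.Contains(encoded, "https://"))                        // false
	fmt.Println(strings.Contains(html.UnescapeString(encoded), "https://")) // true

	// 2. Script blocks: the tags go away, but the script body stays.
	fmt.Println(stripTags(`<script>steal()</script>Hello`)) // steal()Hello

	// 3. Malformed HTML: an unclosed tag never matches the regex.
	fmt.Println(stripTags(`Hi <a href="https://evil.com"`)) // unchanged
}
```

Each case corresponds to one of the bullets above: entities are never decoded, tag stripping does not remove element content, and malformed markup passes through as-is.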

Expected behavior (with --safe)

  • HTML is converted to text using a proper HTML parser (the golang.org/x/net/html tokenizer) rather than regex, so malformed tags, nested structures, and edge cases are handled correctly. This package is already an indirect dependency, imported for charset support, so it adds no new dependencies.
  • All http:// and https:// URLs are replaced with [url removed]
  • HTML entities are decoded before URL detection, catching obfuscated URLs like &#104;ttps://...
  • <script> and <style> blocks are fully removed
  • Headers (Subject, etc.) are also sanitized
  • In JSON mode: a bodies map (keyed by message ID) provides sanitized text, and raw body data is cleared from the message payload to prevent downstream tools from accessing unsanitized content
  • Without --safe: zero changes to existing behavior
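The decode-then-redact ordering and the bodies map described above could look roughly like the following. This is a sketch under stated assumptions rather than the implementation mentioned in this issue: `sanitizeText`, `safeBodies`, and the URL pattern are hypothetical, the message IDs are invented, and the HTML-to-text step (the golang.org/x/net/html tokenizer) is left out so the example runs on the standard library alone.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
)

// urlRe matches http:// and https:// URLs up to the next whitespace,
// quote, or angle bracket. The exact pattern is an assumption.
var urlRe = regexp.MustCompile(`(?i)https?://[^\s<>"']+`)

const urlPlaceholder = "[url removed]"

// sanitizeText (hypothetical name) decodes HTML entities first, then
// replaces URLs, so obfuscated forms like "&#104;ttps://..." are caught.
func sanitizeText(s string) string {
	decoded := html.UnescapeString(s)
	return urlRe.ReplaceAllString(decoded, urlPlaceholder)
}

// safeBodies (hypothetical name) builds a map of sanitized text keyed by
// message ID, in the spirit of the proposed JSON "bodies" field.
func safeBodies(raw map[string]string) map[string]string {
	out := make(map[string]string, len(raw))
	for id, body := range raw {
		out[id] = sanitizeText(body)
	}
	return out
}

func main() {
	bodies := safeBodies(map[string]string{
		"msg-1": "Verify at &#104;ttps://evil.com/login today",
		"msg-2": "Lunch at noon?",
	})
	fmt.Println(bodies["msg-1"]) // Verify at [url removed] today
	fmt.Println(bodies["msg-2"]) // Lunch at noon?
}
```

Running URL detection before entity decoding would leave the first message untouched, which is why the order matters. The same `sanitizeText` call could be applied to header values such as Subject.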

Implementation

I've already implemented this and have it working with tests. Happy to open a PR if this proposal is approved.
