Version: 1.0
Author: Joaquim
Date: January 2025
Status: Draft
Manual exploratory testing is time-consuming, inconsistent, and fails to scale. QA teams spend significant effort navigating applications, identifying edge cases, and documenting findings—work that could be augmented by intelligent automation.
The proposed solution is an autonomous AI agent that explores web applications intelligently, discovers potential issues, and generates actionable reports. The agent combines LLM reasoning with browser automation to make human-like testing decisions while preserving human oversight through a CLI interface.
| Metric | Target |
|---|---|
| Bug Discovery | Identify ≥5 distinct issues in target application |
| Coverage | Explore ≥4 major functional areas |
| Report Quality | Actionable findings with evidence (screenshots) |
| Human Control | Clear intervention points with meaningful summaries |
- Autonomous Web Exploration: Navigate pages, interact with UI elements, fill forms
- Intelligent Decision Making: LLM-driven analysis of page state and next actions
- Human-in-the-Loop CLI: Periodic checkpoints for user guidance
- Custom Tooling: Broken image detector with comprehensive edge case handling
- Structured Reporting: Findings report with severity, evidence, and coverage summary
- Stretch Goals: Cloud deployment, persistence, test generation, MCP server
- Visual regression testing (pixel comparison)
- Performance/load testing
- Security penetration testing (beyond basic input validation)
- Mobile-specific testing
- Multi-browser support (Chromium only for MVP)
URL: https://with-bugs.practicesoftwaretesting.com
Functional Areas to Explore:
- Product browsing and search
- Product details and filtering
- Shopping cart operations
- Checkout flow
- User authentication (login/register)
- User profile management
- Contact/support forms
- Description: Agent navigates between pages using links, buttons, and URL manipulation
- Acceptance Criteria:
- Successfully follow internal links
- Handle navigation failures gracefully
- Track visited pages to avoid infinite loops
- Support back/forward navigation when beneficial
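The visit-tracking criterion above can be sketched as a small helper that normalizes URLs before counting visits, so trivially different links (hash fragments, trailing slashes) map to one page. `VisitTracker` and its method names are illustrative, not part of this PRD:

```typescript
// Illustrative sketch: track visited pages by normalized URL to avoid loops.
class VisitTracker {
  private visited = new Map<string, number>();

  // Normalize a URL so trivially different links map to one page:
  // drop hash fragments and trailing slashes, sort query parameters.
  normalize(raw: string): string {
    const url = new URL(raw);
    url.hash = "";
    url.searchParams.sort();
    const path = url.pathname.replace(/\/+$/, "") || "/";
    return `${url.origin}${path}${url.search}`;
  }

  // Record a visit and return how many times this page has been seen.
  recordVisit(raw: string): number {
    const key = this.normalize(raw);
    const count = (this.visited.get(key) ?? 0) + 1;
    this.visited.set(key, count);
    return count;
  }

  // A page is worth revisiting only below a configurable cap.
  shouldVisit(raw: string, maxVisits = 2): boolean {
    return (this.visited.get(this.normalize(raw)) ?? 0) < maxVisits;
  }

  get pagesSeen(): number {
    return this.visited.size;
  }
}
```

A revisit cap (rather than a strict visited set) lets the agent return to a hub page a bounded number of times without looping forever.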
- Description: Agent interacts with all common UI elements
- Acceptance Criteria:
- Click buttons, links, and interactive elements
- Fill text inputs, textareas, and rich text editors
- Select options from dropdowns and radio buttons
- Handle checkboxes and toggles
- Submit forms
- Handle modal dialogs and popups
- Description: Agent captures visual evidence of findings
- Acceptance Criteria:
- Full page screenshots on significant events
- Element-specific screenshots for issue evidence
- Organized storage with meaningful naming
- Linked to findings in final report
- Description: Agent extracts and analyzes page content for LLM processing
- Acceptance Criteria:
- Extract visible text content
- Identify interactive elements with their attributes
- Capture form field states and validation messages
- Detect error messages and alerts
- Extract structured data (tables, lists, product info)
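One way to feed the extracted content to the LLM is to compress it into a short text block. The `PageSnapshot` shape below is an assumption about what the extraction step produces, not a spec requirement:

```typescript
// Hypothetical sketch of the "extract for LLM" step: compress page data
// into a compact text summary. PageSnapshot is an assumed shape.
interface PageSnapshot {
  url: string;
  title: string;
  interactive: { tag: string; text: string; selector: string }[];
  errors: string[];
}

function summarizeForLLM(page: PageSnapshot, maxElements = 20): string {
  const lines = [
    `URL: ${page.url}`,
    `Title: ${page.title}`,
    `Interactive elements (${page.interactive.length} total):`,
    ...page.interactive
      .slice(0, maxElements)
      .map((el, i) => `  ${i + 1}. <${el.tag}> "${el.text}" -> ${el.selector}`),
  ];
  if (page.errors.length > 0) {
    lines.push(`Errors on page: ${page.errors.join("; ")}`);
  }
  return lines.join("\n");
}
```

Capping the element list keeps token usage bounded, which also supports the cost-mitigation goals later in this document.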
- Description: LLM analyzes current page to understand context
- Acceptance Criteria:
- Identify page type/purpose
- Recognize available actions
- Detect potential issues or anomalies
- Understand application state (logged in, cart contents, etc.)
- Description: LLM decides next action based on analysis
- Acceptance Criteria:
- Generate hypothesis about what to test
- Prioritize high-value interactions
- Provide clear reasoning for decisions
- Balance exploration vs exploitation
- Consider testing edge cases (empty inputs, special characters, boundaries)
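The exploration-vs-exploitation balance above could be approximated with a simple scoring function. The weights and field names are illustrative defaults, not values mandated by this PRD:

```typescript
// Hedged sketch: score candidate actions so novel areas and suspected
// bugs are preferred. Weights are illustrative, not requirements.
interface CandidateAction {
  description: string;
  timesTried: number;      // how often this exact action ran before
  areaVisits: number;      // visits to the functional area it belongs to
  suspectedIssue: boolean; // LLM flagged something worth confirming
}

function scoreAction(a: CandidateAction): number {
  const novelty = 1 / (1 + a.timesTried);     // decay repeated actions
  const areaNovelty = 1 / (1 + a.areaVisits); // prefer unexplored areas
  const exploit = a.suspectedIssue ? 1 : 0;   // chase likely bugs
  return 0.4 * novelty + 0.3 * areaNovelty + 0.3 * exploit;
}

function pickBest(actions: CandidateAction[]): CandidateAction {
  return actions.reduce((best, a) => (scoreAction(a) > scoreAction(best) ? a : best));
}
```

In practice the LLM would propose the candidates and the score would act as a tie-breaker or sanity check on its choice.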
- Description: Agent identifies potential bugs and UX issues
- Acceptance Criteria:
- Detect JavaScript errors in console
- Identify HTTP errors (4xx, 5xx)
- Recognize broken functionality
- Note UX issues (confusing flows, missing feedback)
- Identify accessibility concerns
- Detect data inconsistencies
- Description: Agent presents exploration progress to user
- Acceptance Criteria:
- Pages visited count and list
- Actions performed summary
- Issues found so far
- Current location in application
- Time elapsed
- Description: Agent pauses for human input at defined points
- Acceptance Criteria:
- After N actions (configurable, default: 10)
- When entering new major section
- When confidence in next action is low
- After finding significant issue
- When stuck or detecting potential infinite loop
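The checkpoint conditions above can be collapsed into one predicate. The default interval of 10 actions mirrors the PRD; the state shape and the 0.3 confidence threshold are illustrative assumptions:

```typescript
// Sketch of the checkpoint triggers as a single predicate.
interface AgentState {
  actionsSinceCheckpoint: number;
  enteredNewSection: boolean;
  decisionConfidence: number; // 0..1, from the LLM's self-report
  foundSignificantIssue: boolean;
  repeatedPageVisits: number; // consecutive visits to the same page
}

function shouldCheckpoint(s: AgentState, actionInterval = 10): string | null {
  if (s.actionsSinceCheckpoint >= actionInterval) return "action-limit";
  if (s.enteredNewSection) return "new-section";
  if (s.decisionConfidence < 0.3) return "low-confidence";
  if (s.foundSignificantIssue) return "significant-issue";
  if (s.repeatedPageVisits >= 3) return "possible-loop";
  return null; // keep exploring autonomously
}
```

Returning the trigger reason (rather than a bare boolean) lets the CLI explain to the user *why* the agent paused.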
- Description: User can guide agent behavior
- Acceptance Criteria:
- Continue: Proceed with agent's plan
- Stop: End exploration and generate report
- Guide: Provide specific direction (e.g., "focus on checkout")
- Skip: Avoid certain areas
- Prioritize: Focus on specific functionality
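The five commands above map naturally to a small parser at each checkpoint prompt. The concrete syntax (e.g. `guide focus on checkout`) is an assumption, not specified by this PRD:

```typescript
// Illustrative parser for the five checkpoint commands.
type Command =
  | { kind: "continue" }
  | { kind: "stop" }
  | { kind: "guide"; direction: string }
  | { kind: "skip"; area: string }
  | { kind: "prioritize"; area: string };

function parseCommand(input: string): Command | null {
  const trimmed = input.trim();
  const [head, ...rest] = trimmed.split(/\s+/);
  const arg = rest.join(" ");
  switch (head.toLowerCase()) {
    case "continue":   return { kind: "continue" };
    case "stop":       return { kind: "stop" };
    case "guide":      return arg ? { kind: "guide", direction: arg } : null;
    case "skip":       return arg ? { kind: "skip", area: arg } : null;
    case "prioritize": return arg ? { kind: "prioritize", area: arg } : null;
    default:           return null; // unrecognized input -> re-prompt the user
  }
}
```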
- Description: Scan page for all image elements
- Acceptance Criteria:
- Find all `<img>` elements
- Find CSS background images
- Find images in `<picture>` elements
- Find SVG images (inline and external)
- Find favicon and other meta images
- Description: Determine if images are broken
- Acceptance Criteria:
- Detect HTTP errors (404, 403, 500, etc.)
- Detect network failures
- Detect empty/invalid src attributes
- Detect zero-dimension images (naturalWidth/naturalHeight = 0)
- Detect images that timeout
- Handle lazy-loaded images appropriately
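The broken-image checks above can be expressed as one classification function. The `ImageProbe` input shape is an assumption about what a Playwright page evaluation would return for each image:

```typescript
// Hedged sketch: map what the page reports about an image to one of the
// failure categories listed above. ImageProbe is an assumed shape.
interface ImageProbe {
  src: string | null;
  httpStatus: number | null; // null when the request never resolved
  naturalWidth: number;
  naturalHeight: number;
  timedOut: boolean;
  lazyPending: boolean;      // loading="lazy" and not yet in viewport
}

type ImageFailure =
  | "missing-src" | "http-error" | "network-failure"
  | "zero-dimensions" | "timeout" | null;

function classifyImage(p: ImageProbe): ImageFailure {
  if (p.lazyPending) return null;                 // not broken, just not loaded yet
  if (!p.src || p.src.trim() === "") return "missing-src";
  if (p.timedOut) return "timeout";
  if (p.httpStatus !== null && p.httpStatus >= 400) return "http-error";
  if (p.httpStatus === null) return "network-failure";
  if (p.naturalWidth === 0 || p.naturalHeight === 0) return "zero-dimensions";
  return null; // image loaded fine
}
```

Checking `lazyPending` first keeps lazy-loaded images from being flagged as false positives, which matters for the <20% false-positive target.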
- Description: Return detailed broken image information
- Acceptance Criteria:
- Image source URL
- Alt text (if present)
- Location on page (selector/xpath)
- Failure reason (categorized)
- Parent element context
- Severity assessment
- Description: Generate comprehensive bug report
- Acceptance Criteria:
- List all discovered issues
- Severity classification (Critical, High, Medium, Low)
- Steps to reproduce
- Expected vs actual behavior
- Screenshot evidence
- Timestamp and page URL
- Description: Document exploration coverage
- Acceptance Criteria:
- Pages visited with timestamps
- Functional areas covered
- Actions performed per area
- Unexplored areas identified
- Session duration and statistics
- Description: Output format and structure
- Acceptance Criteria:
- Markdown format for readability
- JSON format for programmatic access
- HTML format for visual presentation (optional)
- Screenshots embedded or linked
- Exportable and shareable
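Because the findings are plain data, the JSON format falls out of `JSON.stringify` and the Markdown format is a small renderer over the same records. The `Finding` shape below mirrors the criteria above but is itself an assumption:

```typescript
// Minimal sketch of the Markdown output path.
interface Finding {
  title: string;
  severity: "Critical" | "High" | "Medium" | "Low";
  url: string;
  steps: string[];
  expected: string;
  actual: string;
  screenshot?: string; // relative path to the saved image
}

function renderMarkdown(findings: Finding[]): string {
  const order = { Critical: 0, High: 1, Medium: 2, Low: 3 };
  const sorted = [...findings].sort((a, b) => order[a.severity] - order[b.severity]);
  return sorted
    .map((f, i) =>
      [
        `## ${i + 1}. [${f.severity}] ${f.title}`,
        `URL: ${f.url}`,
        `Steps to reproduce:`,
        ...f.steps.map((s, j) => `${j + 1}. ${s}`),
        `Expected: ${f.expected}`,
        `Actual: ${f.actual}`,
        f.screenshot ? `![evidence](${f.screenshot})` : "",
      ].filter(Boolean).join("\n"),
    )
    .join("\n\n");
}
```

Sorting by severity puts Critical findings first, which supports the "prioritize and track" goal in the user stories.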
| Requirement | Target |
|---|---|
| Action execution time | < 5 seconds per action |
| LLM response time | < 10 seconds per decision |
| Total exploration session | 15-30 minutes typical |
| Memory usage | < 2GB RAM |
- Error Recovery: Agent recovers from transient failures (network, element not found)
- State Consistency: Agent maintains consistent state across actions
- Graceful Degradation: Agent completes partial report if terminated early
- Setup Time: < 10 minutes from clone to running
- Configuration: Sensible defaults with optional overrides
- Documentation: Clear README with examples
- Output Clarity: Reports understandable by non-technical stakeholders
- Code Quality: TypeScript strict mode, ESLint, Prettier
- Modularity: Clear separation of concerns
- Extensibility: Easy to add new tools, detectors, or LLM providers
- Testing: Unit tests for critical components
Priority: High
Description: Deploy agent to cloud for remote execution
Requirements:
- Containerized deployment (Docker)
- Trigger via HTTP endpoint or CLI
- Support for scheduled runs
- Results stored and retrievable
- Cost-effective infrastructure
Suggested Platforms: Railway, Render, AWS Lambda + Fargate
Priority: Medium
Description: Save and resume exploration state
Requirements:
- Serialize exploration state (visited pages, findings, context)
- Resume from checkpoint
- Merge findings from multiple sessions
- Handle application state changes between sessions
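Since the exploration state is plain data, a versioned JSON round-trip is enough for checkpointing. The `ExplorationState` shape and the de-duplication key are assumptions:

```typescript
// Sketch of checkpoint persistence and cross-session merge.
interface ExplorationState {
  visitedUrls: string[];
  findings: { title: string; url: string }[];
  pendingAreas: string[];
  actionCount: number;
}

function saveCheckpoint(state: ExplorationState): string {
  return JSON.stringify({ version: 1, savedAt: new Date().toISOString(), state });
}

function resumeCheckpoint(serialized: string): ExplorationState {
  const parsed = JSON.parse(serialized);
  if (parsed.version !== 1) {
    throw new Error(`Unsupported checkpoint version: ${parsed.version}`);
  }
  return parsed.state as ExplorationState;
}

// Merge findings from two sessions, de-duplicating by title + URL.
function mergeFindings(a: ExplorationState, b: ExplorationState): ExplorationState["findings"] {
  const seen = new Set(a.findings.map((f) => `${f.title}@${f.url}`));
  return [...a.findings, ...b.findings.filter((f) => !seen.has(`${f.title}@${f.url}`))];
}
```

The version field leaves room for the schema to evolve between sessions without silently misreading old checkpoints.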
Priority: Medium
Description: Generate Playwright test scripts from findings
Requirements:
- Convert issue reproduction steps to Playwright code
- Generate regression tests for discovered bugs
- Include assertions based on expected behavior
- Organize tests by feature area
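Converting reproduction steps to Playwright code is largely string templating over a step vocabulary. The `ReproStep` shape and the emitted assertions are assumptions; generated tests would still need human review:

```typescript
// Illustrative code generator: recorded steps -> a Playwright test file.
type ReproStep =
  | { op: "goto"; url: string }
  | { op: "click"; selector: string }
  | { op: "fill"; selector: string; value: string }
  | { op: "expectVisible"; selector: string };

function generatePlaywrightTest(name: string, steps: ReproStep[]): string {
  const body = steps
    .map((s) => {
      switch (s.op) {
        case "goto":
          return `  await page.goto(${JSON.stringify(s.url)});`;
        case "click":
          return `  await page.click(${JSON.stringify(s.selector)});`;
        case "fill":
          return `  await page.fill(${JSON.stringify(s.selector)}, ${JSON.stringify(s.value)});`;
        case "expectVisible":
          return `  await expect(page.locator(${JSON.stringify(s.selector)})).toBeVisible();`;
      }
    })
    .join("\n");
  return [
    `import { test, expect } from "@playwright/test";`,
    ``,
    `test(${JSON.stringify(name)}, async ({ page }) => {`,
    body,
    `});`,
  ].join("\n");
}
```

`JSON.stringify` doubles as a quoting/escaping mechanism, so selectors and values with quotes in them produce valid code.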
Priority: High (aligns with current expertise)
Description: Expose agent as MCP server
Requirements:
- Implement MCP protocol
- Expose tools: `start_exploration`, `get_status`, `get_findings`, `stop_exploration`
- Allow external LLMs to invoke agent
- Support streaming progress updates
| Component | Technology | Rationale |
|---|---|---|
| Language | TypeScript | Required by challenge |
| Browser Automation | Playwright | Required by challenge |
| Runtime | Node.js 20+ | LTS with modern features |
| Component | Options | Decision Criteria |
|---|---|---|
| Agent Framework | LangGraph vs Custom | See ADR-001 |
| LLM Provider | Anthropic vs OpenAI | See ADR-002 |
| State Management | In-memory vs Persistent | See ADR-003 |
US-001: As a QA engineer, I want to start the agent with a single command so that I can quickly begin exploration without complex setup.
US-002: As a QA engineer, I want to see what the agent is doing in real-time so that I can understand its reasoning and catch issues early.
US-003: As a QA engineer, I want to guide the agent toward specific areas so that I can focus testing on high-risk functionality.
US-004: As a QA engineer, I want a comprehensive report at the end so that I can prioritize and track discovered issues.
US-005: As a QA engineer, I want evidence (screenshots) for each finding so that I can reproduce and verify issues.
US-006: As a developer, I want clear reproduction steps so that I can debug and fix issues efficiently.
US-007: As a developer, I want generated test scripts so that I can prevent regressions.
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| LLM rate limiting | High | Medium | Implement retry logic, caching, batch decisions |
| Target site changes | Medium | Low | Configurable selectors, adaptive discovery |
| Infinite exploration loops | High | Medium | Visit tracking, action limits, diversity scoring |
| LLM hallucination in actions | High | Medium | Validate actions against actual page state |
| Cost overrun (LLM tokens) | Medium | Medium | Token budgets, summarization strategies |
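The rate-limit mitigation in the table above typically takes the form of retry with exponential backoff around each LLM call. The attempt count and delays are illustrative defaults, not requirements from this PRD:

```typescript
// Sketch of the rate-limit mitigation: retry with exponential backoff.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Exponential backoff: 500ms, 1000ms, 2000ms, ...
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

A production version would also inspect the error (e.g. retry on HTTP 429 but not on 401) and honor any `Retry-After` hint from the provider.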
- Bug Discovery Rate: ≥5 genuine issues found
- False Positive Rate: <20% of reported issues
- Coverage Breadth: ≥4 distinct functional areas explored
- Session Efficiency: Complete meaningful exploration in <30 minutes
- Report Actionability: Issues have clear reproduction steps
- Code Quality: Passes code review standards
- Architecture Clarity: Easy to understand and extend
- Documentation Quality: README enables quick start
| Phase | Duration | Deliverables |
|---|---|---|
| Planning & Design | Day 1 | PRD, ADRs, Architecture diagram |
| Core Implementation | Days 2-3 | Agent, tools, CLI interface |
| Integration & Testing | Day 4 | End-to-end testing, bug fixes |
| Documentation & Polish | Day 5 | README, video, final report |
- Agent: Autonomous software that perceives environment and takes actions
- LLM: Large Language Model (e.g., Claude, GPT-4)
- Human-in-the-Loop: Pattern where humans can intervene in automated processes
- MCP: Model Context Protocol - standard for LLM tool integration
- Exploration: Process of navigating and interacting with application to discover behavior
- Target Application: https://with-bugs.practicesoftwaretesting.com
- LangGraph Documentation: https://langchain-ai.github.io/langgraph/
- Playwright Documentation: https://playwright.dev/
- MCP Specification: https://modelcontextprotocol.io/