
Self-Service Agent Quickstart

Table of Contents

  1. Introduction

  2. Prerequisites

  3. Hands-On Quickstart

  4. Performance & Scaling

  5. Security

  6. Going Deeper: Component Documentation

  7. Customizing for Your Use Case

  8. Next Steps and Additional Resources


1. INTRODUCTION

1.1 Who Is This For?

This quickstart guide is designed for:

  • IT teams implementing AI-driven self-service solutions
  • DevOps engineers deploying agent-based systems
  • Solution architects evaluating AI automation platforms
  • Organizations looking to streamline IT processes with generative AI

1.2 The Business Case for AI-Driven IT Self-Service

Many organizations are working to support IT processes through generative AI-based self-service implementations. IT teams at Red Hat have already started on this journey. The team building this quickstart met with those IT teams to incorporate their lessons learned into this guide.

The key value propositions for implementing IT processes with generative AI include:

  • Reduced employee time to complete common requests. The system helps employees understand the options and information required for a request and then helps them submit the request once it is ready.
  • Higher compliance with process standards. Requests will be more complete and better aligned with process standards, reducing the need to contact the requesting employee for additional information and reducing the time and effort needed to review and complete requests.
  • Fewer rejected requests due to missing or incorrect information. Rejected requests are frustrating for employees and lead to lower employee satisfaction. Avoiding request rejection and reducing back-and-forth on requests will improve employee satisfaction.
  • Shorter time to close a ticket. Because requests are more complete and compliant, tickets close faster, improving throughput and reducing ticket idle time.

1.3 Example Use Cases

IT processes that are suitable for automation with generative AI include:

  • Laptop refresh requests
  • Privacy Impact Assessment (PIA)
  • RFP generation
  • Access request processing
  • Software license requests

1.4 What This Quickstart Provides

This quickstart provides the framework, components, and knowledge to accelerate your journey toward deploying generative AI-based self-service implementations. Many AI-based IT process implementations should be able to share common components within an enterprise. Adding agent configuration files, along with additional tools, knowledge bases, and evaluations, completes the implementation for a specific use case. Often, no code changes to the common components are required to support an additional use case.

1.5 What You'll Build

The quickstart provides implementations of the common components along with the process-specific pieces needed to support the laptop refresh IT process as a concrete implementation.

Time to complete: 60-90 minutes (depending on deployment mode)

By the end of this quickstart, you will have:

  • A fully functional AI agent system deployed on OpenShift
  • A working laptop refresh agent with knowledge bases and MCP server tools
  • Experience interacting with the agent via CLI to test complete workflows
  • Completed evaluation runs demonstrating agent quality and business requirements
  • Understanding of distributed tracing for monitoring and troubleshooting
  • (Optional) Slack integration for real-time user conversations
  • (Optional) Email integration for asynchronous notifications
  • (Optional) ServiceNow integration for real ticket creation
  • (Optional) Safety shields for content moderation
  • Understanding of how to customize for your own use cases

Key Technologies You'll Learn

Throughout this quickstart, you'll gain hands-on experience with modern AI and cloud-native technologies:

AI & LLM Technologies:

  • Llama Stack - Inference layer the agent service calls for LLM chat and safety shields
  • LangGraph - State machine used by the agent service to manage conversation flow and prompts
  • MCP (Model Context Protocol) Servers - Tool integrations, such as the ServiceNow MCP server, that agents use to act on external systems

Observability & Evaluation:

  • OpenTelemetry - Distributed tracing for monitoring complex agent interactions
  • Evaluation Framework - AI-specific testing with DeepEval for synthetic conversation generation and business metrics validation

Integration with existing systems:

  • Slack Integration - Real-time conversational interfaces
  • Email Integration - Asynchronous communication workflows
  • ServiceNow Integration - ITSM ticketing system connectivity

Cloud-Native Infrastructure:

  • OpenShift/Kubernetes - Container orchestration and deployment platform
  • Knative Eventing - Event-driven architecture for production deployments
  • Apache Kafka - Distributed event streaming for asynchronous communication
  • Helm - Kubernetes package manager for application deployment
  • PostgreSQL - Database for conversation state and checkpointing

This technology stack provides a production-ready foundation for building scalable, observable AI agent systems.

1.6 Architecture Overview

The self-service agent quickstart provides a reusable platform for building AI-driven IT processes:

Common Platform Architecture

In addition to the base components, the quickstart includes an evaluation framework and integration with OpenTelemetry support in OpenShift for observability.

Why Evaluations Matter:

Generative AI agents are non-deterministic by nature, meaning their responses can vary across conversations even with identical inputs. This makes traditional software testing approaches insufficient. The evaluation framework addresses this challenge by providing capabilities that are crucial for successfully developing and iterating on agentic IT process implementations. The framework validates business-specific requirements—such as policy compliance and information gathering—ensuring agents meet quality standards before deployment and catching regressions during updates.

Why Observability Matters:

Agentic systems involve complex interactions between multiple components—routing agents, specialist agents, knowledge bases, MCP servers, and external systems—making production debugging challenging without proper visibility. The OpenTelemetry integration provides distributed tracing across the entire request lifecycle, enabling teams to understand how requests flow through the system, identify performance bottlenecks, and diagnose issues in production. This visibility is essential for monitoring agent handoffs between routing and specialist agents, debugging failed external system integrations, and understanding user interaction patterns. By integrating with OpenShift's observability stack, teams gain unified monitoring across all platform components alongside their existing infrastructure metrics.

Key Request Flow:

  1. User initiates a request through any communication channel (Slack, Email, API, Web)
  2. Request Manager validates the request and routes it to the routing agent
  3. Routing agent interacts with the user to determine what they need
  4. Routing agent hands the session off to a specialist agent to complete the request
  5. Specialist agent interacts with the user to complete the request using available knowledge bases and MCP servers
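
To make this flow concrete, here is a rough sketch of what step 1 could look like over the raw API channel. The /api/v1/requests path matches the span name shown in the tracing section later in this guide, but the route name and JSON fields below are illustrative assumptions rather than the documented API:

# Illustrative only: the route name and payload fields are assumptions
REQUEST_MANAGER_URL=https://$(oc get route self-service-agent-request-manager \
  -n $NAMESPACE -o jsonpath='{.spec.host}')

curl -s -X POST "$REQUEST_MANAGER_URL/api/v1/requests" \
  -H "Content-Type: application/json" \
  -d '{"user_id": "alice.johnson@company.com", "message": "I need help with my laptop refresh"}'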

1.7 Project Structure

The repository is organized into the following key directories:

Core Services:

  • agent-service/ - AI agent processing service with knowledge base management and LangGraph state machine
  • request-manager/ - Request routing, session management, and unified communication processing
  • integration-dispatcher/ - Multi-channel delivery (Slack, Email, Webhooks)
  • mock-eventing-service/ - Lightweight mock service for testing without Knative infrastructure

MCP Servers:

  • mcp-servers/snow/ - ServiceNow integration MCP server

Mock Services & Test Data:

  • mock-service-now/ - Mock ServiceNow REST API server for testing without real ServiceNow instance
  • mock-employee-data/ - Mock employee and laptop data library for testing integrations

Shared Libraries:

  • shared-models/ - Database models, Pydantic schemas, and Alembic migrations
  • shared-clients/ - Centralized HTTP client libraries for inter-service communication

Evaluation & Testing:

  • evaluations/ - Evaluation framework with conversation flows and metrics
  • test/ - Testing utilities and scripts

Infrastructure & Configuration:

  • helm/ - Kubernetes Helm charts for OpenShift deployment
  • agent-service/config/ - Agent configurations, knowledge bases, and LangGraph prompts
  • tracing-config/ - OpenTelemetry configuration for observability
  • scripts/ - CI/CD and container build scripts

Documentation:

  • guides/ - Step-by-step guides for integrations, deployment, and advanced features
  • docs/ - Technical documentation for developers

For detailed information on each component, see the Component Guide.
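
If you have already cloned the repository (cloning is covered in Section 3.1), you can confirm these directories are present from the repository root; the list below simply echoes the directories described above:

# From the repository root, list the key directories
ls -d agent-service request-manager integration-dispatcher mock-eventing-service \
      mcp-servers mock-service-now mock-employee-data shared-models shared-clients \
      evaluations test helm tracing-config scripts guides docs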

1.8 Laptop Refresh Implementation

The laptop refresh use case is implemented by adding the following components:

  1. MCP server integrated with ServiceNow that can:
    • retrieve existing laptop information for an employee
    • submit a refresh request on behalf of an employee
  2. Laptop refresh knowledge base that includes laptop refresh policy and available laptop options per region
  3. Routing agent that knows how to route to the laptop refresh agent
  4. Laptop refresh agent configured with prompts for the refresh process
  5. A set of evaluations specific to the laptop refresh IT process

User Authentication:

The user's email address is used as the authoritative identifier for the user. Communication channels like Slack and email provide pre-authenticated user identities, eliminating the need for separate authentication. The email address obtained from these channels is passed through the Request Manager to MCP servers, which use it to perform operations on behalf of the user—such as retrieving employee information and submitting refresh requests.

The ServiceNow MCP server authenticates to the ServiceNow backend using a service account, following standard ServiceNow integration patterns.

Example Conversation: Laptop Refresh Process

User: Hi, I'm looking to refresh my laptop, can you please guide me through the process and show me the available options?

Agent: Your laptop, an EliteBook 840 G7, was purchased on 2019-05-12 and is currently 6 years and 5 months old. The company's laptop refresh policy states that standard laptops will be refreshed every 3 years from the date of issuance. Since your laptop is older than 3 years, you are eligible for a laptop refresh. Would you like to proceed with reviewing the available laptop options for your location?

User: Yes, I'd like to proceed with reviewing the available laptop options for my location.

Agent: Here are the available laptop options for your location (NA):

1. Apple MacBook Air M3 (Code: apple_mac_book_air_m_3)

  • Target User: General Office
  • Cost: $1,199 USD
  • Operating System: macOS
  • Display: 13.6" (2560 x 1664)
  • Processor: Apple M3
  • Memory: 8 GB
  • Storage: 256 GB SSD
  • Weight: 2.7 lbs

2. Apple MacBook Pro 14 M3 Pro (Code: apple_mac_book_pro_14_m_3_pro)

  • Target User: Developers
  • Cost: $1,999 USD
  • Operating System: macOS
  • Display: 14.2" (3024 x 1964)
  • Processor: Apple M3 Pro
  • Memory: 18 GB
  • Storage: 512 GB SSD
  • Weight: 3.5 lbs

3. Lenovo ThinkPad T14 Gen 5 Intel (Code: lenovo_think_pad_t_14_gen_5_intel)

  • Target User: General Office
  • Cost: $1,399 USD
  • Operating System: Windows 11 Pro
  • Display: 14" (1920 x 1200)
  • Processor: Intel Core i7-1555U
  • Memory: 16 GB
  • Storage: 512 GB SSD
  • Weight: 3.2 lbs

4. Lenovo ThinkPad P1 Gen 7 (Code: lenovo_think_pad_p_1_gen_7)

  • Target User: Engineers
  • Cost: $2,499 USD
  • Operating System: Red Hat Linux
  • Display: 16" (2560 x 1600)
  • Processor: Intel Core i7-13800H
  • Memory: 32 GB
  • Storage: 1 TB SSD
  • Graphics: NVIDIA RTX A1000
  • Weight: 4.1 lbs

Please select one of the above options. Remember to consider your job role and requirements when making your selection.

User: I'd like to select option 1, the Apple MacBook Air M3, as it seems suitable for my general office needs.

Agent: You've selected the Apple MacBook Air M3. Would you like to proceed with the creation of a ServiceNow ticket for a laptop refresh?

User: Yes, please proceed with creating the ServiceNow ticket for the laptop refresh, and can you also let me know what information I need to provide for the ticket?

Agent: A ServiceNow ticket for a laptop refresh has been created for you. The ticket number is REQ3143724. Your request will be processed by the IT Hardware Team, and you will receive updates via email as the ticket progresses.

Is there anything else I can help you with?

1.9 Customizing for Your Use Case

To adapt this quickstart for your specific IT process:

  • Replace the laptop refresh agent with your specialist agent (e.g., PIA, RFP)
  • Update the routing agent so it can route to your new specialist agent
  • Add MCP servers for your external systems
  • Create a knowledge base with your policies and documentation
  • Build an evaluation suite for your business metrics

Now that you understand the architecture and capabilities of the self-service agent quickstart, the next section will guide you through the prerequisites and setup steps needed to deploy the system on your OpenShift cluster.


2. PREREQUISITES

What You Need

Local Tools:

  • Python 3.12+
  • uv - Fast Python package installer
  • Podman - Container runtime for building images
  • Helm - Kubernetes package manager
  • oc CLI - OpenShift command line tool
  • kubectl CLI - Kubernetes command line tool
  • git - Version control
  • make - Build automation (usually pre-installed on Linux/macOS, see GNU Make)
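
As a quick sanity check, you can confirm these tools are available on your machine (the exact version output will vary):

# Verify the local tooling is installed and on your PATH
python3 --version
uv --version
podman --version
helm version --short
oc version --client
kubectl version --client
git --version
make --version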

Cluster Environment:

Production deployments use Knative Eventing with Kafka; the testing (mock eventing) mode used later in this quickstart does not require these operators. If you are installing Knative Kafka with broker functionality enabled, here is an example of a minimal KnativeKafka CR that you can paste in when following the installation instructions:

kind: KnativeKafka
apiVersion: operator.serverless.openshift.io/v1alpha1
metadata:
  name: knative-kafka
  namespace: knative-eventing
spec:
  broker:
    enabled: true
  source:
    enabled: false # optional, not necessary for the self-service agent
  channel:
    enabled: false # optional, not necessary for the self-service agent
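
Assuming you save the CR above to a file named knative-kafka.yaml, you can apply it and confirm the resource and its pods come up:

# Apply the KnativeKafka CR (the file name is just an example)
oc apply -f knative-kafka.yaml

# Confirm the resource exists and watch the knative-eventing pods start
oc get knativekafka -n knative-eventing
oc get pods -n knative-eventing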

Important: If you experience kafka-webhook-eventing pod crashes due to memory issues (OOM kills), you can configure resource limits in the KnativeKafka CR using the workloads section. Here's an enhanced example with resource limits for the webhook:

kind: KnativeKafka
apiVersion: operator.serverless.openshift.io/v1alpha1
metadata:
  name: knative-kafka
  namespace: knative-eventing
spec:
  broker:
    enabled: true
  source:
    enabled: false
  channel:
    enabled: false
  # Configure resource limits for webhook to prevent OOM kills
  workloads:
  - name: kafka-webhook-eventing
    resources:
    - container: kafka-webhook-eventing
      requests:
        cpu: "100m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"  # Increase if webhook is crashing due to OOM (try 1Gi or 2Gi for heavy load)

Access & Credentials:

  • OpenShift cluster access
  • Container registry (Quay.io or similar) - OPTIONAL, only needed if you want to build and push modified images
  • LLM API endpoint with credentials (Llama 3 70B model)
  • LLM API safety model endpoint with credentials (meta-llama/Llama-Guard-3-8B) - OPTIONAL, only needed if you want to enable safety shields
  • Slack workspace admin access (we provide instructions on how to set up a test instance) - OPTIONAL, only needed if you want to explore the Slack integration
  • ServiceNow instance admin access (we provide instructions on how to set up a test instance) - OPTIONAL, only needed if you want to explore the ServiceNow integration

3. HANDS-ON QUICKSTART

This section walks you through deploying and testing the laptop refresh agent on OpenShift.

3.1 Clone the Repository

First, clone the repository and navigate to the project directory:

# Clone the repository
git clone https://github.com/RHEcosystemAppEng/self-service-agent-blueprint.git

# Navigate to the project directory
cd self-service-agent-blueprint

Expected outcome:

  • ✓ Repository cloned to local machine
  • ✓ Working directory set to project root

3.2 Deploy to OpenShift

Step 1: Choose Your Deployment Mode

For your first deployment, we recommend Testing Mode (Mock Eventing):

  • No Knative operators required
  • Tests event-driven patterns
  • Simpler than production infrastructure

For detailed information about deployment modes, see the Deployment Mode Guide.

Step 2: Set Required Environment Variables

# Set your namespace
export NAMESPACE=your-namespace

# Set LLM configuration
export LLM=llama-3-3-70b-instruct-w8a8
export LLM_ID=llama-3-3-70b-instruct-w8a8
export LLM_API_TOKEN=your-api-token
export LLM_URL=https://your-llm-endpoint

# Set Hugging Face token; 1234 is a placeholder since the token is only
# needed if you want to use a locally hosted LLM
export HF_TOKEN=1234
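
Before moving on, you can confirm the variables are set in your current shell:

# Quick check that the required variables are exported
env | grep -E '^(NAMESPACE|LLM|LLM_ID|LLM_API_TOKEN|LLM_URL|HF_TOKEN)='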

Step 3: Build Container Images (Optional)

If you are using pre-built images (recommended until later steps), skip this step.

# Build all images
# Set the container registry; make sure this is also set when you run
# helm-install-test in later steps
export REGISTRY=quay.io/your-org

make build-all-images

# Push to registry
make push-all-images

Expected outcome: All images built and pushed to registry

Step 4: Deploy with Helm

# Login to OpenShift
oc login --server=https://your-cluster:6443

# Create namespace if needed
oc new-project $NAMESPACE

# Deploy in testing mode (Mock Eventing)
make helm-install-test NAMESPACE=$NAMESPACE

Expected outcome:

  • ✓ Helm chart deployed successfully
  • ✓ All pods running
  • ✓ Routes created

Step 5: Verify Deployment

# Check deployment status
make helm-status NAMESPACE=$NAMESPACE

# Check pods
oc get pods -n $NAMESPACE

# Check routes
oc get routes -n $NAMESPACE

Expected outcome:

  • All pods in Running state
  • Routes accessible
  • Agent service initialization completed successfully
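
If pods are still starting, the commands below are a convenient way to wait for readiness and spot-check a service log:

# Wait for all pods in the namespace to become Ready (up to 5 minutes)
oc wait --for=condition=Ready pod --all -n $NAMESPACE --timeout=300s

# Spot-check a service log for startup errors
oc logs deployment/self-service-agent-integration-dispatcher -n $NAMESPACE --tail=50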

You should now be able to:

  • ✓ Deploy the system to OpenShift
  • ✓ Monitor pods and services
  • ✓ Troubleshoot deployment issues

3.3 Interact with the CLI

Now that the system is deployed, let's interact with the agent through the CLI to test a complete laptop refresh workflow.

A key design philosophy of this quickstart is to "meet employees where they are." Rather than creating new communication channels, the system integrates with existing tools like Slack and Email through a general-purpose request manager. This allows employees to interact using the communication platforms they already know and use daily.

For initial testing and evaluation purposes, the quickstart includes a simple command line interface (CLI). We'll use the CLI for these first interactions to verify the system works correctly.

Step 1: Start Interactive Chat Session

Use the CLI chat script to start an interactive conversation with the agent:

# Get the request manager pod
export REQUEST_MANAGER_POD=$(oc get pod -n $NAMESPACE -l app=self-service-agent-request-manager -o jsonpath='{.items[0].metadata.name}')

# Start interactive chat session
oc exec -it $REQUEST_MANAGER_POD -n $NAMESPACE -- \
  python test/chat-responses-request-mgr.py \
  --user-id alice.johnson@company.com

Expected outcome:

  • Chat client starts in interactive mode
  • Agent sends initial greeting
  • You see a prompt where you can type messages

Step 2: Complete Laptop Refresh Workflow

Follow this conversation flow to test the complete laptop refresh process:

You: I need help with my laptop refresh

Expected:

  • Agent greets you and retrieves your current laptop information
  • Agent checks your eligibility based on the 3-year policy and gives you a summary

You: I would like to see available laptop options

Expected:

  • Agent presents available laptop options for your region (NA, EMEA, APAC, or LATAM)
  • You see 4 laptop options with specifications and pricing

You: I would like option 1, the Apple MacBook Air M3

Expected: Agent confirms your selection and asks for approval to create ServiceNow ticket

You: Yes, please create the ticket

Expected:

  • ServiceNow ticket created
  • Ticket number provided (format: REQ followed by digits)
  • Confirmation message with next steps

You: press Ctrl-C twice

Expected: Chat session ends

Step 3: Test Different User Scenarios

Test with different employee IDs to see varied scenarios:

# Test with different user (LATAM region)
oc exec -it $REQUEST_MANAGER_POD -n $NAMESPACE -- \
  python test/chat-responses-request-mgr.py \
  --user-id maria.garcia@company.com

# Test with user who may not be eligible
oc exec -it $REQUEST_MANAGER_POD -n $NAMESPACE -- \
  python test/chat-responses-request-mgr.py \
  --user-id john.doe@company.com

Note: Only the laptop refresh use case has been implemented. References to other options like changing your email are included to illustrate the capabilities of the routing agent. If you ask the routing agent for those features, you will remain with the routing agent and/or may see failed attempts to route to a specialist agent that does not exist.

Note: The Request Manager retains conversation state across sessions. To restart from the beginning with the same user ID:

  1. Type: reset
  2. Type any message (e.g., hello) to start a fresh conversation

This clears all conversation history and context for that user.

Expected outcome:

  • Different laptop options based on region
  • Different eligibility results based on laptop age
  • Consistent agent behavior across scenarios

You should now be able to:

  • ✓ Interact with agents via CLI using interactive chat
  • ✓ Complete full laptop refresh workflow
  • ✓ Test conversation flows with different users
  • ✓ Verify agent behavior and responses
  • ✓ Test eligibility checking and region-specific options

3.4 Integration with Slack (Optional)

Slack integration enables real-world testing with actual users in your workspace. The quickstart assumes you have an existing Slack instance you can use for testing; otherwise, you can create a development instance by joining the Slack Developer Program.

Step 1: Tell the quickstart about your Slack email

The quickstart uses the user's email as the authoritative user ID, so the deployment needs to know the email associated with your Slack account. If your Slack email matches one of the emails in the MOCK_EMPLOYEE_DATA dictionary in data.py, the system will use the data for that email. Otherwise, you need to set TEST_USERS so that it includes your email. Both mock data lookups and real lookups from a ServiceNow instance will fill in laptop data for the emails set in TEST_USERS.

To configure those emails, export TEST_USERS as follows before running any of the other steps, replacing myemail@emaildomain.com with your email:

export TEST_USERS=myemail@emaildomain.com
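
If you want to see which emails already have mock employee data (to decide whether TEST_USERS is needed), you can grep the mock data library; the exact file layout inside mock-employee-data/ may differ in your checkout:

# Look for the MOCK_EMPLOYEE_DATA dictionary and the emails it contains
grep -rn "MOCK_EMPLOYEE_DATA" mock-employee-data/
grep -rno "[a-z.]*@company.com" mock-employee-data/ | sort -u | head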

Step 2: Set Up Slack App

See SLACK_SETUP.md for detailed instructions on how to configure a Slack bot that will allow you to interact with the agent through Slack.

Step 3: Test Slack Interaction

In your Slack workspace:

  1. Direct message the Slack bot using @Self-Service Agent and say hi
  2. Routing agent responds asking what you would like to do
  3. Respond with 'I would like to refresh my laptop'
  4. Specialist agent responds with laptop information and summary of eligibility
  5. Indicate you would like to proceed
  6. Agent presents available laptop options
  7. Select a laptop: I'd like option 1
  8. Agent confirms the selected laptop and asks if you would like it to create a ServiceNow ticket for you
  9. Say yes
  10. Agent creates ServiceNow ticket and provides ticket number

Just like when you used the CLI earlier, the request manager maintains state and you can use reset to clear the conversation history.

Expected outcome:

  • ✓ Bot responds in Slack thread
  • ✓ Conversation maintains context across multiple messages
  • ✓ Agent retrieves employee laptop info automatically (using Slack email)
  • ✓ Agent shows laptop options for employee's region
  • ✓ Ticket created with confirmation number

You should now be able to:

  • ✓ Interact with agents via Slack

3.5 Integration with Real ServiceNow (Optional)

The quickstart uses the user's email as the authoritative user ID. If you plan to interact with the quickstart through Slack or email, the deployment needs to know the email associated with your Slack or email account, because unlike the command line, those channels do not provide an option to set the email for the user.

The scripts used to set up a real ServiceNow instance need to know the email you will be using with Slack or email so that employee data can be configured for users with those emails.

To configure those emails, export TEST_USERS as follows before running any of the other steps, replacing myemail@emaildomain.com with your email:

export TEST_USERS=myemail@emaildomain.com

Step 1: Create ServiceNow Instance

There are two options for setting up a test ServiceNow instance. We recommend the Automated Setup, but you can use the Manual Setup if you want to better understand how the instance is set up and configured. These guides include the required steps:

Step 2: Configure ServiceNow Credentials

In Step 1 you will have noted these values:

  • SERVICENOW_INSTANCE_URL
  • SERVICENOW_API_KEY
  • SERVICENOW_LAPTOP_REFRESH_ID

Make sure they are exported in your environment, then uninstall and reinstall the quickstart:

# Set ServiceNow configuration
export SERVICENOW_INSTANCE_URL=https://your-instance.service-now.com
export SERVICENOW_API_KEY=your-servicenow-api-key
export SERVICENOW_LAPTOP_REFRESH_ID=your-servicenow-laptop-refresh-id

# redeploy the quickstart
make helm-uninstall NAMESPACE=$NAMESPACE
make helm-install-test NAMESPACE=$NAMESPACE
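
A quick way to confirm the new credentials were rolled out is to check that the ServiceNow MCP server pod restarted cleanly. The pod name should contain "snow", matching the snow-mcp-server service seen in the tracing section; adjust the filter if your release names differ:

# The ServiceNow MCP server pod should be Running with a recent age
oc get pods -n $NAMESPACE | grep -i snow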

Step 3: Test with Real ServiceNow

Use the CLI chat client to initiate a laptop refresh request with your real ServiceNow account:

# Get the request manager pod
export REQUEST_MANAGER_POD=$(oc get pod -n $NAMESPACE -l app=self-service-agent-request-manager -o jsonpath='{.items[0].metadata.name}')

# Start chat session (you can replace the user id with your own email if you set TEST_USERS)
oc exec -it $REQUEST_MANAGER_POD -n $NAMESPACE -- \
  python test/chat-responses-request-mgr.py \
  --user-id alice.johnson@company.com

Then complete the laptop refresh workflow:

You: I need a laptop refresh

You: I would like to see available laptop options

You: I would like option [number]

You: Yes, please create the ticket

Expected outcome:

  • Agent retrieves your actual laptop data from ServiceNow
  • Agent creates real ServiceNow ticket when you confirm
  • Ticket appears in your ServiceNow instance
  • You receive ServiceNow notifications via email

Step 4: Verify Ticket Created in ServiceNow

Take note of the ServiceNow ticket number the agent returns:

Ticket Created Message

Log into your ServiceNow instance and:

  • Go to All -> search for "Requests" -> Click "Requests" link under Service Catalog/Open Records/Requests

Requests

  • Verify the ticket was created with a matching number:

Requests Table

  • Click the requested items link and verify correct user and laptop model are selected:

Requests Table

Requests Table

You should now be able to:

  • ✓ Connect to production ServiceNow instance
  • ✓ Create real tickets from agent conversations
  • ✓ Test end-to-end integration with backend systems
  • ✓ Validate data accuracy in ServiceNow

3.6 Integration with Email (Optional)

Email integration enables two-way communication with the AI agent through email, allowing users to interact with the system via their email client.

The quickstart uses the user's email as the authoritative user ID, so the deployment needs to know the email address associated with your email account. You need to set TEST_USERS so that it includes your email. Both mock data lookups and real lookups from a ServiceNow instance will fill in laptop data for the emails set in TEST_USERS.

To configure those emails, export TEST_USERS as follows before running any of the other steps, replacing myemail@emaildomain.com with your email:

export TEST_USERS="myemail@emaildomain.com"

Step 1: Set Up Email Configuration

See EMAIL_SETUP.md for detailed instructions.

Summary:

  1. Choose an email provider (Gmail, Outlook, or custom SMTP/IMAP)
  2. Get SMTP credentials for sending emails
  3. Get IMAP credentials for receiving emails (optional, for polling)
  4. Configure email account settings (enable IMAP if needed)

Step 2: Update Deployment with Email Credentials

# Add your users to those that will get responses from mock and ServiceNow
# requests. This must match the email from which you will be sending
# an email to the system
export TEST_USERS=myemail@emaildomain.com

# Set email configuration
export SMTP_HOST=smtp.gmail.com
export SMTP_PORT=587
export SMTP_USERNAME=your-email@gmail.com
export SMTP_PASSWORD=your-app-password
export IMAP_HOST=imap.gmail.com
export IMAP_PORT=993

# Uninstall
make helm-uninstall NAMESPACE=$NAMESPACE

# Upgrade Helm deployment with email configuration
make helm-install-test NAMESPACE=$NAMESPACE \
  EXTRA_HELM_ARGS="\
    --set-string security.email.smtpHost=$SMTP_HOST \
    --set-string security.email.smtpPort=$SMTP_PORT \
    --set-string security.email.smtpUsername=$SMTP_USERNAME \
    --set-string security.email.smtpPassword=$SMTP_PASSWORD \
    --set-string security.email.smtpUseTls=true \
    --set-string security.email.fromEmail=$SMTP_USERNAME \
    --set-string security.email.fromName='Self-Service Agent' \
    --set-string security.email.imapHost=$IMAP_HOST \
    --set-string security.email.imapPort=$IMAP_PORT \
    --set-string security.email.imapUseSsl=true \
    --set-string security.email.imapMailbox=INBOX \
    --set-string security.email.imapPollInterval=60 \
    --set-string security.email.imapLeaseDuration=120"

Step 3: Verify Email Integration

Check the Integration Dispatcher health endpoint to confirm email integration is active:

# Check integration health and email capabilities
oc exec deployment/self-service-agent-integration-dispatcher -n $NAMESPACE -- \
  curl -s http://localhost:8080/health/detailed | jq '{integrations_available, email_capabilities: .services.email_capabilities}'

# Look for:
# - "EMAIL" in the integrations_available array
# - email_capabilities showing sending: true and/or receiving: true

Step 4: Test Email Interaction

Send an email to the configured email address (FROM_EMAIL or SMTP_USERNAME):

  1. Send email from your email client to the configured address
  2. Subject: "I need help with my laptop refresh"
  3. Body: "Hi, I'd like to start a laptop refresh request"

Expected outcome:

  • ✓ Email received and processed by Integration Dispatcher
  • ✓ Agent responds via email with greeting and laptop information
  • ✓ Conversation maintains context across email replies
  • ✓ Agent presents laptop options for your region
  • ✓ Ticket created with confirmation sent via email

Step 5: Test Email Threading

Reply to the agent's email to test conversation threading:

  1. Reply to the agent's email (maintains In-Reply-To header)
  2. Continue the conversation: "I'd like to see available laptop options"
  3. Agent responds in the same email thread

This is an example of what the messages may look like (it will depend on your email client):

Email Example

Expected outcome:

  • ✓ Email threading works correctly
  • ✓ Conversation context maintained across multiple emails
  • ✓ Agent remembers previous messages in the thread

You should now be able to:

  • ✓ Interact with agents via email
  • ✓ Receive email notifications and responses
  • ✓ Maintain conversation context through email threads
  • ✓ Test email integration end-to-end

3.7 Run Evaluations

The evaluation framework validates agent behavior against business requirements and quality metrics. Generative AI agents are non-deterministic by nature, meaning their responses can vary across conversations even with identical inputs. Multiple different responses can all be "correct," making traditional software testing approaches insufficient. This probabilistic behavior creates unique challenges:

  • Sensitivity to Change: Small changes to prompts, models, or configurations can introduce subtle regressions that are difficult to detect through manual testing
  • Business Requirements Validation: Traditional testing can't verify that agents correctly follow domain-specific policies and business rules across varied conversations
  • Quality Assurance Complexity: Manual testing is time-consuming and can't cover the wide range of conversation paths and edge cases
  • Iterative Development: Without automated validation, it's difficult to confidently make improvements without risking regressions

The evaluation framework addresses these challenges by combining predefined test conversations with AI-generated scenarios, applying metrics to assess both conversational quality and business process compliance. This was a crucial tool in the development of this quickstart, enabling PR validation, model comparison, prompt evaluation, and identification of common conversation failures.

This section walks you through generating conversations with the deployed system and evaluating them. More detailed information on the evaluation system is in the Evaluation Framework Guide.

Step 1: Configure Evaluation Environment

Start by setting up your environment with references to the LLM that will be used for evaluation. In most cases you will need a model that is as strong as or stronger than the model used for the agent. We recommend llama-3-3-70b-instruct-w8a8, the same model recommended for the agent.

cd evaluations/

# Set LLM endpoint for evaluation (can use different model than agent)
export LLM_API_TOKEN=your-api-token
export LLM_URL=https://your-evaluation-llm-endpoint
export LLM_ID=llama-3-3-70b-instruct-w8a8

uv venv
source .venv/bin/activate
uv sync

Step 2: Run Predefined Conversation Flows

Execute the predefined conversation flows against your deployed agent:

# Run predefined conversations
python run_conversations.py

This runs the pre-defined conversations in evaluations/conversations_config/conversations/.

Expected outcome:

  • ✓ Conversations executed against deployed agent
  • ✓ Results saved to results/conversation_results/
  • ✓ Files like success-flow-1.json, edge-case-ineligible.json

Review a conversation result:

cat results/conversation_results/success-flow-1.json

You should see the complete conversation with agent responses at each turn. This is how you can test conversation flows that can be defined in advance.

Step 3: Generate Synthetic Test Conversations

In addition to the predefined flows, we want to test conversations with more variability. Create additional test scenarios using the conversation generator (generator.py):

# Generate 5 synthetic conversations
python generator.py 5 --max-turns 20

Expected outcome:

  • ✓ 5 generated conversations saved to results/conversation_results/
  • ✓ Diverse scenarios with varied user inputs

Step 4: Evaluate All Conversations

Run the evaluation metrics against all conversation results:

# Evaluate with business metrics
python deep_eval.py

Expected outcome:

  • ✓ Each conversation evaluated against 15 metrics
  • ✓ Results saved to results/deep_eval_results/
  • ✓ Aggregate metrics in deepeval_all_results.json

Step 5: Review Evaluation Results

The results are displayed on screen at the end of the run and are also stored in results/deep_eval_results/deepeval_all_results.json.

# View evaluation summary
cat results/deep_eval_results/deepeval_all_results.json

Key metrics to review:

Standard Conversational Metrics:

  • Turn Relevancy: Are responses relevant to user messages? (Threshold: > 0.8)
  • Role Adherence: Do agents stay within their roles? (Threshold: > 0.5)
  • Conversation Completeness: Were all user requests addressed? (Threshold: > 0.8)

Laptop Refresh Process Metrics:

  • Information Gathering: Did agent collect required data? (Threshold: > 0.8)
  • Policy Compliance: Did agent follow 3-year refresh policy correctly? (Threshold: > 0.8)
  • Option Presentation: Were laptop options shown correctly? (Threshold: > 0.8)
  • Process Completion: Were tickets created successfully? (Threshold: > 0.8)
  • User Experience: Was agent helpful and clear? (Threshold: > 0.8)

Quality Assurance Metrics:

  • Flow Termination: Does conversation end properly? (Threshold: > 0.8)
  • Ticket Number Validation: ServiceNow format (REQ prefix)? (Threshold: 1.0)
  • Correct Eligibility Validation: Accurate 3-year policy timeframe? (Threshold: 1.0)
  • No Errors Reported: No system problems? (Threshold: 1.0)
  • Correct Laptop Options for Location: All region-specific models presented? (Threshold: 1.0)
  • Confirmation Before Ticket Creation: Agent requests approval before creating ticket? (Threshold: 1.0)
  • Return to Router After Task Completion: Proper routing when user says no? (Threshold: 1.0)

Each of these metrics is defined in evaluations/get_deepeval_metrics.py. Metrics tell a judge LLM how to evaluate the conversation. As an example:

        ConversationalGEval(
            name="Policy Compliance",
            threshold=0.8,
            model=custom_model,
            evaluation_params=[TurnParams.CONTENT, TurnParams.ROLE],
            evaluation_steps=[
                "First, review the laptop refresh policy in the additional context below to understand the eligibility criteria. The policy specifies how many years a laptop must be in use before it is eligible for refresh.",
                "Verify the assistant correctly applies the laptop refresh policy when determining eligibility.",
                "If the agent states the laptop age (e.g., '2 years and 11 months old', '5 years old', '3.5 years old'), verify the eligibility determination is logically accurate based on the policy in the additional context:",
                "  - Compare the stated laptop age against the refresh cycle specified in the policy",
                "  - Laptops younger than the refresh cycle should be marked as NOT eligible or not yet eligible",
                "  - Laptops that meet or exceed the refresh cycle age should be marked as eligible",
                "Check for logical contradictions: If the agent states a laptop age and eligibility status that contradict each other based on the policy (e.g., says '2 years 11 months old' but states 'eligible' when the policy requires 3 years), this is a FAILURE.",
                "Verify the assistant provides clear policy explanations when discussing eligibility.",
                f"\n\nadditional-context-start\n{default_context}\nadditional-context-end",
            ],
        ),

When metrics fail, the rationale for the failure will be explained by the judge LLM. An easy way to see an example of this is to run

python evaluate.py --check

which runs known bad conversations to validate that they are flagged as bad by the metrics. The known bad conversations are in evaluations/results/known_bad_conversation_results/. An example of a failure would be:

   ⚠️ wrong_eligibility.json: 1/15 metrics failed (as expected: False)
      Failed metrics:
        • Policy Compliance [Conversational GEval] (score: 0.000) - The conversation completely fails to meet the criteria because the assistant incorrectly determines the user's eligibility for a laptop refresh, stating the laptop is eligible when it is only 2 years and 11 months old, which is less than the 3-year refresh cycle specified in the policy.

Running python evaluate.py --check validates that your model is strong enough to catch the cases covered by the metrics. If you use a weaker model you may find that some of these conversations pass instead of failing. This option was used during development to ensure that as we changed the metrics they still worked as expected.

Step 6: Run Complete Evaluation Pipeline

In the earlier steps we ran each of the evaluation components on their own. Most often we want to run the full pipeline on a PR or after having made significant changes. You can do this with evaluate.py.

Run the full pipeline in one command (this will take a little while):

# Complete pipeline: predefined + generated + evaluation
python evaluate.py --num-conversations 5

Expected outcome:

  • ✓ Predefined flows executed
  • ✓ 5 synthetic conversations generated
  • ✓ All conversations evaluated
  • ✓ Comprehensive results report with aggregate metrics
  • ✓ Identification of failing conversations for debugging

The Makefile includes a number of targets that can be used to run evaluations either on PRs or on a scheduled basis:

# Run a quick evaluation with 1 synthetic conversation
make test-short-resp-integration-request-mgr

# Run evaluation with 20 synthetic conversations
make test-long-resp-integration-request-mgr

# Run evaluation with 4 concurrent sessions for a total of 40 synthetic conversations
make test-long-concurrent-integration-request-mgr

These targets automatically:

  • Set up the evaluation environment
  • Run predefined conversations
  • Generate synthetic conversations (1, 20, or 40 depending on target)
  • Execute all evaluation metrics
  • Display results with pass/fail status

You should now be able to:

  • ✓ Execute evaluation pipelines
  • ✓ Generate synthetic test conversations
  • ✓ Evaluate agent performance with business metrics
  • ✓ Identify areas for improvement
  • ✓ Validate agent behavior before production deployment
  • ✓ Catch regressions when updating prompts or models
  • ✓ Configure your CI to run evaluations

3.8 Follow the Flow with Tracing

Agentic systems involve complex interactions between multiple components—routing agents, specialist agents, knowledge bases, MCP servers, and external systems—making production debugging challenging without proper visibility. Distributed tracing addresses these challenges by providing:

  • End-to-End Request Visibility: Track the complete lifecycle of requests as they flow through Request Manager → Agent Service → Llama Stack → MCP Servers → External APIs
  • Agent Handoff Monitoring: Understand how routing agents hand off sessions to specialist agents and trace the decision-making process
  • Performance Analysis: Identify bottlenecks in the request flow, measure LLM inference time, and optimize knowledge base queries
  • Production Debugging: Diagnose failed external system integrations, understand conversation routing issues, and troubleshoot ticket creation failures
  • User Interaction Patterns: Analyze how users interact with the system across different channels and identify common conversation paths

The system includes OpenTelemetry support for distributed tracing across all components, enabling you to track requests end-to-end through Request Manager, Agent Service, Integration Dispatcher, MCP Servers, and Llama Stack. By integrating with OpenShift's observability stack, you gain unified monitoring across all platform components alongside your existing infrastructure metrics.

Setting up Observability Infrastructure

Before enabling distributed tracing, you need to set up an OpenTelemetry collector to receive, process, and visualize traces.

If you want more detailed information and a deeper understanding, you can check out this quickstart.

For the purpose of this quickstart we've outlined two options for deploying Jaeger in order to collect traces:

  • Option 1: Simple Jaeger All-in-One (Development/Testing)
  • Option 2: OpenShift Observability with Tempo (Production)

You can use either one while following through the quickstart.

Option 1: Simple Jaeger All-in-One (Development/Testing)

This option uses an all-in-one image that includes the collector, storage, query service, and UI in a single container, as outlined in the Jaeger Getting Started Guide. It is not suitable for production; for example, it is limited to in-memory storage.

We've included a Makefile target to make it easy to install and uninstall.

To deploy run:

make jaeger-deploy NAMESPACE=$NAMESPACE

Option 2: OpenShift Observability with Tempo (Production)

For production deployments, use the Red Hat OpenShift distributed tracing platform based on Tempo with persistent storage, high availability, and multi-tenancy.

Key Steps:

  1. Install Red Hat OpenShift distributed tracing platform operator
  2. Deploy TempoStack instance with object storage (S3, Azure Blob, GCS)
  3. Create OpenTelemetry Collector to forward traces to Tempo
  4. Access Jaeger UI through the exposed route

The full steps needed to deploy are outlined in the OpenShift Distributed Tracing Platform Documentation.

Enabling Tracing in Your Deployment

Once your observability infrastructure is ready, enable tracing by setting the OTLP endpoint (as shown after running make jaeger-deploy if you are using Option 1) and redeploying the quickstart:

export OTEL_EXPORTER_OTLP_ENDPOINT=http://your-jaeger-url-as-provided-by-jaeger-deploy:4318
make helm-uninstall NAMESPACE=$NAMESPACE
make helm-install-test NAMESPACE=$NAMESPACE

The endpoint will be automatically propagated to all components.

Accessing and Viewing Traces with the Jaeger UI

Once tracing is enabled and traces are being exported, you can view them using the Jaeger UI. Jaeger is the distributed tracing system used to visualize request flows across all components.

Get the Jaeger UI URL:

# For Jaeger All-in-One
export JAEGER_UI_URL=$(oc get route jaeger-ui -n $NAMESPACE -o jsonpath='{.spec.host}')

# For OpenShift Tempo (uses Jaeger UI)
export JAEGER_UI_URL=$(oc get route tempo-tempo-stack-jaegerui -n openshift-tracing-system -o jsonpath='{.spec.host}')

# Open in browser
echo "Jaeger UI: https://$JAEGER_UI_URL"

View Traces in Jaeger:

  1. Generate traces by interacting with the agent (via CLI, Slack, or API) as described earlier in the quickstart. To use the CLI you can use:

# Get the request manager pod
export REQUEST_MANAGER_POD=$(oc get pod -n $NAMESPACE -l app=self-service-agent-request-manager -o jsonpath='{.items[0].metadata.name}')

# Start chat session with your email
oc exec -it $REQUEST_MANAGER_POD -n $NAMESPACE -- \
  python test/chat-responses-request-mgr.py \
  --user-id alice.johnson@company.com

  2. Open the Jaeger UI in your browser and select service request-manager
  3. Click "Find Traces" to see recent requests
  4. Click on a trace to view the complete flow including:
    • Request Manager → Agent Service → Llama Stack → MCP Servers
    • Knowledge base queries and tool calls
    • Performance timing for each component

Key spans to look for: POST /api/v1/requests, mcp.tool.get_employee_laptop_info, mcp.tool.open_laptop_refresh_ticket

You can also try selecting other services to find traces that involve a particular component. For example, selecting snow-mcp-server and then Find Traces will show you just the traces that interacted with ServiceNow. If you have only completed a single conversation, that would include one trace to look up the employee information and one to create the laptop request.

Note that each user request and response from the agent will be in their own trace.

Troubleshooting: If traces don't appear in Jaeger, verify that OTEL_EXPORTER_OTLP_ENDPOINT is set on the deployments and check the service logs for OpenTelemetry initialization messages.
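
For example, the following commands cover both checks (using the integration dispatcher deployment referenced in section 3.6 for the log check):

# Confirm the endpoint is set on the deployments
oc set env deployment --all --list -n $NAMESPACE | grep OTEL_EXPORTER_OTLP_ENDPOINT

# Look for OpenTelemetry initialization messages in a service log
oc logs deployment/self-service-agent-integration-dispatcher -n $NAMESPACE | grep -iE 'otel|opentelemetry|tracing'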

Cleaning Up:

If you are finished experimenting with traces and used option 1 to install Jaeger you can stop the Jaeger deployment by running:

unset OTEL_EXPORTER_OTLP_ENDPOINT
make jaeger-undeploy NAMESPACE=$NAMESPACE

You can also leave it running if you want to come back to look at traces later on.

Example Trace Hierarchy

A complete laptop refresh request shows spans across all services:

http.request POST /api/v1/requests (request-manager)          [120ms]
  └─ publish_event agent.request (request-manager)            [10ms]
      └─ http.request POST /agent/chat (agent-service)        [95ms]
          ├─ knowledge_base_query laptop-refresh-policy       [15ms]
          ├─ http.request POST /inference/chat (llamastack)   [65ms]
          │   └─ mcp.tool.get_employee_laptop_info            [8ms]
          │       └─ http.request GET servicenow.com/api      [6ms]
          └─ http.request POST /inference/chat (llamastack)   [12ms]
              └─ mcp.tool.open_laptop_refresh_ticket          [8ms]
                  └─ http.request POST servicenow.com/api     [6ms]

Viewing Traces with Jaeger:

Here's what a complete trace looks like in Jaeger:

Tracing Example

In this example, the overall request time (4:16) is dominated by the LLM inference call (3:53).

Or its graph representation:

Tracing Graph

Viewing Traces in OpenShift Console

If you're using OpenShift with the distributed tracing platform (Tempo), you can access traces directly through the OpenShift console under Observe → Traces:

OpenShift Observe Traces

This view provides:

  • Duration Graph: Visual timeline showing trace distribution and duration over time
  • Trace List: Filterable table of all traces with span counts, durations, and timestamps
  • Service Filtering: Ability to filter traces by service (request-manager, agent-service, llamastack, snow-mcp-server, etc.)
  • Quick Access: Click any trace to view detailed span breakdown

Detailed Trace View:

Clicking on a specific trace in the OpenShift console reveals the complete span hierarchy:

OpenShift Observe Trace Detail

This detailed view shows:

  • Waterfall Diagram: Visual representation of all spans with accurate timing and nesting
  • Service Operations: Clear breakdown of operations across request-manager, agent-service, llamastack, and MCP servers
  • Performance Bottlenecks: Easily identify which operations consume the most time (typically LLM inference)
  • Request Flow: See the complete path from initial request through agent processing, knowledge base queries, and external system calls

The OpenShift console integration provides a production-ready interface for monitoring and debugging your AI agent system without requiring separate tooling.

Understanding Trace Context Propagation

The system implements end-to-end trace context propagation:

  1. Client → Request Manager: Automatic via FastAPI instrumentation
  2. Request Manager → Agent Service: Automatic via HTTP client instrumentation
  3. Agent Service → Llama Stack: Automatic via HTTPX instrumentation
  4. Llama Stack → MCP Servers: Manual injection via tool headers (traceparent, tracestate)
  5. MCP Server → External APIs: Automatic via HTTPX instrumentation

All operations share the same trace ID, creating a complete distributed trace.

For detailed implementation information including context propagation mechanisms, decorator usage, and troubleshooting, see the Tracing Implementation Documentation.

You should now be able to:

  • ✓ Set up observability infrastructure (Jaeger or Tempo)
  • ✓ Enable tracing and access Jaeger UI
  • ✓ View and analyze distributed traces across all components
  • ✓ Identify performance bottlenecks in request flows

3.9 Trying out Smaller Prompts

By default the quickstart uses a single large prompt that handles the full conversation flow. However, the quickstart also includes a multi-part prompt in which each individual prompt is more limited. A multi-part prompt gives you more control over the flow, may work with a smaller model, and may require fewer tokens (because each prompt sent to the model is smaller). On the other hand, it may be less flexible and may only handle flows that you have planned for in advance. You can read more about the advantages and disadvantages of the two approaches in the Prompt Configuration Guide.

Step 1: Redeploy with Smaller Prompt

Redeploy with a fresh installation using the smaller prompt configuration:

# Set the smaller prompt configuration
export LG_PROMPT_LAPTOP_REFRESH=/app/agent-service/config/lg-prompts/lg-prompt-small.yaml

# Uninstall the current deployment
make helm-uninstall NAMESPACE=$NAMESPACE

# Reinstall with the new prompt configuration
make helm-install-test NAMESPACE=$NAMESPACE

Expected outcome:

  • ✓ Helm uninstall completes successfully
  • ✓ Helm install completes successfully with new prompt configuration
  • ✓ All pods start and return to Running state

Step 2: Start Interactive Chat Session

Use the CLI chat script to start an interactive conversation with the agent:

# Get the request manager pod
export REQUEST_MANAGER_POD=$(oc get pod -n $NAMESPACE -l app=self-service-agent-request-manager -o jsonpath='{.items[0].metadata.name}')

# Start interactive chat session
oc exec -it $REQUEST_MANAGER_POD -n $NAMESPACE -- \
  python test/chat-responses-request-mgr.py \
  --user-id alice.johnson@company.com

Expected outcome:

  • Chat client starts in interactive mode
  • Agent sends initial greeting
  • You see a prompt where you can type messages

Step 3: Complete Laptop Refresh Workflow

Follow this conversation flow to test the complete laptop refresh process with the multi-part prompt (same workflow as in Section 3.3):

You: I need help with my laptop refresh

Expected:

  • Agent greets you and retrieves your current laptop information
  • Agent checks your eligibility based on the 3-year policy and gives you a summary

You: I would like to see available laptop options

Expected:

  • Agent presents available laptop options for your region (NA, EMEA, APAC, or LATAM)
  • You see 4 laptop options with specifications and pricing

You: I would like option 1, the Apple MacBook Air M3

Expected: Agent confirms your selection and asks for approval to create ServiceNow ticket

You: Yes, please create the ticket

Expected:

  • ServiceNow ticket created
  • Ticket number provided (format: REQ followed by digits)
  • Confirmation message with next steps

You: press Ctrl-C twice

Expected: Chat session ends

Things to observe:

  • Response length and verbosity
  • Conversation flow and naturalness
  • Accuracy of information gathering
  • Policy compliance and decision-making

Behavioral Differences:

While the conversation may seem similar when using the two prompts, the multi-part prompt is more tightly controlled. For example, if you ask a random question it will be handled within the confines of the known states.

Multi-part prompt example:

User: what is the fastest bird in the world
Agent: Please let me know if you'd like to proceed with reviewing the available laptop options (yes/no).

Big prompt example:

User: what is the fastest bird in the world
Agent: I'm happy to help you with your question, but I specialize in laptop refresh and replacement requests. If you'd like to know more about the fastest bird in the world, I can offer to send you back to the routing agent who can connect you with the right specialist. Would you like me to do that?

With the big prompt, we had to specifically instruct it not to answer random questions, whereas the multi-part prompt naturally stays within its defined states.

Cost Comparison:

Another important consideration is token usage and cost. The multi-part prompt uses fewer total tokens since each individual prompt sent to the model is smaller, although it makes more requests to the LLM as it flows through different states. To explore this aspect, you can run evaluations as outlined in Section 3.7 Run Evaluations and compare the application tokens used by the big and multi-part prompts.

Step 4: Switch Back to Default Prompt

To return to the default prompt configuration:

# Unset the custom prompt variable
unset LG_PROMPT_LAPTOP_REFRESH

# Uninstall the current deployment
make helm-uninstall NAMESPACE=$NAMESPACE

# Reinstall with the default prompt configuration
make helm-install-test NAMESPACE=$NAMESPACE

You should now be able to:

  • ✓ Deploy with alternative prompt configurations
  • ✓ Compare agent behavior with different prompts
  • ✓ Experiment with prompt optimization
  • ✓ Understand the impact of prompt design on agent performance

3.10 Setting up Safety Shields (Optional)

Safety shields provide content moderation for AI agent interactions, validating user input and agent responses against safety policies using Llama Guard 3 or compatible models.

Depending on your model, prompting approach, and trust in your end users, they may also be critical for avoiding prompt injection attacks. A common model used with Llama Stack to prevent these types of attacks is PromptGuard. The quickstart currently allows Llama Guard to be easily configured, and we plan to add similar ease of use for PromptGuard in a later version, as we have found that when using Llama 3 70B the protection provided by PromptGuard is needed when using the "big" prompt outlined in section 3.9 Trying out Smaller Prompts.

When to Enable Safety Shields

Consider enabling safety shields for:

  • Customer-facing agents: Public or external user interactions
  • Compliance requirements: Organizations with strict content policies
  • High-risk applications: Agents handling sensitive topics

Note: Safety shields come with the possibility of false positives. False positives that block input or output messages can disrupt the IT process flow and cause process failures. Common safety models like Llama Guard, which are designed for interactions with external users, may not be suited to the content of typical IT processes. We have disabled a number of the categories for which we regularly saw false positives.

In the case of an internal self-service IT agent, we would generally avoid using Llama Guard due to the risk of false positives. On the other hand, we would recommend using something like PromptGuard unless the model being used has sufficient built-in protection against prompt injection.

For development and testing, shields can be disabled for faster iteration.

Step 1: Setup Safety Shield Configuration

Safety shields require an OpenAI-compatible moderation API endpoint that works with Llama Stack's shields. The quickstart supports two options for using safety shields, as outlined in the sections that follow.

Option 1 - shared meta-llama/Llama-Guard-3-8B model

If you have access to a shared meta-llama/Llama-Guard-3-8B deployment, you can configure the quickstart to use it by exporting these environment variables and reinstalling:

# provide information needed to access safety shields
export SAFETY=meta-llama/Llama-Guard-3-8B
export SAFETY_ID=meta-llama/Llama-Guard-3-8B
export SAFETY_URL=https://api.example.com/v1
export SAFETY_API_TOKEN=your-token

Note:

  • Replace https://api.example.com/v1 with your actual moderation API endpoint. The endpoint must support the OpenAI-compatible /v1/moderations API.
  • Set SAFETY_API_TOKEN only if the endpoint requires authentication; otherwise it can be omitted.
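
If you want to sanity-check the endpoint before reinstalling, a minimal smoke test could look like the following. It assumes the endpoint accepts the standard OpenAI-compatible moderations request shape and reuses the variables exported above; adjust the payload if your provider differs.

# Optional: smoke test the moderation endpoint (request shape is an assumption)
curl -s "${SAFETY_URL}/moderations" \
  -H "Authorization: Bearer ${SAFETY_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"${SAFETY_ID}\", \"input\": \"how would I hurt a penguin\"}"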

Option 2 - local meta-llama/Llama-Guard-3-8B model

If you don't have access to a shared meta-llama/Llama-Guard-3-8B and are deploying to an OpenShift AI cluster with GPUs, you can alternatively use the following, which spins up a container running meta-llama/Llama-Guard-3-8B as part of the deployment. SAFETY_TOLERATION must match the taint key on GPU nodes in your OpenShift cluster (e.g., g5-gpu for nodes tainted with g5-gpu=true:NoSchedule). You will also need to provide a valid Hugging Face token, which is needed to download the meta-llama/Llama-Guard-3-8B model.

export HF_TOKEN=your-hugging-face-token
export SAFETY_ID=meta-llama/Llama-Guard-3-8B
export SAFETY_TOLERATION=g5-gpu
export SAFETY=llama-guard-3-8b
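
If you are not sure which taint key to use for SAFETY_TOLERATION, you can inspect the taints on your cluster's nodes, for example:

# List node taints to find the key used on GPU nodes
oc get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints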

Step 2: Configure Agent-Level Shields

The laptop refresh specialist agent uses meta-llama/Llama-Guard-3-8B by default when shields have been enabled. If you want to use another safety shield, you will need to update the configurations in:

  • agent-service/config/agents/routing-agent.yaml
  • agent-service/config/agents/laptop-refresh-agent.yaml
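
One quick way to find the relevant settings before editing is to search those configuration files for shield references (run from the repository root):

# Locate shield-related settings in the agent configuration files
grep -n -i shield agent-service/config/agents/routing-agent.yaml \
  agent-service/config/agents/laptop-refresh-agent.yaml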

Step 3: Deploy with Safety Shields

After configuring the environment variables from either Option 1 (shared model) or Option 2 (local model), deploy with safety shields enabled:

make helm-uninstall NAMESPACE=$NAMESPACE
make helm-install-test NAMESPACE=$NAMESPACE

NOTE: This deployment will take significantly longer if you are using Option 2 (local model), as it has to download Llama Guard from Hugging Face.
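
You can watch the pods in the namespace while the model downloads and the deployment settles:

# Watch pod status until the llamastack and safety model pods are Running
oc get pods -n $NAMESPACE -w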

Step 4: Test Safety Shields

After deploying with shields enabled, test that they're working:

# Check logs for shield initialization
oc logs deployment/llamastack -n $NAMESPACE | grep -i shield

Expected output:

Defaulted container "llama-stack" out of: llama-stack, wait-for-models (init)
         shields:
           shield_id: meta-llama/Llama-Guard-3-8B

You can now run a conversation and see the effect of the safety shield:

export REQUEST_MANAGER_POD=$(oc get pod -n $NAMESPACE -l app=self-service-agent-request-manager -o jsonpath='{.items[0].metadata.name}')

# Start interactive chat session
oc exec -it $REQUEST_MANAGER_POD -n $NAMESPACE -- \
  python test/chat-responses-request-mgr.py \
  --user-id alice.johnson@company.com

If necessary, remember to use reset to restart the conversation. Then, once you reach the laptop refresh specialist, try some messages that could trigger the shields. For example, "how would I hurt a penguin" should result in something like "I cannot help you with that".

Common Shield Categories

Llama Guard 3 checks for these categories:

  • Violent Crimes
  • Non-Violent Crimes
  • Sex-Related Crimes
  • Child Sexual Exploitation
  • Defamation
  • Specialized Advice (Financial, Medical, Legal)
  • Privacy Violations
  • Intellectual Property
  • Indiscriminate Weapons
  • Hate Speech
  • Suicide & Self-Harm
  • Sexual Content
  • Elections
  • Code Interpreter Abuse

For comprehensive safety shields documentation, see the Safety Shields Guide.

You should now be able to:

  • ✓ Configure safety shields for content moderation
  • ✓ Customize shield behavior per agent
  • ✓ Handle false positives with ignored categories
  • ✓ Monitor and troubleshoot shield operations
  • ✓ Balance safety and usability for your use case

3.11 Cleaning up

You can stop the deployed quickstart by running:

make helm-uninstall NAMESPACE=$NAMESPACE

This will remove all deployed services, pods, and resources from your namespace.
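
To confirm the cleanup completed, you can check that no quickstart pods remain:

# Verify the namespace no longer contains quickstart pods
oc get pods -n $NAMESPACE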


4. PERFORMANCE & SCALING

The Self-Service Agent quickstart is designed for scalability using standard Kubernetes and cloud-native patterns. All core components can be scaled using familiar Kubernetes techniques—horizontal pod autoscaling, replica sets, and resource limits—without requiring custom scaling logic or architectural changes.

Component Scaling: The quickstart's services follow standard cloud-native design principles. The services can scale both vertically (multiple uvicorn workers per pod) and horizontally (multiple pod replicas) to handle increased load. MCP servers specifically use stateless streaming HTTP so that they can scale in the same way (unlike the Server-Sent Events transport, whose state limits how you can scale).
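
As a sketch of what those standard techniques look like in practice (the label and deployment name below appear elsewhere in this quickstart, but verify them with oc get deployments -n $NAMESPACE before relying on them):

# Scale the request manager horizontally by its app label
oc scale deployment --selector=app=self-service-agent-request-manager --replicas=3 -n $NAMESPACE

# Or attach a CPU-based horizontal pod autoscaler to a deployment
oc autoscale deployment/llamastack --min=1 --max=4 --cpu-percent=75 -n $NAMESPACE

CPU-based autoscaling is shown only as an illustration; as noted below, most of the time is spent in LLM inference, so scaling the model serving layer usually matters more.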

Infrastructure Scaling: For supporting infrastructure components, apply industry-standard scaling techniques. PostgreSQL databases can leverage connection pooling, read replicas, and vertical scaling following standard PostgreSQL best practices. When using production mode with Knative Eventing, Apache Kafka benefits from standard Kafka scaling strategies including partitioning, consumer groups, and multi-broker clusters. These are well-documented patterns with extensive ecosystem support.

Performance Optimization: Analysis of some evaluation runs shows that 99.7% of request processing time is spent in Llama Stack inference, with the request-manager and event delivery adding only negligible overhead (~12ms total). This means performance optimization efforts should focus primarily on LLM inference scaling—using GPU acceleration to start and selecting appropriately-sized models. The quickstart's architecture ensures that scaling Llama Stack directly translates to end-to-end performance improvements without infrastructure bottlenecks.

For comprehensive scaling guidance, detailed performance characteristics, component-by-component scaling analysis, configuration examples for different deployment sizes, and links to Red Hat and Llama Stack documentation, see the Performance and Scaling Guide.


5. SECURITY

Security is a key aspect of production deployments. While this quickstart works to avoid common security issues, the security requirements and their implementation will often be specific to your organization. A few areas where you will need to extend the quickstart if/when you use it in production include:

  1. Management of sensitive information in logs and traces: The quickstart does not currently redact information from logs or traces. This means you will either need to manage access to traces and logs to account for potentially sensitive information, such as employee names and email addresses, or extend the quickstart to redact information according to your organization's policies.
  2. Credential management: Credentials are set in the quickstart to make it easy to get started and deploy quickly. When deploying to production you will need to manage credentials in accordance with your organization's requirements, including potentially managing them through vaults and planning for credential rotation. These more advanced techniques are not covered in the quickstart.
  3. Database and Kafka configuration: Production configuration and security hardening for components like the database and Kafka are not covered, as these will often be existing components within your organization that have already been configured and hardened to meet its requirements for scaling and security.
  4. Network security: Access to pods within the deployment is restricted by network policy to other pods within the deployment namespace, with the exception of the Kafka namespace and the route that allows Slack to communicate with the deployment. You should still review and apply any standard network policies your organization has for OpenShift deployments; the commands after this list show one way to review what is currently in place.
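
For example, to review what the quickstart has deployed and what you may need to harden for production:

# List the network policies and secrets currently in the deployment namespace
oc get networkpolicy -n $NAMESPACE
oc get secrets -n $NAMESPACE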

6. GOING DEEPER: COMPONENT DOCUMENTATION

Now that you have the system running, you can dive deeper into specific components and concepts.

For detailed component information, see the Component Guide.

Guides

Step-by-step guides for integrations, deployment, and advanced features:

Technical Documentation

Detailed technical documentation for developers:


7. CUSTOMIZING FOR YOUR USE CASE

The laptop refresh example demonstrates all components. This section guides you in adapting the quickstart for your own IT process.

Planning Your Use Case

Step 1: Define Your IT Process

Questions to answer:

  • What IT process are you automating? (PIA, RFP, access requests, etc.)
  • What are the steps a user goes through?
  • What information does the agent need to collect?
  • What systems does the agent need to interact with?
  • What policies or rules govern the process?
  • How do you measure success?

Example: Privacy Impact Assessment (PIA)

Process steps:

  1. User requests PIA assessment
  2. Agent asks about project details (name, scope, data types)
  3. Agent asks privacy-specific questions
  4. Agent evaluates risk level based on responses
  5. Agent generates PIA document
  6. Agent submits to compliance team

Step 2: Identify Required Integrations

For each external system, determine:

  • What data do you need to read?
  • What actions do you need to perform?
  • Does an API exist?
  • What authentication is required?

Example: PIA Assessment

  • Compliance system API: Submit PIA documents
  • HR system: Get employee and project info
  • Document storage: Save generated PIAs
  • Email: Notify compliance team

Step 3: Map Knowledge Requirements

What knowledge does the agent need?

  • Policy documents
  • Process guidelines
  • Templates
  • FAQs
  • Legal/compliance requirements

Example: PIA Assessment

  • Privacy laws and regulations
  • PIA question templates
  • Risk assessment criteria
  • Data classification guidelines
  • Example PIAs for reference

Step 4: Define Success Metrics

How will you evaluate the agent?

  • Process completion rate
  • Information accuracy
  • Policy compliance
  • User satisfaction
  • Time to completion

Example: PIA Assessment

  • Did agent ask all required privacy questions?
  • Was risk level assessed correctly?
  • Did generated PIA meet compliance standards?
  • Was submission successful?

8. NEXT STEPS AND ADDITIONAL RESOURCES

8.1 What You've Accomplished

By completing this quickstart, you have:

  • ✓ Deployed a fully functional AI agent system on OpenShift
  • ✓ Understood the core platform architecture and components
  • ✓ Tested the laptop refresh agent through multiple channels
  • ✓ Run evaluations to validate agent behavior
  • ✓ Learned how to customize the system for your own use cases

8.2 Recommended Next Steps

For Development Teams:

  1. Review the Contributing Guide for development setup and workflow
  2. Explore the component documentation in Section 6 for deeper technical details
  3. Review the evaluation framework to understand quality metrics
  4. Experiment with customizing the laptop refresh agent prompts
  5. Set up observability and monitoring for your deployment

For Organizations Planning Production Deployment:

  1. Plan your transition from testing mode to production mode (Knative Eventing)
  2. Identify your first use case for customization
  3. Establish evaluation criteria and quality metrics for your use case
  4. Plan integration with your existing IT service management systems

For Customizing to Your Use Case:

  1. Follow the planning guide in Section 7
  2. Review the laptop refresh implementation as a reference in the Component Guide
  3. Start with agent configuration and knowledge base development
  4. Build MCP servers for your external systems
  5. Develop use-case-specific evaluation metrics

Thank you for using the Self-Service Agent Quickstart! We hope this guide helps you successfully deploy AI-driven IT process automation in your organization.
