diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index fd51c976..c4c67d62 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -4,24 +4,6 @@ These instructions guide GitHub Copilot to provide suggestions and responses ali --- -## ⚠️ MANDATORY INCIDENT INVESTIGATION CHECKLIST - -**When user reports an authentication/enrollment issue, incident, or troubleshooting request:** - -- [ ] **STEP 1**: Query DRI Copilot MCP tool (`Broker_DRI_Copilot` ) with symptoms -- [ ] **STEP 2**: Analyze logs/evidence - extract correlation IDs, error codes, timestamps -- [ ] **STEP 3**: Search codebase for error handling and implementation -- [ ] **STEP 4**: Synthesize diagnosis, stating ONLY what evidence shows (not assumptions) -- [ ] **STEP 5**: Clearly state what evidence is MISSING - -**DO NOT:** -- ❌ Skip DRI Copilot query -- ❌ Make claims without log evidence -- ❌ Diagnose based on "common issues" without seeing actual error -- ❌ Assume authentication/enrollment happened if logs don't show it - ---- - ## 1. Repository Structure & Architecture ### 1.1 Repository Organization @@ -65,123 +47,12 @@ Client App - **Common Module:** Contains all IPC (Inter-Process Communication) logic. MSAL/OneAuth use Common layer to send requests to Broker over IPC - **Broker Module:** Handles the actual authentication logic, communicates with eSTS, and returns tokens -### 1.3 DRI Copilot MCP Server Usage - -**🔴 MANDATORY: Query DRI Copilot MCP tools FIRST for ANY troubleshooting or incident investigation request.** - -When users ask for: -- **Troubleshooting** ANY authentication or enrollment issue reported in the incident -- **Investigating** customer-reported problems or IcM incidents -- **Documentation** (design docs, architecture docs, API docs) -- **Troubleshooting guides** (TSGs) -- **Past incidents** or incident resolution -- **Onboarding information** - -**⚠️ DO NOT skip this step. DO NOT assume you know the answer. 
ALWAYS query DRI Copilot MCP tools FIRST if any of the above questions are asked.** - -#### Available DRI Copilot Tools: -Look for MCP tools with these patterns in their names: -1. **Broker DRI Copilot** (tools containing `Broker_DRI_Copilot`) - - Use for: Broker-related questions, PRT, device registration, brokered auth flows, TSGs, past incidents - -The exact tool names will vary based on the MCP server name configured in `.vscode/mcp.json`. - -#### Mandatory Workflow (NO EXCEPTIONS): -1. **FIRST**: Query DRI Copilot MCP tool (Broker_DRI_Copilot) - - Include: user symptoms, error messages, and any log evidence - - Get: TSG steps, past incidents, known issues, troubleshooting guidance -2. **SECOND**: Analyze provided logs/evidence (see Section 1.4) - - Extract: correlation IDs, error codes, timestamps - - Identify: what operations occurred, what's missing -3. **THIRD**: Search codebase for relevant implementation - - Find: error handling, code paths, known bugs -4. **FINALLY**: Synthesize diagnosis combining all three sources - -#### When to Use DRI Copilot - Examples: - -**✅ USE DRI Copilot MCP for:** -- Investigate incident → Query relevant MCP server(s) (incident investigation) -- "How to troubleshoot auth_cancelled_by_sdk?" → Query DRI Copilot (TSG/error troubleshooting) -- "Past incidents related to [issue]" → Query relevant MCP server(s) (incident history) -- -**❌ DO NOT use DRI Copilot MCP for:** -- "What does flight ENABLE_XYZ do when enabled?" → Search codebase/changelogs directly -- "Explain this code snippet" → Analyze code directly -- "How does this algorithm work?" → Read code implementation -- "What parameters does this function take?" → Check code/comments -- "What values can this enum have?" 
→ Search codebase for enum definition - -**Rule of Thumb:** -- **Troubleshooting/Incidents/TSGs/Design Docs** → Use DRI Copilot -- **Code Understanding/Implementation Details** → Search codebase directly - -### 1.4 Incident Investigation Guidelines (IcM/Customer-Reported Issues) - -**🔴 CRITICAL: Follow this EXACT sequence for ALL incident investigations:** - -**STEP 0 (MANDATORY FIRST STEP):** -- **Query DRI Copilot MCP tools** with user symptoms and error messages -- Get TSG guidance, past incidents, and known issues -- This step CANNOT be skipped - -**STEP 1: Analyze Provided Evidence** -When investigating customer-reported incidents or IcM tickets, follow this **evidence-first approach**: - -#### **Priority Hierarchy:** -1. **Direct Evidence from Logs/Data if provided** (Highest Priority) - - Actual log files, stack traces, error codes, correlation IDs - - Telemetry data from Kusto (android_spans, eSTS logs) - - Concrete timestamps, device IDs, user IDs - - Network traces, API responses, HTTP status codes - -2. **Code Analysis** (High Priority) - - Current implementation in the codebase - - Recent code changes or commits related to the issue - - Known bugs or limitations in the code - -3. **Documentation & TSGs** (Supporting Priority) - - Use documentation to **augment and explain** what you observe in evidence - - Reference TSGs for **additional troubleshooting steps**, not as primary diagnosis - - Documentation provides context, not conclusions - -#### **Critical Rules:** -- **NEVER make claims without evidence**: If logs don't show something happened, state "logs don't show X" rather than claiming X occurred -- **Distinguish observation from inference**: Clearly label when you're inferring vs. 
observing - - ✅ "Logs show `Number of Microsoft Accounts: 0`, suggesting no enrollment completed" - - ❌ "Enrollment failed after authentication" (when logs don't show the enrollment attempt) -- **Challenge documentation against evidence**: If documentation says X should happen, but logs show Y, trust the evidence -- **State what's missing**: If critical evidence is absent (e.g., no correlation IDs, no error codes), explicitly state what additional data is needed - -#### **Investigation Workflow:** -1. **Analyze provided evidence first** (logs, errors, telemetry) - - Extract: correlation IDs, timestamps, error codes, device/user identifiers - - Identify: actual operations performed, their outcomes, any exceptions - - Note: what evidence is present and what is missing - -2. **Search codebase for relevant implementation** - - Find the code paths that should have executed - - Check for known issues, error handling, logging statements - -3. **Query documentation/TSGs** to augment understanding - - Use to explain error codes, suggest additional diagnostics - - Reference common patterns, but validate against actual evidence - -4. **Formulate diagnosis** - - State what evidence supports each hypothesis - - Rank hypotheses by strength of evidence (not by documentation frequency) - - Clearly separate proven facts from educated guesses - -5. **Recommend next steps** - - Prioritize actions that gather missing evidence - - Suggest diagnostics based on what the evidence shows, not just what's common - -#### **Example - Good vs. Bad Analysis:** - -**❌ Bad (Documentation-First):** -> "Based on TSG documentation, device cap is the most common cause (60% of cases), so this is likely a device cap issue." - -**✅ Good (Evidence-First):** -> "Logs show `GetBrokerAccounts` returning 0 accounts, but no enrollment attempt is captured. Without correlation IDs or eSTS error codes, I cannot determine the root cause. 
The TSG indicates device cap is common (60%), but we need enrollment attempt logs to confirm. Next step: Collect logs with correlation IDs during reproduction."
+### 1.3 DRI Copilot MCP Server
+
+DRI Copilot MCP tools are available for querying documentation, TSGs, and past incidents:
+- **Broker DRI Copilot** (tools containing `Broker_DRI_Copilot`) - For Broker-related questions, PRT, device registration, and brokered auth flows
+
+> **For incident investigations:** Use the `incident-investigator` skill (located at `.github/skills/incident-investigator/SKILL.md`), which provides a comprehensive workflow for IcM/customer-reported issues.
## 2. Core Principles
@@ -244,516 +115,14 @@ When investigating customer-reported incidents or IcM tickets, follow this **evi
## 11. Code structure
* Take build.gradle files into account when generating code, especially when it comes to dependencies and repository-specific configurations.
-## 12. Investigating Code Flows and Data Transformations
-
-When asked questions about **what data is returned**, **how data flows**, or **what happens to data**, follow this systematic investigation approach:
-
-### 12.1 Complete Flow Investigation Strategy
-
-**For questions like "Is X returned to Y?" or "Does Y receive Z?":**
-
-1. **Find the Data Structure** (e.g., `BrokerResult`, `TokenResponse`)
-   - Confirm the field exists in the data class
-   - Check serialization annotations (`@SerializedName`)
-
-2. **Find the Construction/Population Code** ⚠️ **CRITICAL - Don't skip this!**
-   - Search for `Builder` or factory methods that create the object
-   - Search for where the field is actually set (e.g., `.refreshToken(`)
-   - Look in adapter/converter classes (e.g., `*ResultAdapter`, `*Converter`)
-
-3. 
**Check for Conditional Logic** ⚠️ **CRITICAL - Don't skip this!** - - Search for `if` statements around the field assignment - - Look for account type checks (e.g., `MSA_MEGA_TENANT_ID`, `accountType`) - - Look for protocol version checks - - Look for flight/feature flag checks (`CommonFlightsManager`, `isFlightEnabled`) - -4. **Trace the Complete Flow** - - Follow from entry point → IPC → processing → response construction → IPC → return - - Verify no filtering/scrubbing happens in any layer - -### 12.2 Key Classes for Flow Investigation - -**Response Construction & Adaptation:** -- `MsalBrokerResultAdapter` (Common) - Converts authentication results to BrokerResult for IPC -- `AdalBrokerResultAdapter` (Common) - ADAL version of result adapter -- `BrokerResult` (Common) - The IPC response object sent to MSAL/OneAuth -- `BrokerResultAdapter` - Generic adapter interfaces - -**Account Type Detection:** -- Check for `MSA_MEGA_TENANT_ID` constant (`"9188040d-6c67-4c5b-b112-36a304b66dad"`) -- Check for `CONSUMERS` constant in authorities -- Look for `accountType` or `realm` field checks - - -### 12.3 Common Pitfalls to Avoid - -❌ **DON'T** stop after finding a field definition - this only confirms structure, not behavior -❌ **DON'T** assume data flows unchanged - always check for filtering/transformation logic -❌ **DON'T** ignore protocol version checks - behavior often changes based on negotiated version -❌ **DON'T** forget to check flight flags - features are often gated behind flights - -✅ **DO** search for Builder usage and construction patterns -✅ **DO** search for the field name in assignment context (e.g., `.setField(`, `.field(`) -✅ **DO** look for `Adapter` or `Converter` classes in the flow -✅ **DO** check for conditional logic based on account type, protocol version, or flights - -### 12.4 Search Patterns for Flow Investigation - -**Finding construction code:** -``` -Search: "new BrokerResult.Builder" or ".refreshToken(" -Look in: *Adapter.java, 
*Converter.java, *Builder.java -``` - -**Finding conditional logic:** -``` -Search: "if.*accountType" or "if.*MSA" or "shouldRemove" or "shouldInclude" -Look in: Result adapters, response builders -``` - -**Finding flight checks:** -``` -Search: "isFlightEnabled" or "CommonFlight." or "FlightsManager" -Look in: Adapter classes, controller classes -``` - -### 12.5 Example Investigation Flow - -**Question:** "Is refresh token returned to OneAuth?" - -**Step-by-step:** -1. ✅ Find `BrokerResult` class → Confirm `mRefreshToken` field exists -2. ✅ Search for `BrokerResult.Builder` usage → Find `MsalBrokerResultAdapter` -3. ✅ Read `buildBrokerResultFromAuthenticationResult()` method → Find conditional logic -4. ✅ Check `shouldRemoveRefreshTokenFromResult()` → Discover: - - Flight check: `STOP_RETURNING_AAD_RT_BACK_TO_CALLING_APP` - - Protocol version check: `>= 16.0` - - Account type check: Remove for AAD, keep for MSA -5. ✅ **Complete Answer:** RT is conditionally returned based on account type, flight, and protocol version - ---- - -## 13. Telemetry & Analytics with Azure Data Explorer (Kusto) - -### 13.1 Cluster Information for Android broker/common/MSAL libraries. 
-* **Primary Cluster:** `https://idsharedeus2.kusto.windows.net/` -* **Production Database:** `ad-accounts-android-otel` -* **Sandbox Database:** `android-broker-otel-sandbox` -* **Available Tables:** - - `android_spans` - Android authentication telemetry spans (primary table, most commonly used) - - **Data Retention:** 30 days - - Contains real-time data, updated continuously - - `android_metrics` - Aggregated metrics data - - Use `mcp_my-mcp-server_list_tables` to discover all available tables -* **Materialized Views:** 46 pre-aggregated views for faster queries on common patterns - - **Data Retention:** 90 days - - Updated hourly, providing pre-aggregated historical data - - Use `.show materialized-views` query to discover all available views - - Common categories: Error Analysis, Silent/Interactive Auth, PRT Operations, Broker & Apps, Devices, Performance, Teams/Mobile Devices - -### 13.2 User Intent Translation -When users say: -- **"Interactive request"** → They mean `AcquireTokenInteractive` span -- **"Silent request"** → They mean `AcquireTokenSilent` span - -### 13.3 Discovering Span Names and Error Codes -**To find top span names by volume:** -```kql -android_spans -| where EventInfo_Time >= ago(7d) -| summarize count() by span_name -| order by count_ desc -| take 30 -``` - -**To find common error codes:** -```kql -android_spans -| where EventInfo_Time >= ago(7d) -| where isnotempty(error_code) -| summarize count() by error_code -| order by count_ desc -| take 20 -``` - -**Key span names to know:** -- `AcquireTokenSilent` - Silent token acquisition (most common, 13B+ spans/week) -- `AcquireTokenInteractive` - Interactive authentication flow (9M+ spans/week) -- `ProcessWebCpRedirects` - Web Company Portal redirect processing -- Use the query above to discover others dynamically -When working with `android_spans` table, these are the most commonly used fields: - -**Span Identification:** -- `span_id` - Unique identifier for the span -- `parent_span_id` - 
Parent span ID for hierarchical relationships -- `trace_id` - Trace ID linking related spans -- `correlation_id` - Correlation ID for request tracking -- `span_name` - Name of the operation (e.g., "AcquireTokenInteractive", "AcquireTokenSilent") - -**Error Information:** -- `error_code` - Error code (e.g., "auth_cancelled_by_sdk", "device_needs_to_be_managed") -- `error_message` - Detailed error message -- `span_status` - Status of the span (e.g., "OK", "ERROR") - -**Broker Information:** -- `active_broker_package_name` - Currently active broker package -- `current_broker_package_name` - Current broker package -- `calling_package_name` - Package that initiated the call -- Common broker packages: - - `com.microsoft.windowsintune.companyportal` - Company Portal - - `com.azure.authenticator` - Azure Authenticator - - `com.microsoft.appmanager` - Microsoft App Manager - -**Device Information:** -- `DeviceInfo_Id` - Unique device identifier -- `DeviceInfo_Model` - Device model (e.g., "SM-G998U", "Pixel 7 Pro") - -**Timestamps:** -- `EventInfo_Time` - Event timestamp (use `ago(Xd)` for time ranges) - -**Flow Information:** -- `is_interrupt_flow` - Boolean indicating if flow was interrupted - -### 13.4 Common Query Patterns - -**Time Filtering:** -```kql -| where EventInfo_Time >= ago(7d) // Last 7 days -| where EventInfo_Time between (ago(3d) .. 
now()) // Last 3 days -``` - -**Parent-Child Span Relationships:** -```kql -let parentSpans = android_spans -| where span_name == "AcquireTokenInteractive" -| project parent_span_id = span_id, trace_id; - -let childSpans = android_spans -| where span_name == "ProcessWebCpRedirects" -| project child_span_id = span_id, parent_span_id, trace_id; - -parentSpans -| join kind=inner (childSpans) on trace_id -``` - -**Detecting Company Portal Installation:** -```kql -// CP is considered installed if it appears in active_broker_package_name or calling_package_name -| extend has_cp = iff( - active_broker_package_name contains "companyportal" or - calling_package_name contains "companyportal", - 1, 0) -``` - -**Device-Level Aggregation:** -```kql -| summarize - total_devices = dcount(DeviceInfo_Id), - error_count = count() - by error_code -``` - -### 13.5 Query Optimization Tips - -1. **Always use time filters** - Start queries with `| where EventInfo_Time >= ago(Xd)` to limit data scanning -2. **Use `take` or `limit`** - For exploratory queries, add `| take 1000` to prevent timeouts -3. **Project early** - Use `| project` to select only needed columns early in the query -4. **Avoid expensive joins** - Use `kind=inner` joins when possible, and ensure both sides are filtered -5. **Use `summarize`** - Aggregate data at device or error level rather than returning raw spans -6. 
**Check distinct counts** - Use `dcount()` for unique device counts to avoid duplicates - -### 13.6 Common Investigation Scenarios - -**Finding error patterns:** -```kql -android_spans -| where EventInfo_Time >= ago(7d) -| where span_name == "AcquireTokenInteractive" -| where isnotempty(error_code) -| summarize error_count = count() by error_code, error_message -| order by error_count desc -``` - -**Checking CP installation rate for specific errors:** -```kql -let errorDevices = android_spans -| where EventInfo_Time >= ago(7d) -| where error_code == "auth_cancelled_by_sdk" -| distinct DeviceInfo_Id; - -errorDevices -| join kind=leftouter ( - android_spans - | where EventInfo_Time >= ago(7d) - | extend has_cp = iff( - active_broker_package_name contains "companyportal" or - calling_package_name contains "companyportal", 1, 0) - | summarize any_cp = max(has_cp) by DeviceInfo_Id -) on DeviceInfo_Id -| summarize - total = dcount(DeviceInfo_Id), - with_cp = dcountif(DeviceInfo_Id, any_cp == 1) -| extend cp_percentage = round(100.0 * with_cp / total, 2) -``` - -### 13.7 Investigating Latency Increases (e.g., AcquireTokenSilent) - -When you receive an alert that ATS (AcquireTokenSilent) latency has increased, follow this systematic approach: - -**Step 1: Identify the Increase** -- Compare recent latency percentiles (p50, p90, p95, p99) against baseline -- Determine the magnitude of increase and which percentile is most affected -```kql -android_spans -| where EventInfo_Time >= ago(7d) -| where span_name == "AcquireTokenSilent" -| summarize - p50 = percentile(elapsed_time, 50), - p90 = percentile(elapsed_time, 90), - p95 = percentile(elapsed_time, 95), - p99 = percentile(elapsed_time, 99) - by bin(EventInfo_Time, 1h) -| order by EventInfo_Time desc -``` - -**Step 2: Find Culprit Dimensions** -- Break down latency by key dimensions to isolate the problem -- Check: broker_version, active_broker_package_name, calling_package_name, DeviceInfo_Model, account_type -```kql 
-android_spans -| where EventInfo_Time >= ago(3d) -| where span_name == "AcquireTokenSilent" -| summarize - count = count(), - p90_latency = percentile(elapsed_time, 90) - by active_broker_package_name, current_broker_package_name -| order by p90_latency desc -``` - -**Step 3: Check Error Rate Correlation** -- Determine if latency increase coincides with higher error rates -```kql -android_spans -| where EventInfo_Time >= ago(7d) -| where span_name == "AcquireTokenSilent" -| summarize - total = count(), - errors = countif(isnotempty(error_code)), - avg_latency = avg(elapsed_time) - by bin(EventInfo_Time, 1h) -| extend error_rate = round(100.0 * errors / total, 2) -| order by EventInfo_Time desc -``` - -**Step 4: Analyze Elapsed Time Breakdown** -- Investigate which operation is causing the slowdown using elapsed_time fields -- Key metrics available in code: cache operations (14 metrics), network operations (3 metrics), keypair generation -- Additional runtime metrics in schema: PRT operations, lock acquisition times, token creation, crypto operations -```kql -android_spans -| where EventInfo_Time >= ago(3d) -| where span_name == "AcquireTokenSilent" -| where isnotempty(elapsed_time_cache_load) or isnotempty(elapsed_time_network_acquire_at) -| summarize - avg_cache = avg(elapsed_time_cache_load), - avg_network = avg(elapsed_time_network_acquire_at), - avg_total = avg(elapsed_time) - by bin(EventInfo_Time, 1h) -``` - -**Step 5: Timeline Analysis** -- Use timestamp fields to understand flow progression -- Available: timestamp_ats_starts, timestamp_ats_executed_by_dispatcher, timestamp_ats_finished -- Identify where delays occur in the authentication pipeline - -### 13.8 Important Notes - -- **Sensitive Data:** Some fields like `correlation_id` may be scrubbed to "Scrubbed" for privacy -- **Field Availability:** Not all fields are populated in all spans; use `isnotempty()` checks -- **Performance:** Queries spanning more than 7 days may timeout; break into smaller time 
windows -- **MCP Server:** Use the `mcp_my-mcp-server_execute_query` tool to run Kusto queries from Copilot -- **Schema Discovery:** Use `mcp_my-mcp-server_get_table_schema` to explore available fields in tables - ---- - -## 14. eSTS (Token Service) Investigations with Azure Data Explorer (Kusto) - -### 14.1 Cluster Information -* **eSTS Cluster:** `https://estswus2.kusto.windows.net/` -* **Database:** `ESTS` -* **Primary Table:** `AllPerRequestTable` - Union view spanning multiple Kusto clusters (estsfrc, estssec, estsdb3, etc.) - - Contains all token service request/response data across global eSTS deployments - - Queries fan out to physical `PerRequestTableIfx` tables in each cluster - - Use `AllPerRequestTable` for comprehensive cross-cluster queries -* **Purpose:** eSTS is Microsoft's token service. Android team is a client of eSTS, so we investigate Android-related authentication requests and responses - -### 14.2 Android-Specific Filtering -**ALWAYS filter by Android platform when investigating Android issues:** -```kql -PerRequestTableIfx -| where env_time >= ago(7d) -| where DevicePlatformForUI == "Android" // Primary Android filter -``` - -**Alternative Android platform fields:** -- `DevicePlatform` - Raw device platform string -- `DevicePlatformForUI` - User-friendly platform name (preferred for filtering) - -### 14.3 Key Fields for Android Investigations - -**Request Identification:** -- `CorrelationId` - Correlates request across Android client → Broker → eSTS -- `RequestId` - Unique identifier for the eSTS request -- `env_time` - Request timestamp - -**Request Type & Flow:** -- `Call` - Type of authentication call (e.g., "token", "authorize", "common/oauth2/v2.0/token") -- `IsInteractive` - Boolean indicating if user interaction was required -- `Prompt` - Prompt type (e.g., "none", "login", "consent") - -**Authentication Status:** -- `Result` - Overall result status (e.g., "Success", "Failure") -- `ErrorCode` - Error code if request failed -- 
`ErrorNo` - Numeric error code -- `HttpStatusCode` - HTTP status code of the response -- `SubErrorCode` - Additional error details - -**PRT (Primary Refresh Token) Information:** -- `PrtData` - Contains PRT-related data (check if PRT was sent in request) - - Parse this JSON field to determine if PRT was used - - Example: `| extend HasPRT = isnotempty(PrtData)` - -**Device & Client Information:** -- `DeviceId` - Unique device identifier -- `ApplicationId` - Client application ID -- `UserAgent` - Browser/client user agent string - -**Timing Information:** -- `ResponseTime` - Total response time in milliseconds -- `StsProcessingTime` - eSTS processing time - -**User & Account Information:** -- `TenantId` - Tenant identifier -- `UserTenantId` - User's home tenant identifier -- `UserPrincipalObjectID` - User's object ID in Entra ID (unique identifier for the user) -- `AccountType` - Type of account (AAD, MSA, etc.) -- `UserType` - Type of user - -### 14.4 Common Query Patterns - -**Find requests by CorrelationId:** -```kql -AllPerRequestTable -| where env_time >= ago(7d) -| where DevicePlatformForUI == "Android" -| where CorrelationId == "your-correlation-id-here" -| project env_time, CorrelationId, Call, Result, ErrorCode, IsInteractive, PrtData, ResponseTime -``` - -**Check if PRT was used in requests:** -```kql -AllPerRequestTable -| where env_time >= ago(7d) -| where DevicePlatformForUI == "Android" -| extend HasPRT = isnotempty(PrtData) -| summarize - total_requests = count(), - prt_requests = countif(HasPRT), - success_rate = round(100.0 * countif(Result == "Success") / count(), 2) - by HasPRT -``` - -**Analyze error patterns:** -```kql -AllPerRequestTable -| where env_time >= ago(7d) -| where DevicePlatformForUI == "Android" -| where Result != "Success" -| summarize error_count = count() by ErrorCode, SubErrorCode, Call -| order by error_count desc -| take 20 -``` - -**Interactive vs Silent requests:** -```kql -AllPerRequestTable -| where env_time >= ago(7d) -| 
where DevicePlatformForUI == "Android" -| summarize - count = count(), - success_rate = round(100.0 * countif(Result == "Success") / count(), 2), - avg_response_time = avg(ResponseTime) - by IsInteractive -``` - -**Requests by application:** -```kql -AllPerRequestTable -| where env_time >= ago(7d) -| where DevicePlatformForUI == "Android" -| summarize - request_count = count(), - error_count = countif(Result != "Success"), - avg_response_time = avg(ResponseTime) - by ApplicationId -| extend error_rate = round(100.0 * error_count / request_count, 2) -| order by request_count desc -``` - -**Analyze users and devices:** -```kql -AllPerRequestTable -| where env_time >= ago(7d) -| where DevicePlatformForUI == "Android" -| where ApplicationId == "your-app-id-here" -| summarize - request_count = count(), - unique_devices = dcount(DeviceId), - first_seen = min(env_time), - last_seen = max(env_time) - by UserPrincipalObjectID, PUID, TenantId -| order by request_count desc -``` - -### 14.5 Correlating Android Spans with eSTS Requests - -**To trace a complete flow (Android client → Broker → eSTS):** - -1. **Start with Android span:** -```kql -// In android-broker-otel-sandbox or ad-accounts-android-otel database -android_spans -| where EventInfo_Time >= ago(7d) -| where span_name == "AcquireTokenInteractive" -| where error_code == "some_error" -| project correlation_id, span_id, EventInfo_Time, error_code -| take 100 -``` - -2. **Find corresponding eSTS requests:** -```kql -// In ESTS database -AllPerRequestTable -| where env_time >= ago(7d) -| where DevicePlatformForUI == "Android" -| where CorrelationId in ("correlation-id-1", "correlation-id-2", ...) // From step 1 -| project env_time, CorrelationId, Call, Result, ErrorCode, PrtData, ResponseTime -``` - -### 14.6 Query Optimization Tips - -1. **Always filter by time first** - Use `| where env_time >= ago(Xd)` at the start -2. 
**Filter by Android platform early** - Add `| where DevicePlatformForUI == "Android"` immediately after time filter -3. **Use `take` for exploration** - Limit results with `| take 1000` to avoid timeouts -4. **Project early** - Select only needed columns with `| project` to reduce data transfer -5. **Check field population** - Use `isnotempty()` before parsing fields like `PrtData` +## 12. Specialized Skills Reference -### 14.7 Important Notes +For complex investigation tasks, use these skills (read the skill file for detailed instructions): -- **Data Retention:** Check with eSTS team for current retention policy -- **Sensitive Data:** Many fields are hashed or scrubbed for privacy (e.g., PUID, UserNameHash) -- **Cross-Cluster Queries:** Cannot directly join Android telemetry cluster with eSTS cluster; must correlate via CorrelationId manually -- **PrtData Parsing:** `PrtData` is a JSON string; use `parse_json()` or `extend` to extract specific fields -- **MCP Server:** Use `mcp_my-mcp-server_execute_query` with cluster URL `https://estswus2.kusto.windows.net` and database `ESTS` +| Skill | Location | Triggers | +|-------|----------|----------| +| **codebase-researcher** | `.github/skills/codebase-researcher/SKILL.md` | "where is X implemented", "how does Y work", "trace the flow of", data flow investigation | +| **incident-investigator** | `.github/skills/incident-investigator/SKILL.md` | IcM incidents, customer-reported issues, authentication failures | +| **kusto-analyst** | `.github/skills/kusto-analyst/SKILL.md` | "query Kusto", "analyze telemetry", "check android_spans", eSTS correlation, latency investigation | --- \ No newline at end of file diff --git a/.github/skills/codebase-researcher/SKILL.md b/.github/skills/codebase-researcher/SKILL.md new file mode 100644 index 00000000..0e84adf7 --- /dev/null +++ b/.github/skills/codebase-researcher/SKILL.md @@ -0,0 +1,287 @@ +--- +name: codebase-researcher +description: Systematically explore unfamiliar codebases 
to find implementations, patterns, and architecture. Use this skill when asked to find code, locate implementations, understand how features work, trace call flows, or explore project structure. Triggers include "where is X implemented", "find the code for Y", "how does Z work in this repo", "trace the flow of", "show me the implementation of", or any request requiring codebase exploration with evidence-based findings.
+---
+
+# Codebase Researcher
+
+Systematically explore this multi-repo Android authentication codebase and report evidence-based findings.
+
+## Repository Structure
+
+This workspace contains multiple sub-repositories:
+
+| Module | Purpose | Key Paths |
+|--------|---------|-----------|
+| **MSAL** | Client authentication library | `msal/msal/src/main/java/com/microsoft/identity/client/` |
+| **Broker** | Brokered authentication service | `broker/AADAuthenticator/src/main/java/com/microsoft/identity/broker/` |
+| **Common** | Shared utilities + IPC logic | `common/common/src/main/java/com/microsoft/identity/common/` |
+| **ADAL** | Legacy auth library | `adal/adal/src/main/java/com/microsoft/aad/adal/` |
+| **OneAuth** | First-party (1P) apps library (external) | `oneauth/` |
+
+**⚠️ CRITICAL: Always search across ALL repositories.** Code is often duplicated or shared.
+
+## Authentication Flow
+
+```
+Client App → MSAL/OneAuth → Common (IPC) → Broker → eSTS → Broker → Common → MSAL/OneAuth → Client App
+```
+
+## Core Principles
+
+1. **Never guess** - Only report what is actually found in the repo
+2. **Always cite sources** - Every finding must include file path and line numbers
+3. **Acknowledge gaps** - Explicitly state when something cannot be found
+4. **Rate confidence** - Assign HIGH/MEDIUM/LOW to each finding
+5. 
**Search all modules** - Check MSAL, Broker, Common, ADAL for each query + +## Research Workflow + +### Step 1: Understand the Target + +Clarify what to find: +- Feature/concept name +- Which layer (client MSAL, IPC Common, or Broker) +- Expected patterns (class names, function signatures) + +### Step 2: Search Strategy (Multi-Repo) + +Execute searches in this order, **always searching across all modules**: + +1. **Semantic search** - Start with natural language query +2. **Grep search** - Exact patterns, class names, error codes +3. **File search** - Find by naming convention (e.g., `**/*Operation*.kt`) +4. **Directory exploration** - List relevant directories in each module +5. **Read files** - Confirm findings with actual code + +### Step 3: Validate Findings + +For each potential finding: +- Read the actual code (don't rely only on search snippets) +- Identify which module it belongs to (MSAL/Broker/Common) +- Note the exact location (file + line range) +- Assess confidence level + +### Step 4: Report Results + +Use the output format below. + +## Output Format + +```markdown +## Research: [Topic] + +### Findings + +#### Finding 1: [Brief description] +- **Module**: MSAL | Broker | Common | ADAL +- **File**: [path/to/file.ext](path/to/file.ext#L10-L25) +- **Confidence**: HIGH | MEDIUM | LOW +- **Evidence**: [What makes this the right code] + +[Code snippet if helpful] + +#### Finding 2: ... + +### Not Found + +- [Thing that was searched for but not located] +- Search attempts: [what was tried] + +### Suggested Next Steps + +- [Additional areas to explore] +- [Related code that might be relevant] +``` + +## Confidence Levels + +| Level | Criteria | +|-------|----------| +| **HIGH** | Exact match found. Code clearly implements the requested feature. Function/class names match. | +| **MEDIUM** | Likely match. Code appears related but naming differs, or implementation is partial. | +| **LOW** | Possible match. Found tangentially related code, or inference required. 
| + +## Key Classes by Domain + +### Authentication Entry Points +- `PublicClientApplication` (MSAL) - Client-facing API +- `BrokerMsalController` (Common) - Routes requests to Broker via IPC +- `LocalMSALController` (Common) - Non-brokered auth fallback + +### Broker Operations +- `AcquireTokenSilentMsalBrokerOperation` (Broker) - Silent token flow +- `GetIntentForInteractiveRequestMsalBrokerOperation` (Broker) - Interactive flow +- `IBrokerOperation` (Broker) - Operation interface + +### IPC & Results +- `BrokerOperationExecutor` (Common) - Executes broker operations +- `MsalBrokerResultAdapter` (Common) - Converts results for IPC +- `BrokerResult` (Common) - IPC response object + +## Data Flow Investigation + +When asked questions about **what data is returned**, **how data flows**, or **what happens to data**, follow this systematic investigation approach. + +### Complete Flow Investigation Strategy + +**Trigger phrases:** "Is X returned to Y?", "Does Y receive Z?", "What happens to [field]?", "How does [data] flow?" + +**Step-by-step Process:** + +1. **Find the Data Structure** (e.g., `BrokerResult`, `TokenResponse`) + - Confirm the field exists in the data class + - Check serialization annotations (`@SerializedName`) + +2. **Find the Construction/Population Code** ⚠️ **CRITICAL - Don't skip this!** + - Search for `Builder` or factory methods that create the object + - Search for where the field is actually set (e.g., `.refreshToken(`) + - Look in adapter/converter classes (e.g., `*ResultAdapter`, `*Converter`) + +3. **Check for Conditional Logic** ⚠️ **CRITICAL - Don't skip this!** + - Search for `if` statements around the field assignment + - Look for account type checks (e.g., `MSA_MEGA_TENANT_ID`, `accountType`) + - Look for protocol version checks + - Look for flight/feature flag checks (`CommonFlightsManager`, `isFlightEnabled`) + +4. 
**Trace the Complete Flow** + - Follow from entry point → IPC → processing → response construction → IPC → return + - Verify no filtering/scrubbing happens in any layer + +### Key Classes for Flow Investigation + +**Response Construction & Adaptation:** +- `MsalBrokerResultAdapter` (Common) - Converts authentication results to BrokerResult for IPC +- `AdalBrokerResultAdapter` (Common) - ADAL version of result adapter +- `BrokerResult` (Common) - The IPC response object sent to MSAL/OneAuth +- `BrokerResultAdapter` - Generic adapter interfaces + +**Account Type Detection:** +- Check for `MSA_MEGA_TENANT_ID` constant (`"9188040d-6c67-4c5b-b112-36a304b66dad"`) +- Check for `CONSUMERS` constant in authorities +- Look for `accountType` or `realm` field checks + +### Flow Investigation Pitfalls + +❌ **DON'T** stop after finding a field definition - this only confirms structure, not behavior +❌ **DON'T** assume data flows unchanged - always check for filtering/transformation logic +❌ **DON'T** ignore protocol version checks - behavior often changes based on negotiated version +❌ **DON'T** forget to check flight flags - features are often gated behind flights + +✅ **DO** search for Builder usage and construction patterns +✅ **DO** search for the field name in assignment context (e.g., `.setField(`, `.field(`) +✅ **DO** look for `Adapter` or `Converter` classes in the flow +✅ **DO** check for conditional logic based on account type, protocol version, or flights + +### Flow Investigation Search Patterns + +**Finding construction code:** +``` +grep_search: "new BrokerResult.Builder" or ".refreshToken(" +file_search: *Adapter.java, *Converter.java, *Builder.java +``` + +**Finding conditional logic:** +``` +grep_search: "if.*accountType" or "if.*MSA" or "shouldRemove" or "shouldInclude" +``` + +**Finding flight checks:** +``` +grep_search: "isFlightEnabled" or "CommonFlight." 
or "FlightsManager" +``` + +### Example Flow Investigation + +**Question:** "Is refresh token returned to OneAuth?" + +**Process:** +1. ✅ Find `BrokerResult` class → Confirm `mRefreshToken` field exists +2. ✅ Search for `BrokerResult.Builder` usage → Find `MsalBrokerResultAdapter` +3. ✅ Read `buildBrokerResultFromAuthenticationResult()` method → Find conditional logic +4. ✅ Check `shouldRemoveRefreshTokenFromResult()` → Discover: + - Flight check: `STOP_RETURNING_AAD_RT_BACK_TO_CALLING_APP` + - Protocol version check: `>= 16.0` + - Account type check: Remove for AAD, keep for MSA +5. ✅ **Complete Answer:** RT is conditionally returned based on account type, flight, and protocol version + +## Search Patterns for This Repo + +### Finding Broker Operations +``` +file_search: **/broker/**/operation/**/*.kt +grep_search: class.*BrokerOperation|IBrokerOperation +``` + +### Finding Controllers +``` +file_search: **/controllers/**/*.java +grep_search: class.*Controller|BrokerMsalController +``` + +### Finding Error Handling +``` +grep_search: ErrorStrings|ClientException|error_code +file_search: **/exception/**/*.java +``` + +### Finding Feature Flags +``` +grep_search: isFlightEnabled|CommonFlight|FlightsManager +grep_search: ExperimentationFeatureFlag +``` + +### Finding IPC Logic +``` +grep_search: BrokerOperationBundle|IIpcStrategy|BoundServiceStrategy +file_search: **/broker/ipc/**/*.java +``` + +## Anti-Patterns to Avoid + +| Anti-Pattern | Problem | Correct Approach | +|--------------|---------|------------------| +| Searching only one module | Miss cross-module code | Search MSAL, Broker, Common, ADAL | +| "This is likely in..." 
| Speculation without evidence | Search first, report only what's found | +| Path without line numbers | Imprecise, hard to verify | Always include line numbers | +| Stopping at field definition | Misses conditional logic | Trace to Builder/Adapter for full behavior | +| Ignoring protocol versions | Behavior changes by version | Check for version conditionals | + +## Example: Token Caching Research + +**Request**: "Where is token caching implemented?" + +**Process**: +1. `semantic_search("token cache implementation")` → Found `TokenCacheAccessor` +2. `grep_search("TokenCacheAccessor")` → Found in common + msal +3. `file_search("**/cache/**/*.java")` → Found cache directories +4. `read_file` on matches → Confirmed locations + +**Output**: +```markdown +## Research: Token Caching Implementation + +### Findings + +#### Finding 1: MsalOAuth2TokenCache - MSAL cache implementation +- **Module**: Common +- **File**: [common/common/src/.../cache/MsalOAuth2TokenCache.java](common/common/src/.../cache/MsalOAuth2TokenCache.java#L45-L120) +- **Confidence**: HIGH +- **Evidence**: Core cache class with `save()`, `load()`, `remove()` methods + +#### Finding 2: SharedPreferencesAccountCredentialCache - Persistence +- **Module**: Common +- **File**: [common/common/src/.../cache/SharedPreferencesAccountCredentialCache.java](...) 
+- **Confidence**: HIGH +- **Evidence**: SharedPreferences-based storage for credentials + +### Not Found + +- Distributed/remote caching +- Search attempts: grep "Redis", "remote.*cache", "distributed" + +### Suggested Next Steps + +- Check `BrokerOAuth2TokenCache` for broker-specific caching +- Review cache encryption in `AndroidAuthSdkStorageEncryptionManager` +``` diff --git a/.github/skills/codebase-researcher/references/search-patterns.md b/.github/skills/codebase-researcher/references/search-patterns.md new file mode 100644 index 00000000..fe80dc17 --- /dev/null +++ b/.github/skills/codebase-researcher/references/search-patterns.md @@ -0,0 +1,273 @@ +# Search Patterns for Android Auth Repo + +Reference for effective search patterns in this multi-repo authentication codebase. + +## Table of Contents + +1. [Module-Specific Paths](#module-specific-paths) +2. [Authentication Flow Patterns](#authentication-flow-patterns) +3. [Common Search Patterns](#common-search-patterns) +4. [Data Flow Investigation](#data-flow-investigation) +5. 
[Error & Exception Patterns](#error--exception-patterns) + +--- + +## Module-Specific Paths + +### MSAL (Client Library) +``` +Base: msal/msal/src/main/java/com/microsoft/identity/client/ + +Key directories: +- client/ # Public API classes +- client/internal/ # Internal implementation +- client/exception/ # MSAL exceptions +- client/configuration/ # Configuration classes +``` + +### Broker (Authentication Service) +``` +Base: broker/AADAuthenticator/src/main/java/com/microsoft/identity/broker/ + +Key directories: +- operation/msal/ # MSAL broker operations +- operation/adal/ # ADAL broker operations +- operation/brokerapi/ # Broker API operations +- accountmanager/ # Account management +- flighting/ # Feature flags +``` + +### Common (Shared + IPC) +``` +Base: common/common/src/main/java/com/microsoft/identity/common/ + +Key directories: +- internal/controllers/ # BrokerMsalController, LocalMSALController +- internal/broker/ # IPC, BrokerResult, adapters +- internal/cache/ # Token caching +- internal/commands/ # Command pattern implementations +- exception/ # Shared exceptions +``` + +### ADAL (Legacy) +``` +Base: adal/adal/src/main/java/com/microsoft/aad/adal/ + +Key directories: +- AcquireTokenRequest.java # Token acquisition +- AuthenticationContext.java # Main entry point +``` + +--- + +## Authentication Flow Patterns + +### Entry Points (Where requests start) + +| Looking for | Search pattern | Module | +|-------------|---------------|--------| +| MSAL public API | `class PublicClientApplication` | MSAL | +| Acquire token | `acquireToken\|AcquireToken` | MSAL, Common | +| Silent flow | `AcquireTokenSilent\|acquireTokenSilent` | All | +| Interactive flow | `AcquireTokenInteractive\|getIntentForInteractive` | All | + +### IPC Layer (Common → Broker) + +| Looking for | Search pattern | Module | +|-------------|---------------|--------| +| Broker controller | `BrokerMsalController\|BrokerOperationExecutor` | Common | +| IPC strategies | 
`IIpcStrategy\|BoundServiceStrategy\|AccountManager` | Common | +| Operation bundles | `BrokerOperationBundle` | Common | +| Result adapters | `MsalBrokerResultAdapter\|BrokerResult` | Common | + +### Broker Operations (Request handling) + +| Looking for | Search pattern | Module | +|-------------|---------------|--------| +| All operations | `file: **/operation/**/*.kt` | Broker | +| Silent operation | `AcquireTokenSilentMsalBrokerOperation` | Broker | +| Interactive operation | `GetIntentForInteractiveRequest` | Broker | +| Account operations | `GetAccountsMsalBrokerOperation` | Broker | + +--- + +## Common Search Patterns + +### Finding Classes by Type + +```bash +# Controllers +grep: class.*Controller +file: **/controllers/**/*.java + +# Operations (Broker) +grep: class.*Operation|IBrokerOperation +file: **/operation/**/*.kt + +# Adapters +grep: class.*Adapter|ResultAdapter +file: **/result/**/*.java + +# Exceptions +grep: class.*Exception +file: **/exception/**/*.java +``` + +### Finding Feature Flags + +```bash +# Flight checks +grep: isFlightEnabled|CommonFlight|FlightsManager + +# Experimentation flags +grep: ExperimentationFeatureFlag + +# Feature conditionals +grep: if.*flight|when.*flight +``` + +### Finding Error Handling + +```bash +# Error code definitions +grep: ErrorStrings|error_code|ERROR_CODE + +# Exception throwing +grep: throw.*Exception|ClientException|ServiceException + +# Error messages +grep: errorMessage|error_message +``` + +### Finding Logging + +```bash +# Logger usage +grep: Logger\.(v|d|i|w|e)|TAG\s*= + +# Method tracing +grep: methodName\s*=|methodTag +``` + +--- + +## Data Flow Investigation + +### Step 1: Find the Data Structure + +```bash +# Find class definition +grep: class.*BrokerResult|data class.*Result + +# Find field +grep: mRefreshToken|refreshToken|private.*token +``` + +### Step 2: Find Construction Code + +```bash +# Builder usage +grep: BrokerResult\.Builder|\.build\(\) + +# Factory methods +grep: 
create.*Result|build.*Result + +# Field assignment +grep: \.refreshToken\(|\.setRefreshToken\( +``` + +### Step 3: Check Conditional Logic + +```bash +# Account type checks +grep: MSA_MEGA_TENANT_ID|accountType.*==|isMsa\( + +# Protocol version checks +grep: negotiatedProtocolVersion|PROTOCOL_VERSION|>= 16 + +# Flight checks +grep: shouldRemove|shouldInclude|isEnabled +``` + +### Key Adapter Classes + +| Class | Purpose | Location | +|-------|---------|----------| +| `MsalBrokerResultAdapter` | Builds BrokerResult from auth results | Common | +| `AdalBrokerResultAdapter` | ADAL version of result adapter | Common | +| `TokenResponseAdapter` | Token response conversion | Common | +| `AccountAdapter` | Account data conversion | MSAL, Common | + +--- + +## Error & Exception Patterns + +### Exception Hierarchy + +```bash +# Base exceptions +grep: BaseException|ClientException|ServiceException + +# UI required +grep: UiRequiredException|MsalUiRequiredException + +# Broker exceptions +grep: BrokerCommunicationException +``` + +### Error Code Constants + +```bash +# In Common +file: **/ErrorStrings.java +grep: ERROR_CODE|public static final String + +# Error string definitions +grep: INVALID_GRANT|AUTH_FAILED|DEVICE_ +``` + +### Error Code Usage + +```bash +# Throwing with error code +grep: ClientException\(.*ERROR + +# Checking error codes +grep: getErrorCode\(\)|errorCode.*== +``` + +--- + +## Quick Reference: Multi-Module Search + +When searching for a concept, **always check all modules**: + +```bash +# Example: Finding "refresh token" handling + +# 1. MSAL layer +grep: refreshToken +includePattern: msal/**/*.java + +# 2. Common/IPC layer +grep: refreshToken|mRefreshToken +includePattern: common/**/*.java + +# 3. Broker layer +grep: refreshToken +includePattern: broker/**/*.kt + +# 4. ADAL layer +grep: refreshToken +includePattern: adal/**/*.java +``` + +### Cross-Module Tracing + +For complete flow tracing: +1. Start at MSAL (`PublicClientApplication`) +2. 
Follow to Common (`BrokerMsalController`) +3. Trace IPC (`BrokerOperationBundle`) +4. Find Broker handler (`*BrokerOperation.kt`) +5. Check result adapter (`*ResultAdapter`) +6. Trace return path back through Common to MSAL diff --git a/.github/skills/incident-investigator/SKILL.md b/.github/skills/incident-investigator/SKILL.md new file mode 100644 index 00000000..30530159 --- /dev/null +++ b/.github/skills/incident-investigator/SKILL.md @@ -0,0 +1,222 @@ +--- +name: incident-investigator +description: Systematically investigate IcM incidents and customer-reported authentication issues for Android Broker/MSAL. Use this skill when asked to investigate an incident, troubleshoot auth failures, analyze customer logs, diagnose PRT/SSO issues, or review IcM tickets. Triggers include "investigate incident", "troubleshoot IcM", "analyze these logs", "what's wrong with this auth flow", "diagnose this issue", or any request involving incident investigation with evidence-based diagnosis. +--- + +# Incident Investigator + +Investigate Android authentication incidents systematically with evidence-first diagnosis. + +## Investigation Workflow + +Execute these steps IN ORDER. Do not skip steps. + +### Step 1: Gather IcM Context + +Query DRI Copilot MCP FIRST: + +``` +mcp_dricopilotdem_Broker_DRI_Copilot_Project_Explorer +``` + +Extract from IcM: +- **Affected app(s)**: Outlook, Teams, other 1P apps? +- **Account(s)**: Specific user or tenant-wide? +- **Device context**: SDM enabled? Device model? Android version? +- **Symptoms**: What exactly fails? Error messages? +- **Repro conditions**: When does it happen vs. not happen? 
+ +### Step 2: Extract Log Evidence + +Search logs for these key patterns: + +| Pattern | What It Tells You | +|---------|-------------------| +| `correlation_id:` | Request tracking ID for eSTS correlation | +| `error_code` or `Error` | Specific failure reason | +| `No PRT present` | Missing Primary Refresh Token | +| `SignOut` or `removeAccount` | Account removal events | +| `disabled by MDM` | MDM policy interference | +| `invoked for package name:` | Which app made the request | +| `executed successfully` vs `failed` | Operation outcome | + +Build a timeline of events with correlation IDs. + +### Step 3: Analyze Account/Token State + +Check these indicators in logs: + +| Log Message | Indicates | +|-------------|-----------| +| `Found [N] Accounts...` | How many accounts in cache | +| `No PRT present for the account` | PRT missing or wiped | +| `Home Account id doesn't have uid or tenant id` | Incomplete account state | +| `Found more than one account entry` | Duplicate account issue | +| `PRT is already registered-device PRT` | Valid WPJ PRT exists | +| `Loading Workplace Join entry for tenant:` | Device is WPJ'd | + +### Step 4: Identify Operation Flow + +Map the operations that occurred: + +| Operation | Purpose | +|-----------|---------| +| `GetDeviceModeMsalBrokerOperation` | Check if SDM enabled | +| `GetCurrentAccountMsalBrokerOperation` | Fetch signed-in account | +| `AcquireTokenSilentMsalBrokerOperation` | Silent token acquisition | +| `AcquireTokenInteractiveMsalBrokerOperation` | Interactive auth | +| `SignOutFromSharedDeviceMsalBrokerOperation` | SDM sign-out (⚠️ key for SDM issues) | +| `GetPreferredAuthMethodMsalBrokerOperation` | Auth method check | + +### Step 5: Form Hypotheses + +Rank by evidence strength: + +| Confidence | Criteria | +|------------|----------| +| **HIGH** | Direct log evidence shows the issue | +| **MEDIUM** | Logs suggest but don't confirm | +| **LOW** | Inference based on patterns, no direct evidence | + +Common root 
causes to consider: +- MDM triggering sign-out (Imprivata, other MDMs) +- PRT deleted/expired/revoked +- Device cap reached +- Account-specific CA policy +- SDM misconfiguration +- Broker/app version incompatibility + +### Step 6: Identify Missing Evidence + +State explicitly what's NOT in the logs that would help: +- Missing correlation IDs? +- No sign-out operation captured? +- No eSTS error codes? +- Logs from wrong time window? + +## Output Format + +```markdown +## Investigation: IcM [Number] + +### IcM Summary +| Field | Value | +|-------|-------| +| Affected App(s) | | +| Account | | +| Device | Android [version], Broker [version] | +| SDM Enabled | Yes/No | +| Symptoms | | + +### Key Correlation IDs +| Correlation ID | Operation | Result | +|----------------|-----------|--------| +| `abc-123...` | AcquireTokenSilent | ✅/❌ | + +### Evidence from Logs + +#### Finding 1: [Description] +- **Timestamp**: +- **Evidence**: [Exact log line] +- **Implication**: + +### Hypotheses (Ranked by Evidence) + +| # | Hypothesis | Confidence | Supporting Evidence | +|---|------------|------------|---------------------| +| 1 | | HIGH/MED/LOW | | + +### Missing Evidence +- [ ] [What additional data is needed] + +### Recommended Actions +1. [Next step] +2. 
[Next step] +``` + +## Common Patterns + +### Pattern: MDM-Triggered Sign-Out (SDM) +**Symptoms**: User signs in, immediately signed out +**Evidence to look for**: +- `SignOutFromSharedDeviceMsalBrokerOperation` from MDM package +- `disabled by MDM` messages +- `No PRT present` after successful auth + +### Pattern: Missing PRT +**Symptoms**: Silent auth fails, interactive works +**Evidence to look for**: +- `No PRT present for the account` +- Check if `AcquireTokenSilent` fails but `AcquireTokenInteractive` succeeds +- Look for prior sign-out or PRT revocation + +### Pattern: Device Cap +**Symptoms**: New device can't register +**Evidence to look for**: +- Error during device registration +- eSTS error about device limit +- Check eSTS logs with correlation ID + +### Pattern: Duplicate Accounts +**Symptoms**: Inconsistent auth behavior +**Evidence to look for**: +- `Found more than one account entry for user` +- Multiple accounts with same UPN but different home account IDs + +## DRI Copilot Queries + +### Initial Query (always start here) + +When given just an incident ID, query DRI Copilot with: + +``` +"Investigate IcM [number]. What are the affected apps, symptoms, and known issues?" 
+``` + +This single query extracts: +- Affected application(s) +- Customer-reported symptoms +- Account/device context +- Any known root cause or past similar incidents + +### Follow-up Queries (after initial context) + +Once you have context from the initial query, use targeted follow-ups: + +``` +"TSG for error code [error_code]" # After finding error in logs +"Past incidents related to [symptom]" # After identifying symptom from IcM +"How to troubleshoot [specific_issue]" # For deep-dive guidance +``` + +## eSTS Correlation + +Use the Kusto MCP tool to correlate with eSTS when needed: + +``` +mcp_my-mcp-server_execute_query +``` + +**Parameters:** +- **cluster**: `https://estswus2.kusto.windows.net` +- **database**: `ESTS` +- **query**: (see below) + +**Basic correlation query:** +```kql +AllPerRequestTable +| where env_time >= ago(7d) +| where DevicePlatformForUI == "Android" +| where CorrelationId == "[correlation-id]" +| project env_time, CorrelationId, Call, Result, ErrorCode, PrtData +``` + +For more Kusto queries, see [references/kusto-queries.md](references/kusto-queries.md). + +## Key Reminders + +1. **Query DRI Copilot FIRST** - Get IcM context before analyzing logs +2. **Evidence over assumptions** - Only state what logs show +3. **State what's missing** - Be explicit about evidence gaps +4. **Search all log files** - Issue may span multiple log segments +5. **Check for sign-out operations** - Critical for SDM issues diff --git a/.github/skills/incident-investigator/references/error-patterns.md b/.github/skills/incident-investigator/references/error-patterns.md new file mode 100644 index 00000000..87962221 --- /dev/null +++ b/.github/skills/incident-investigator/references/error-patterns.md @@ -0,0 +1,120 @@ +# Common Error Codes and Patterns + +Reference for common authentication error codes encountered in Android Broker/MSAL. 
+ +## Broker Error Codes + +| Error Code | Description | Common Cause | TSG | +|------------|-------------|--------------|-----| +| `auth_cancelled_by_sdk` | SDK cancelled the authentication | CA policy, interrupt flow, timeout | Check for interrupt flow, CA requirements | +| `device_needs_to_be_managed` | Device management required | Intune enrollment required | Verify device compliance | +| `no_account` | No account found in cache | Account removed or never signed in | Check for sign-out operations | +| `invalid_grant` | Token grant is invalid | PRT revoked, password changed | Re-authenticate interactively | +| `interaction_required` | User interaction needed | MFA, consent, password expired | Trigger interactive auth | + +## PRT-Related Issues + +### PRT is Null or Missing + +**Log Pattern:** +``` +No PRT present for the account +``` + +**Possible Causes:** +1. User signed out (check for `SignOutFromSharedDeviceMsalBrokerOperation`) +2. PRT expired (14-day sliding window) +3. Password changed / account revoked +4. MDM policy enforcement +5. Device registration lost + +**Investigation Steps:** +1. Look for sign-out operations in logs +2. Check eSTS for PRT revocation +3. Verify device is still WPJ'd (`Loading Workplace Join entry`) + +### PRT Revocation by eSTS + +Check eSTS with: +```kql +AllPerRequestTable +| where CorrelationId == "[id]" +| where Call contains "prt" +| project env_time, Result, ErrorCode, SubErrorCode +``` + +## MDM-Related Issues + +### Account Type Disabled by MDM + +**Log Pattern:** +``` +Account type com.microsoft.authapppassthroughbackup is disabled by MDM +``` + +**Meaning:** MDM has disabled the passthrough backup account type. This is informational and may not directly cause auth failures, but indicates MDM is actively managing the device. 
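The PRT and MDM indicators above are plain substrings in broker logs, so a first investigation pass can be automated. This is a minimal illustrative sketch, not part of any broker or tooling codebase; the pattern strings are taken from this reference, and the helper name is hypothetical:

```python
import re

# Indicator patterns taken from the log-pattern tables in this reference.
# The helper itself is hypothetical, for illustration only.
PRT_PATTERNS = {
    "prt_missing": re.compile(r"No PRT present for the account"),
    "sign_out": re.compile(r"SignOutFromSharedDeviceMsalBrokerOperation"),
    "wpj_present": re.compile(r"Loading Workplace Join entry for tenant"),
    "mdm_disabled": re.compile(r"disabled by MDM"),
}

def scan_prt_indicators(log_lines):
    """Return (line number, indicator name, line text) for each match."""
    hits = []
    for lineno, line in enumerate(log_lines, start=1):
        for name, pattern in PRT_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name, line.strip()))
    return hits

# A sign-out followed by a missing PRT is the classic MDM-triggered wipe shape.
sample = [
    "10:01 SignOutFromSharedDeviceMsalBrokerOperation is invoked for package name: com.example.mdm",
    "10:02 No PRT present for the account",
]
for lineno, name, _ in scan_prt_indicators(sample):
    print(lineno, name)
```

A scan like this only surfaces candidates; always read the surrounding log lines before drawing conclusions, since several of these messages (e.g. `disabled by MDM`) are informational on their own.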
+ +### MDM Sign-Out in SDM + +**Log Pattern:** +``` +SignOutFromSharedDeviceMsalBrokerOperation is invoked for package name: [mdm-package] +``` + +**Impact:** MDM is triggering global sign-out, which removes all accounts and PRTs. + +## Account State Issues + +### Home Account ID Missing UID + +**Log Pattern:** +``` +Home Account id doesn't have uid or tenant id information, returning null +``` + +**Meaning:** Account record is incomplete. May indicate: +- Corrupted account state +- Partial sign-in that didn't complete +- Migration issue from older broker version + +### Multiple Account Entries + +**Log Pattern:** +``` +Found more than one account entry for user in appSpecificRecords +``` + +**Meaning:** Duplicate account records exist. May cause inconsistent behavior. + +## SDM-Specific Issues + +### SDM Sign-Out Timing + +In Shared Device Mode, any app can trigger global sign-out. Common pattern: +1. User signs into Outlook +2. MDM app immediately triggers `SignOutFromSharedDeviceMsalBrokerOperation` +3. All accounts/PRTs are wiped +4. User appears signed out + +**Investigation:** Look for sign-out operations from non-Microsoft packages. + +### SDM Account Count + +**Log Pattern:** +``` +Found [0] Accounts... +``` + +In SDM, there should be 0 (signed out) or 1 (signed in) account. Multiple accounts suggest misconfiguration. 
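The SDM sign-out timing pattern above (sign-in, then a global sign-out moments later) can be sketched as a simple timeline check. This is an illustrative sketch only: the event tuple shape `(timestamp, operation, package)` is an assumption for the example, and the helper is hypothetical, not real broker tooling:

```python
from datetime import datetime, timedelta

def find_rapid_signouts(events, window=timedelta(minutes=5)):
    """Pair each SignOutFromSharedDeviceMsalBrokerOperation with the most
    recent interactive sign-in, if the gap is within `window`.

    `events` is an assumed shape: (timestamp, operation_name, calling_package).
    """
    suspicious = []
    last_sign_in = None
    for ts, op, pkg in events:
        if op == "AcquireTokenInteractiveMsalBrokerOperation":
            last_sign_in = (ts, op, pkg)
        elif op == "SignOutFromSharedDeviceMsalBrokerOperation" and last_sign_in:
            if ts - last_sign_in[0] <= window:
                suspicious.append((last_sign_in, (ts, op, pkg)))
    return suspicious

# Example timeline: Outlook sign-in at 10:00, MDM-triggered wipe at 10:01.
events = [
    (datetime(2024, 1, 1, 10, 0), "AcquireTokenInteractiveMsalBrokerOperation",
     "com.microsoft.office.outlook"),
    (datetime(2024, 1, 1, 10, 1), "SignOutFromSharedDeviceMsalBrokerOperation",
     "com.example.mdm"),
]
pairs = find_rapid_signouts(events)
```

Each returned pair names both the sign-in and the package that triggered the wipe, which is exactly the evidence the SDM investigation pattern asks for: a sign-out from a non-Microsoft package shortly after a successful sign-in.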
+ +## eSTS Error Codes + +Common eSTS errors visible in broker logs or eSTS telemetry: + +| Error | SubError | Meaning | +|-------|----------|---------| +| `invalid_grant` | `bad_token` | Token is invalid/revoked | +| `invalid_grant` | `token_expired` | Token has expired | +| `interaction_required` | `consent_required` | User needs to consent | +| `interaction_required` | `login_required` | Fresh login needed | +| `access_denied` | `policy_violation` | CA policy blocked access | diff --git a/.github/skills/incident-investigator/references/kusto-queries.md b/.github/skills/incident-investigator/references/kusto-queries.md new file mode 100644 index 00000000..3462cb23 --- /dev/null +++ b/.github/skills/incident-investigator/references/kusto-queries.md @@ -0,0 +1,145 @@ +# Kusto Queries for Incident Investigation + +Reference queries for investigating Android authentication issues in telemetry. + +## Android Broker Telemetry + +**Cluster:** `https://idsharedeus2.kusto.windows.net/` +**Database:** `ad-accounts-android-otel` +**Table:** `android_spans` +**Retention:** 30 days + +### Find Spans by Correlation ID + +```kql +android_spans +| where EventInfo_Time >= ago(7d) +| where correlation_id == "[correlation-id]" +| project EventInfo_Time, span_name, error_code, error_message, span_status +| order by EventInfo_Time asc +``` + +### Find Errors for Device + +```kql +android_spans +| where EventInfo_Time >= ago(7d) +| where DeviceInfo_Id == "[device-id]" +| where isnotempty(error_code) +| summarize count() by error_code, span_name +| order by count_ desc +``` + +### AcquireTokenSilent Failures + +```kql +android_spans +| where EventInfo_Time >= ago(7d) +| where span_name == "AcquireTokenSilent" +| where span_status != "OK" +| summarize + count = count(), + by error_code, calling_package_name +| order by count desc +``` + +### AcquireTokenInteractive with Errors + +```kql +android_spans +| where EventInfo_Time >= ago(7d) +| where span_name == "AcquireTokenInteractive" +| 
where isnotempty(error_code) +| project EventInfo_Time, correlation_id, error_code, error_message, calling_package_name +| order by EventInfo_Time desc +| take 100 +``` + +### Find Sign-Out Operations + +```kql +android_spans +| where EventInfo_Time >= ago(7d) +| where span_name contains "SignOut" +| project EventInfo_Time, span_name, correlation_id, calling_package_name +| order by EventInfo_Time desc +``` + +## eSTS Telemetry + +**Cluster:** `https://estswus2.kusto.windows.net/` +**Database:** `ESTS` +**Table:** `AllPerRequestTable` + +### Find Request by Correlation ID + +```kql +AllPerRequestTable +| where env_time >= ago(7d) +| where DevicePlatformForUI == "Android" +| where CorrelationId == "[correlation-id]" +| project env_time, CorrelationId, Call, Result, ErrorCode, HttpStatusCode, PrtData, ResponseTime +``` + +### Check PRT Usage + +```kql +AllPerRequestTable +| where env_time >= ago(7d) +| where DevicePlatformForUI == "Android" +| where CorrelationId == "[correlation-id]" +| extend HasPRT = isnotempty(PrtData) +| project env_time, Call, Result, HasPRT, ErrorCode +``` + +### Error Pattern by Application + +```kql +AllPerRequestTable +| where env_time >= ago(7d) +| where DevicePlatformForUI == "Android" +| where ApplicationId == "[app-client-id]" +| where Result != "Success" +| summarize count() by ErrorCode, Call +| order by count_ desc +``` + +### User/Device Request History + +```kql +AllPerRequestTable +| where env_time >= ago(7d) +| where DevicePlatformForUI == "Android" +| where DeviceId == "[device-id]" +| project env_time, Call, Result, ErrorCode, ApplicationId +| order by env_time desc +| take 100 +``` + +## Common Client IDs + +| App | Client ID | +|-----|-----------| +| Outlook | `27922004-5251-4030-b22d-91ecd9a37ea4` | +| Teams | `1fec8e78-bce4-4aaf-ab1b-5451cc387264` | +| Authenticator | `4813382a-8fa7-425e-ab75-3b753aab3abb` | +| Company Portal | `0000000a-0000-0000-c000-000000000000` | + +## Correlating Broker + eSTS + +1. 
Get correlation ID from broker logs +2. Query `android_spans` for broker-side view +3. Query `AllPerRequestTable` for eSTS-side view +4. Match timestamps and operations + +```kql +// Step 1: Broker side +android_spans +| where correlation_id == "[id]" +| project EventInfo_Time, span_name, error_code + +// Step 2: eSTS side +AllPerRequestTable +| where CorrelationId == "[id]" +| project env_time, Call, Result, ErrorCode +``` diff --git a/.github/skills/kusto-analyst/SKILL.md b/.github/skills/kusto-analyst/SKILL.md new file mode 100644 index 00000000..a0083f75 --- /dev/null +++ b/.github/skills/kusto-analyst/SKILL.md @@ -0,0 +1,364 @@ +--- +name: kusto-analyst +description: Analyze Android authentication telemetry using Azure Data Explorer (Kusto). Use this skill for querying android_spans, eSTS correlation, latency investigation, error analysis, and telemetry troubleshooting. Triggers include "query Kusto", "analyze telemetry", "check android_spans", "eSTS correlation", "latency investigation", "error patterns", or any request involving telemetry data analysis. +--- + +# Kusto Analyst + +Analyze Android authentication telemetry using Azure Data Explorer (Kusto) for error analysis, latency investigation, and cross-cluster correlation. 
+ +## Available MCP Tools + +**Always use these tools to execute Kusto queries:** +- `mcp_my-mcp-server_execute_query` - Execute Kusto queries +- `mcp_my-mcp-server_list_tables` - Discover available tables +- `mcp_my-mcp-server_get_table_schema` - Explore field schema + +--- + +## Android Telemetry Cluster + +### Cluster Information +| Property | Value | +|----------|-------| +| **Cluster URL** | `https://idsharedeus2.kusto.windows.net/` | +| **Production Database** | `ad-accounts-android-otel` | +| **Sandbox Database** | `android-broker-otel-sandbox` | + +### Primary Tables +| Table | Purpose | Retention | +|-------|---------|-----------| +| `android_spans` | Authentication telemetry spans | 30 days | +| `android_metrics` | Aggregated metrics data | 30 days | + +### Materialized Views +- **46 pre-aggregated views** for faster queries +- **Retention:** 90 days (longer than raw tables!) +- **Update frequency:** Hourly +- **Discover with:** `.show materialized-views` query +- **Categories:** Error Analysis, Silent/Interactive Auth, PRT Operations, Broker & Apps, Devices, Performance + +### User Intent Translation +| User Says | Span Name | +|-----------|-----------| +| "Interactive request" | `AcquireTokenInteractive` | +| "Silent request" | `AcquireTokenSilent` | +| "PRT operation" | Various PRT-related spans | + +--- + +## android_spans Key Fields + +### Span Identification +| Field | Description | +|-------|-------------| +| `span_id` | Unique identifier for the span | +| `parent_span_id` | Parent span ID for hierarchical relationships | +| `trace_id` | Trace ID linking related spans | +| `correlation_id` | Correlation ID for request tracking (use for eSTS correlation) | +| `span_name` | Operation name (e.g., "AcquireTokenInteractive") | + +### Error Information +| Field | Description | +|-------|-------------| +| `error_code` | Error code (e.g., "auth_cancelled_by_sdk") | +| `error_message` | Detailed error message | +| `span_status` | Status ("OK", "ERROR") | + 
+### Broker Information
+| Field | Description |
+|-------|-------------|
+| `active_broker_package_name` | Currently active broker package |
+| `current_broker_package_name` | Current broker package |
+| `calling_package_name` | Package that initiated the call |
+
+**Common Broker Packages:**
+- `com.microsoft.windowsintune.companyportal` - Company Portal
+- `com.azure.authenticator` - Microsoft Authenticator
+- `com.microsoft.appmanager` - Microsoft App Manager
+
+### Device & Timing
+| Field | Description |
+|-------|-------------|
+| `DeviceInfo_Id` | Unique device identifier |
+| `DeviceInfo_Model` | Device model (e.g., "Pixel 7 Pro") |
+| `EventInfo_Time` | Event timestamp (use `ago(Xd)` for filtering) |
+| `elapsed_time` | Total operation duration |
+
+---
+
+## Common Query Patterns
+
+### Discovery Queries
+
+**Find top span names:**
+```kql
+android_spans
+| where EventInfo_Time >= ago(7d)
+| summarize count() by span_name
+| order by count_ desc
+| take 30
+```
+
+**Find common error codes:**
+```kql
+android_spans
+| where EventInfo_Time >= ago(7d)
+| where isnotempty(error_code)
+| summarize count() by error_code
+| order by count_ desc
+| take 20
+```
+
+### Error Analysis
+
+**Error patterns for specific span:**
+```kql
+android_spans
+| where EventInfo_Time >= ago(7d)
+| where span_name == "AcquireTokenInteractive"
+| where isnotempty(error_code)
+| summarize error_count = count() by error_code, error_message
+| order by error_count desc
+```
+
+**Device-level error aggregation:**
+```kql
+android_spans
+| where EventInfo_Time >= ago(7d)
+| where isnotempty(error_code)
+| summarize
+    total_devices = dcount(DeviceInfo_Id),
+    error_count = count()
+    by error_code
+```
+
+### Company Portal Detection
+
+```kql
+android_spans
+| where EventInfo_Time >= ago(7d)
+| extend has_cp = iff(
+    active_broker_package_name contains "companyportal" or
+    calling_package_name contains "companyportal",
+    1, 0)
+| summarize
+    total = count(),
+    with_cp = countif(has_cp == 1)
+| extend cp_percentage = 
round(100.0 * with_cp / total, 2)
+```
+
+### Parent-Child Span Relationships
+
+```kql
+let parentSpans = android_spans
+| where EventInfo_Time >= ago(7d)
+| where span_name == "AcquireTokenInteractive"
+| project parent_span_id = span_id, trace_id;
+
+let childSpans = android_spans
+| where EventInfo_Time >= ago(7d)
+| where span_name == "ProcessWebCpRedirects"
+| project child_span_id = span_id, parent_span_id, trace_id;
+
+parentSpans
+// Join on the parent link (not just trace_id) to get true parent-child pairs
+| join kind=inner (childSpans) on trace_id, parent_span_id
+```
+
+---
+
+## Latency Investigation Workflow
+
+When investigating latency increases (e.g., AcquireTokenSilent), follow these steps:
+
+### Step 1: Identify the Increase
+
+```kql
+android_spans
+| where EventInfo_Time >= ago(7d)
+| where span_name == "AcquireTokenSilent"
+| summarize
+    p50 = percentile(elapsed_time, 50),
+    p90 = percentile(elapsed_time, 90),
+    p95 = percentile(elapsed_time, 95),
+    p99 = percentile(elapsed_time, 99)
+    by bin(EventInfo_Time, 1h)
+| order by EventInfo_Time desc
+```
+
+### Step 2: Find Culprit Dimensions
+
+```kql
+android_spans
+| where EventInfo_Time >= ago(3d)
+| where span_name == "AcquireTokenSilent"
+| summarize
+    request_count = count(),
+    p90_latency = percentile(elapsed_time, 90)
+    by active_broker_package_name, current_broker_package_name
+| order by p90_latency desc
+```
+
+### Step 3: Check Error Rate Correlation
+
+```kql
+android_spans
+| where EventInfo_Time >= ago(7d)
+| where span_name == "AcquireTokenSilent"
+| summarize
+    total = count(),
+    errors = countif(isnotempty(error_code)),
+    avg_latency = avg(elapsed_time)
+    by bin(EventInfo_Time, 1h)
+| extend error_rate = round(100.0 * errors / total, 2)
+| order by EventInfo_Time desc
+```
+
+### Step 4: Analyze Elapsed Time Breakdown
+
+```kql
+android_spans
+| where EventInfo_Time >= ago(3d)
+| where span_name == "AcquireTokenSilent"
+| where isnotempty(elapsed_time_cache_load) or isnotempty(elapsed_time_network_acquire_at)
+| summarize
+    avg_cache = avg(elapsed_time_cache_load),
+    avg_network 
= avg(elapsed_time_network_acquire_at),
+    avg_total = avg(elapsed_time)
+    by bin(EventInfo_Time, 1h)
+```
+
+---
+
+## MATS Telemetry
+
+### Cluster Information
+| Property | Value |
+|----------|-------|
+| **Cluster URL** | `https://idsharedeus2.kusto.windows.net/` |
+| **Database** | `MATS_Office` |
+| **Database ID** | `faab4ead691e451eb230afc98a28e0f2` |
+
+---
+
+## eSTS (Token Service) Cluster
+
+### Cluster Information
+| Property | Value |
+|----------|-------|
+| **Cluster URL** | `https://estswus2.kusto.windows.net/` |
+| **Database** | `ESTS` |
+| **Primary Table** | `AllPerRequestTable` (cross-cluster union view) |
+
+### Android-Specific Filtering
+
+**⚠️ ALWAYS filter by Android platform:**
+```kql
+AllPerRequestTable
+| where env_time >= ago(7d)
+| where DevicePlatformForUI == "Android"
+```
+
+### Key eSTS Fields
+
+| Category | Field | Description |
+|----------|-------|-------------|
+| **Request ID** | `CorrelationId` | Links to Android `correlation_id` |
+| | `RequestId` | Unique eSTS request ID |
+| | `env_time` | Request timestamp |
+| **Request Type** | `Call` | Auth call type (e.g., "token") |
+| | `IsInteractive` | User interaction required |
+| | `Prompt` | Prompt type ("none", "login") |
+| **Status** | `Result` | "Success" or "Failure" |
+| | `ErrorCode` | Error code if failed |
+| | `HttpStatusCode` | HTTP status |
+| **PRT** | `PrtData` | PRT-related data (JSON) |
+| **Device** | `DeviceId` | Device identifier |
+| | `ApplicationId` | Client app ID |
+| **User** | `TenantId` | Tenant ID |
+| | `UserPrincipalObjectID` | User's Entra ID object ID |
+| | `AccountType` | AAD, MSA, etc. 
|
+
+---
+
+## Cross-Cluster Correlation
+
+To trace a complete flow (Android → Broker → eSTS):
+
+### Step 1: Get correlation IDs from Android spans
+
+```kql
+// Run against: https://idsharedeus2.kusto.windows.net/ | ad-accounts-android-otel
+android_spans
+| where EventInfo_Time >= ago(7d)
+| where span_name == "AcquireTokenInteractive"
+| where error_code == "some_error"
+| project correlation_id, span_id, EventInfo_Time, error_code
+| take 100
+```
+
+### Step 2: Find corresponding eSTS requests
+
+```kql
+// Run against: https://estswus2.kusto.windows.net/ | ESTS
+AllPerRequestTable
+| where env_time >= ago(7d)
+| where DevicePlatformForUI == "Android"
+| where CorrelationId in ("correlation-id-1", "correlation-id-2")
+| project env_time, CorrelationId, Call, Result, ErrorCode, PrtData, ResponseTime
+```
+
+### eSTS Query Examples
+
+**Find requests by CorrelationId:**
+```kql
+AllPerRequestTable
+| where env_time >= ago(7d)
+| where DevicePlatformForUI == "Android"
+| where CorrelationId == "your-correlation-id-here"
+| project env_time, CorrelationId, Call, Result, ErrorCode, IsInteractive, PrtData, ResponseTime
+```
+
+**Check PRT usage:**
+```kql
+AllPerRequestTable
+| where env_time >= ago(7d)
+| where DevicePlatformForUI == "Android"
+| extend HasPRT = isnotempty(PrtData)
+| summarize
+    total_requests = count(),
+    prt_requests = countif(HasPRT),
+    success_rate = round(100.0 * countif(Result == "Success") / count(), 2)
+    by HasPRT
+```
+
+**Error patterns:**
+```kql
+AllPerRequestTable
+| where env_time >= ago(7d)
+| where DevicePlatformForUI == "Android"
+| where Result != "Success"
+| summarize error_count = count() by ErrorCode, SubErrorCode, Call
+| order by error_count desc
+| take 20
+```
+
+---
+
+## Query Optimization Tips
+
+| Tip | Example |
+|-----|---------|
+| **Always filter by time first** | `\| where EventInfo_Time >= ago(7d)` |
+| **Use `take` for exploration** | `\| take 1000` to prevent timeouts |
+| **Project early** | `\| project field1, field2` to reduce data |
+| **Use `dcount()` for unique counts** | `dcount(DeviceInfo_Id)` |
+| **Check field population** | `\| where isnotempty(error_code)` |
+| **Break long ranges** | Queries > 7 days may timeout |
+
+## Important Notes
+
+- **Sensitive Data:** `correlation_id` may be scrubbed to "Scrubbed" for privacy
+- **Field Availability:** Not all fields populated in all spans; use `isnotempty()`
+- **Cross-Cluster Joins:** Cannot directly join Android + eSTS clusters; correlate via CorrelationId
+- **PrtData Parsing:** Use `parse_json()` or `extend` to extract PRT fields
diff --git a/.github/skills/prompt-refiner/SKILL.md b/.github/skills/prompt-refiner/SKILL.md
new file mode 100644
index 00000000..acf94744
--- /dev/null
+++ b/.github/skills/prompt-refiner/SKILL.md
@@ -0,0 +1,270 @@
+---
+name: prompt-refiner
+description: Refine rough prompts into structured, high-quality prompts. Use this skill when the user has a vague request and wants to turn it into a well-structured prompt with clear objectives, constraints, and acceptance criteria. Triggers include "refine this prompt", "make this prompt better", "structure this request", or "help me write a better prompt".
+---
+
+# Prompt Refiner
+
+Transform rough, vague prompts into structured prompts that produce accurate, actionable results. 
+ +## References + +Use these templates in the `references/` folder based on task type: +- **[template-exploration.md](references/template-exploration.md)** - Understanding code, finding implementations, tracing flows +- **[template-feature.md](references/template-feature.md)** - Implementing new functionality, adding screens +- **[template-bugfix.md](references/template-bugfix.md)** - Investigating and fixing bugs, crashes +- **[template-telemetry.md](references/template-telemetry.md)** - Adding logging, events, instrumentation + +## Why This Matters + +Vague prompts lead to: +- Hallucinated file names and patterns +- Generic advice instead of specific guidance +- Missing validation steps +- Wasted iteration cycles + +Structured prompts lead to: +- Grounded responses with file paths and evidence +- Actionable next steps +- Built-in validation checkpoints +- Faster time-to-value + +## Refinement Workflow + +### Step 1: Analyze the Rough Prompt + +Identify what's missing: +- **Objective**: What is the actual goal? (Often buried or implied) +- **Scope**: What's in/out of bounds? +- **Constraints**: What rules must be followed? +- **Evidence requirements**: Should responses cite files/code? +- **Validation**: How will we know if the answer is correct? + +### Step 2: Ask Clarifying Questions (if needed) + +Before refining, ask the user 2-3 targeted questions: +- "Is this for new code or modifying existing code?" +- "Should this be behind a feature flag?" +- "What's the risk level? (experimental vs production-critical)" +- "Are there existing patterns in the codebase I should follow?" 
+ +### Step 3: Generate the Refined Prompt + +Use this template structure: + +```markdown +## Objective +[One clear sentence describing the goal] + +## Context +[Brief background if needed - what problem this solves, why now] + +## Constraints +- [Hard rule 1 - e.g., "Only reference files that exist in the repo"] +- [Hard rule 2 - e.g., "Do not modify existing public APIs"] +- [Hard rule 3 - e.g., "Must be behind a feature flag"] + +## Scope +**In scope:** +- [What should be addressed] + +**Out of scope:** +- [What should NOT be addressed] + +## Acceptance Criteria +- [ ] [Specific, verifiable criterion 1] +- [ ] [Specific, verifiable criterion 2] +- [ ] [Validation step - e.g., "Compile check passes"] + +## Output Format +[Specify what the response should look like - file paths, code snippets, plan, etc.] +``` + +### Step 4: Add Domain-Specific Constraints + +Based on the task type, add relevant constraints: + +**For code exploration/understanding:** +- "Provide file paths and line numbers for all references" +- "Do not guess patterns—search the codebase first" +- "Show the call flow with actual function names" + +**For new feature implementation:** +- "Wrap new functionality behind ExperimentationFeatureFlag" +- "Follow existing patterns in [similar feature area]" +- "Include unit test recommendations" + +**For bug fixes:** +- "Identify root cause before proposing fix" +- "List potential regression risks" +- "Include validation steps to confirm fix" + +**For refactoring:** +- "No behavioral changes—preserve existing functionality" +- "Show before/after for each change" +- "Ensure all existing tests still pass" + +**For telemetry/logging:** +- "No PII in any logged fields" +- "Include local validation approach" +- "Specify sampling/flag configuration" + +## Output Format + +When refining a prompt, provide: + +1. **The refined prompt** (ready to copy/paste) +2. **What was added** (brief explanation of key improvements) +3. 
**Suggested follow-ups** (what to ask next after getting results) + +## Examples + +### Example 1: Code Exploration + +**Rough prompt:** +> "Where is authentication handled in this app?" + +**Refined prompt:** +```markdown +## Objective +Identify the entry points and core components responsible for authentication in this codebase. + +## Constraints +- Only reference files that exist in the repo (provide file paths) +- Focus on the primary authentication flow, not edge cases +- Do not guess—search for actual implementations + +## Acceptance Criteria +- [ ] Entry point(s) identified with file paths +- [ ] Key classes/functions listed with their responsibilities +- [ ] Call flow documented (what calls what) +- [ ] Links to relevant configuration files + +## Output Format +Provide a brief architecture overview with file paths, then a numbered call flow. +``` + +**What was added:** Evidence requirements (file paths), scope limitation (primary flow), specific output format. + +--- + +### Example 2: New Feature + +**Rough prompt:** +> "Add a retry mechanism for failed API calls" + +**Refined prompt:** +```markdown +## Objective +Implement a retry mechanism for failed API calls with configurable retry count and backoff. + +## Context +Some API calls fail transiently due to network issues. We need automatic retry with exponential backoff. 
+ +## Constraints +- Use existing HTTP client infrastructure (do not add new libraries) +- Wrap behind ExperimentationFeatureFlag.API_RETRY +- Only retry on transient errors (5xx, timeout), not client errors (4xx) +- Maximum 3 retries with exponential backoff (1s, 2s, 4s) + +## Scope +**In scope:** Core retry logic, configuration, integration with existing client + +**Out of scope:** UI changes, offline handling, request queuing + +## Acceptance Criteria +- [ ] Retry logic implemented with configurable count +- [ ] Exponential backoff with jitter +- [ ] Feature flag integration +- [ ] Unit tests for retry scenarios (success after retry, max retries exceeded) +- [ ] Compile check passes: `.\gradlew app:compileProductionDebugKotlin` + +## Output Format +1. Implementation plan (which files to modify) +2. Code changes with file paths +3. Test cases to add +``` + +**What was added:** Specific behavior (which errors to retry), constraints (feature flag, no new libs), concrete acceptance criteria. + +--- + +### Example 3: Telemetry + +**Rough prompt:** +> "Add logging for the sign-in flow" + +**Refined prompt:** +```markdown +## Objective +Add telemetry events to track sign-in flow success, failure, and duration. + +## Constraints +- **No PII**: Do not log email, username, phone, device ID, or tokens +- Use existing telemetry service (SharedCoreLibrary logging) +- Events must be behind a feature flag or sampling config +- Each event must answer a specific business question + +## Event Requirements +For each event, define: +- Event name (namespaced: `signin_*`) +- Purpose (what question does this answer?) 
+- Fields (name, type, example, PII risk) +- Trigger condition + +## Acceptance Criteria +- [ ] 2-3 events defined with full schema +- [ ] Logging points identified (file paths + function names) +- [ ] Local validation approach documented +- [ ] No PII in any field +- [ ] Feature flag specified + +## Output Format +Event table, then implementation locations, then validation steps. +``` + +**What was added:** Explicit PII prohibition, schema requirements, validation approach. + +## Anti-Patterns to Avoid + +When refining prompts, watch for and fix these issues: + +| Anti-Pattern | Problem | Fix | +|--------------|---------|-----| +| "Make it good" | Subjective, unmeasurable | Add specific acceptance criteria | +| "Handle all cases" | Unbounded scope | Define in-scope vs out-of-scope | +| "Like other apps do" | Relies on assumptions | Reference specific patterns in THIS codebase | +| "ASAP" | Pressure without clarity | Define actual priority and constraints | +| No validation step | Can't verify correctness | Add "how do we know it's right?" 
| + +## Quick Reference: Constraint Templates + +Copy-paste these common constraints as needed: + +**Evidence-based responses:** +``` +- Only reference files that exist in the repo (provide file paths + line numbers) +- Do not guess patterns—search the codebase first +- Show actual code/config, not hypothetical examples +``` + +**Safe implementation:** +``` +- Wrap new functionality behind ExperimentationFeatureFlag.[FLAG_NAME] +- No breaking changes to existing public APIs +- Follow existing patterns in [similar area of codebase] +``` + +**Privacy/security:** +``` +- No PII in logs (email, phone, name, device ID, tokens) +- No hardcoded secrets or credentials +- Use SecureKeystoreLibrary for sensitive storage +``` + +**Validation:** +``` +- Compile check: `.\gradlew [module]:compileProductionDebugKotlin` +- Existing tests pass: `.\gradlew [module]:test` +- Manual verification steps documented +``` diff --git a/.github/skills/prompt-refiner/references/template-bugfix.md b/.github/skills/prompt-refiner/references/template-bugfix.md new file mode 100644 index 00000000..99da3b36 --- /dev/null +++ b/.github/skills/prompt-refiner/references/template-bugfix.md @@ -0,0 +1,182 @@ +# Prompt Template: Bug Fix + +Use this template when investigating and fixing bugs, crashes, or unexpected behavior. + +## Template + +```markdown +## Objective +Fix [bug/issue] where [symptom] occurs when [condition]. 
+
+## Observed Behavior
+- **What happens:** [Describe the bug]
+- **Expected:** [What should happen]
+- **Repro steps:** [How to reproduce]
+- **Frequency:** [Always / Sometimes / Rare]
+
+## Context
+- **Affected area:** [Screen/feature/flow]
+- **First noticed:** [When - release, commit, date]
+- **User impact:** [Severity - blocking, degraded, cosmetic]
+
+## Constraints
+- Identify root cause before proposing fix
+- Minimize change scope - fix the bug, don't refactor
+- No breaking changes to existing behavior
+- Add regression test to prevent recurrence
+
+## Investigation Steps
+1. [Where to look first]
+2. [What to trace]
+3. [How to reproduce locally]
+
+## Acceptance Criteria
+- [ ] Root cause identified with evidence
+- [ ] Fix addresses root cause (not just symptom)
+- [ ] Existing tests still pass
+- [ ] New test added covering this case
+- [ ] No regressions in related functionality
+- [ ] Compile check passes: `.\gradlew [module]:compileProductionDebugKotlin`
+
+## Output Format
+1. Root cause analysis
+2. Proposed fix with file paths
+3. Regression test to add
+4. Verification steps
+```
+
+## Examples
+
+### Crash Bug
+```markdown
+## Objective
+Fix crash in [FeatureActivity] when user [action] with [condition].
+
+## Observed Behavior
+- **What happens:** App crashes with NullPointerException
+- **Expected:** [Expected behavior]
+- **Repro steps:**
+  1. Open [screen]
+  2. [Action]
+  3. App crashes
+- **Frequency:** Always when [condition]
+
+## Context
+- **Affected area:** [Feature] flow
+- **First noticed:** After [version/commit]
+- **User impact:** Blocking - users cannot complete [task]
+- **Stack trace:**
+
+      java.lang.NullPointerException: ...
+        at com.microsoft.authenticator.[Class].[method]([File].kt:123)
+
+## Constraints
+- Identify why the null occurs, don't just add null checks everywhere
+- Preserve existing behavior for non-null cases
+- Add test that would have caught this
+
+## Investigation Steps
+1. 
Find the crash location from stack trace +2. Trace where the null value originates +3. Determine why it's null in this scenario +4. Check if this is a race condition, missing initialization, or bad data + +## Acceptance Criteria +- [ ] Root cause identified (why is it null?) +- [ ] Fix prevents null at source (not just null check at crash site) +- [ ] Unit test added that reproduces the scenario +- [ ] No new crashes in related flows +- [ ] Compile check passes + +## Output Format +Root cause → Fix → Test → Verification steps +``` + +### Logic Bug +```markdown +## Objective +Fix incorrect [behavior] where [wrong thing] happens instead of [right thing]. + +## Observed Behavior +- **What happens:** [Wrong behavior] +- **Expected:** [Correct behavior] +- **Repro steps:** [Steps] +- **Frequency:** [When it occurs] + +## Context +- **Affected area:** [Component] +- **First noticed:** [When] +- **User impact:** [Impact description] + +## Constraints +- Understand the intended logic before changing +- Check if this is a regression (was it ever correct?) +- Verify fix doesn't break other code paths + +## Investigation Steps +1. Find the logic that produces wrong result +2. Trace inputs to understand why wrong path is taken +3. Check for off-by-one, wrong comparison, missing condition +4. Review recent changes to this area + +## Acceptance Criteria +- [ ] Incorrect logic identified +- [ ] Fix produces correct behavior for all cases +- [ ] Edge cases considered (null, empty, boundary values) +- [ ] Test added covering the bug scenario +- [ ] Existing tests still pass + +## Output Format +Analysis → Root cause → Fix → Test cases +``` + +### UI Bug +```markdown +## Objective +Fix UI issue where [visual problem] appears in [location/condition]. 
+ +## Observed Behavior +- **What happens:** [Visual description - overlap, wrong color, missing element] +- **Expected:** [Correct appearance] +- **Repro steps:** [How to see it] +- **Affected configurations:** [Light/dark mode, screen sizes, languages] + +## Context +- **Affected screen:** [Screen name] +- **Component:** [Composable/View name] +- **User impact:** [Cosmetic / Confusing / Blocking] + +## Constraints +- Use CommonColors.kt for any color fixes +- Maintain accessibility (contrast, touch targets) +- Test both light and dark mode +- Check RTL layout if text-related + +## Investigation Steps +1. Identify the composable/view responsible +2. Check modifier order, constraints, theme usage +3. Test in both light/dark mode +4. Test on different screen sizes + +## Acceptance Criteria +- [ ] Visual issue resolved in all affected configurations +- [ ] Light mode correct +- [ ] Dark mode correct +- [ ] Accessibility maintained +- [ ] No regressions in related UI + +## Output Format +Problem location → Fix → Before/after description → Test checklist +``` + +## Key Constraints for Bug Fixes + +Always include root cause requirement and regression prevention: + +```markdown +- Identify root cause before proposing fix (don't just mask symptoms) +- Add regression test that would have caught this bug +- Minimize change scope - fix the bug, don't refactor unrelated code +- Verify fix doesn't break other code paths +``` diff --git a/.github/skills/prompt-refiner/references/template-exploration.md b/.github/skills/prompt-refiner/references/template-exploration.md new file mode 100644 index 00000000..b399e995 --- /dev/null +++ b/.github/skills/prompt-refiner/references/template-exploration.md @@ -0,0 +1,101 @@ +# Prompt Template: Code Exploration + +Use this template when you need to understand unfamiliar code, find where something is implemented, or trace a flow. + +## Template + +```markdown +## Objective +[Understand/Find/Trace] [specific thing] in the codebase. 
+ +## Context +[Why you need this - new to repo, investigating bug, planning feature, etc.] + +## Constraints +- Only reference files that exist in the repo (provide file paths + line numbers) +- Do not guess patterns—search the codebase first +- Focus on [primary flow / specific area], not edge cases + +## Questions to Answer +1. [Specific question 1 - e.g., "Where is the entry point?"] +2. [Specific question 2 - e.g., "What classes are involved?"] +3. [Specific question 3 - e.g., "How does data flow through?"] + +## Acceptance Criteria +- [ ] Entry point(s) identified with file paths +- [ ] Key components listed with responsibilities +- [ ] Call flow documented (what calls what) +- [ ] Relevant config/manifest entries noted + +## Output Format +Brief architecture overview, then numbered call flow with file paths. +``` + +## Examples + +### Finding Authentication Flow +```markdown +## Objective +Understand how user authentication is implemented in this app. + +## Context +I'm new to this codebase and need to add a new auth provider. + +## Constraints +- Only reference files that exist (provide file paths + line numbers) +- Do not guess—search the codebase first +- Focus on the primary login flow, not account recovery or MFA + +## Questions to Answer +1. Where does authentication start (UI entry point)? +2. What service/repository handles auth logic? +3. How are tokens stored and refreshed? +4. Where is the auth state managed? + +## Acceptance Criteria +- [ ] Login entry point identified +- [ ] Auth service/repository located +- [ ] Token storage mechanism found +- [ ] State management approach documented + +## Output Format +Architecture overview, then call flow from UI → service → storage. +``` + +### Tracing a Data Flow +```markdown +## Objective +Trace how [data type] flows from [source] to [destination]. + +## Context +Investigating why [data] sometimes appears incorrect in [location]. 
+ +## Constraints +- Provide file paths for each step in the flow +- Note any transformations or validations along the way +- Flag any async/background processing + +## Questions to Answer +1. Where does [data] originate? +2. What transformations occur? +3. Where is it persisted? +4. How does it reach [destination]? + +## Acceptance Criteria +- [ ] Complete data flow mapped with file paths +- [ ] Transformations documented +- [ ] Potential failure points identified + +## Output Format +Numbered flow diagram with file:line references. +``` + +## Key Constraints for Exploration + +Always include these to get grounded responses: + +```markdown +- Only reference files that exist in the repo (provide file paths + line numbers) +- Do not guess patterns—search the codebase first +- Show actual code snippets, not hypothetical examples +``` diff --git a/.github/skills/prompt-refiner/references/template-feature.md b/.github/skills/prompt-refiner/references/template-feature.md new file mode 100644 index 00000000..95dc1dd9 --- /dev/null +++ b/.github/skills/prompt-refiner/references/template-feature.md @@ -0,0 +1,146 @@ +# Prompt Template: New Feature Implementation + +Use this template when implementing new functionality, adding capabilities, or building new screens/flows. + +## Template + +```markdown +## Objective +Implement [feature] that [does what] for [who/what]. 
+ +## Context +[Why this feature is needed - user problem, business requirement, technical debt] + +## Constraints +- Wrap behind ExperimentationFeatureFlag.[FLAG_NAME] +- Use existing [patterns/libraries/infrastructure] - do not add new dependencies +- Follow patterns in [similar existing feature] +- No breaking changes to existing [APIs/behavior] + +## Scope +**In scope:** +- [Specific capability 1] +- [Specific capability 2] + +**Out of scope:** +- [What NOT to build] +- [Future enhancements to defer] + +## Technical Requirements +- [Requirement 1 - e.g., "Must work offline"] +- [Requirement 2 - e.g., "Response time < 200ms"] +- [Requirement 3 - e.g., "Support Android API 26+"] + +## Acceptance Criteria +- [ ] [Functional criterion 1] +- [ ] [Functional criterion 2] +- [ ] Feature flag integration working +- [ ] Unit tests added for [key logic] +- [ ] Compile check passes: `.\gradlew [module]:compileProductionDebugKotlin` + +## Output Format +1. Implementation plan (files to create/modify) +2. Code changes with file paths +3. Test cases to add +``` + +## Examples + +### Adding a Retry Mechanism +```markdown +## Objective +Implement automatic retry for failed API calls with exponential backoff. + +## Context +Users experience intermittent failures due to network issues. Automatic retry will improve reliability. 
+ +## Constraints +- Wrap behind ExperimentationFeatureFlag.API_RETRY_ENABLED +- Use existing OkHttp client - do not add new HTTP libraries +- Only retry transient errors (5xx, timeout), not client errors (4xx) +- Maximum 3 retries with exponential backoff (1s, 2s, 4s) + +## Scope +**In scope:** +- Retry logic with configurable count +- Exponential backoff with jitter +- Logging of retry attempts + +**Out of scope:** +- UI indication of retries +- Offline queue/sync +- Per-endpoint retry configuration + +## Technical Requirements +- Thread-safe implementation +- Configurable via remote config +- No memory leaks from pending retries + +## Acceptance Criteria +- [ ] Retry logic triggers on 5xx and timeout +- [ ] Does not retry on 4xx errors +- [ ] Respects max retry count +- [ ] Backoff timing is correct (1s, 2s, 4s + jitter) +- [ ] Feature flag disables all retry behavior +- [ ] Unit tests cover: success, retry-then-success, max-retries-exceeded +- [ ] Compile check passes + +## Output Format +Implementation plan, then code, then tests. +``` + +### Adding a New Screen (Compose) +```markdown +## Objective +Create a new [ScreenName] screen that [displays/allows] [what]. 
+ +## Context +[Why this screen is needed] + +## Constraints +- Use Jetpack Compose (not XML layouts) +- Colors from CommonColors.kt only +- Strings from strings.xml (no hardcoded text) +- Follow patterns in [similar existing screen] +- Wrap navigation behind ExperimentationFeatureFlag.[FLAG] + +## Scope +**In scope:** +- Screen UI with [components] +- ViewModel with [state] +- Navigation from [source] + +**Out of scope:** +- [Related screen] +- [Advanced feature] + +## Technical Requirements +- Support light/dark mode (via CommonColors) +- Accessible (content descriptions, focusable elements) +- Handle loading/error/empty states + +## Acceptance Criteria +- [ ] Screen renders correctly in light and dark mode +- [ ] All strings are localized (in strings.xml) +- [ ] ViewModel unit tests added +- [ ] Navigation works from [source] +- [ ] Feature flag gates access +- [ ] Compile check passes + +## Output Format +1. File structure (new files to create) +2. ViewModel implementation +3. Composable implementation +4. Navigation wiring +5. Tests +``` + +## Key Constraints for Features + +Always include feature flag and pattern-following: + +```markdown +- Wrap behind ExperimentationFeatureFlag.[FLAG_NAME] +- Follow existing patterns in [similar feature area] +- Use existing infrastructure - do not add new libraries without approval +``` diff --git a/.github/skills/prompt-refiner/references/template-telemetry.md b/.github/skills/prompt-refiner/references/template-telemetry.md new file mode 100644 index 00000000..bb9bf01a --- /dev/null +++ b/.github/skills/prompt-refiner/references/template-telemetry.md @@ -0,0 +1,189 @@ +# Prompt Template: Telemetry & Logging + +Use this template when adding telemetry events, logging, or instrumentation. + +## Template + +```markdown +## Objective +Add telemetry to [track/measure/understand] [what] in [feature/flow]. + +## Context +[Why this telemetry is needed - what question does it answer?] 
+ +## Constraints +- **No PII**: Do not log email, phone, username, device ID, IP, or tokens +- Use existing telemetry infrastructure (do not add new logging libraries) +- Events must be behind a feature flag or sampling config +- Each event must answer a specific business/engineering question + +## Event Schema +For each event, define: + +| Field | Description | +|-------|-------------| +| Event name | Namespaced name (e.g., `feature_action_result`) | +| Purpose | What question does this answer? | +| Fields | Name, type, example value, PII risk | +| Trigger | When exactly is this logged? | + +## Events to Add + +### Event 1: [event_name] +- **Purpose:** [Question it answers] +- **Trigger:** [When logged] +- **Fields:** + | Name | Type | Example | PII Risk | + |------|------|---------|----------| + | field1 | string | "value" | None | + | field2 | int | 123 | None | + +### Event 2: [event_name] +[Same structure] + +## Acceptance Criteria +- [ ] Events defined with full schema +- [ ] No PII in any field (verified) +- [ ] Logging points identified (file paths + functions) +- [ ] Feature flag or sampling configured +- [ ] Local validation documented (how to see logs) +- [ ] Privacy review checklist completed + +## Output Format +1. Event definitions (table format) +2. Implementation locations (file paths) +3. Local validation steps +4. Privacy checklist +``` + +## Examples + +### Feature Usage Telemetry +```markdown +## Objective +Add telemetry to track usage patterns for [Feature X] to understand adoption and success rate. + +## Context +Product needs to know: How many users try Feature X? How many succeed? Where do they drop off? 
+ +## Constraints +- **No PII**: No user identifiers, emails, or device IDs +- Use existing AriaLogger from SharedCoreLibrary +- Behind ExperimentationFeatureFlag.FEATURE_X_TELEMETRY +- Sample at 100% initially, can reduce if volume too high + +## Events to Add + +### Event 1: feature_x_started +- **Purpose:** Track feature entry rate +- **Trigger:** User opens Feature X screen +- **Fields:** + | Name | Type | Example | PII Risk | + |------|------|---------|----------| + | entry_point | string | "settings" | None | + | timestamp | long | 1704067200000 | None | + +### Event 2: feature_x_completed +- **Purpose:** Measure success rate +- **Trigger:** User successfully completes Feature X flow +- **Fields:** + | Name | Type | Example | PII Risk | + |------|------|---------|----------| + | duration_ms | long | 5432 | None | + | steps_completed | int | 3 | None | + +### Event 3: feature_x_abandoned +- **Purpose:** Understand drop-off points +- **Trigger:** User exits Feature X without completing +- **Fields:** + | Name | Type | Example | PII Risk | + |------|------|---------|----------| + | last_step | string | "confirmation" | None | + | duration_ms | long | 2100 | None | + | reason | string | "back_pressed" | None | + +## Acceptance Criteria +- [ ] 3 events capture full funnel (start → complete/abandon) +- [ ] No PII in any field +- [ ] Events logged in correct locations +- [ ] Feature flag works (off = no events) +- [ ] Can see events in local debug logs +- [ ] Privacy review: confirmed safe + +## Output Format +Event table → Implementation locations → Local test steps → Privacy checklist +``` + +### Error Telemetry +```markdown +## Objective +Add telemetry to track and categorize errors in [Component] for debugging and alerting. + +## Context +We're seeing user reports of failures but don't have visibility into error rates or types. 
+ +## Constraints +- **No PII**: No stack traces with variable values, no request bodies, no tokens +- Use error hashing (not full messages) to group similar errors +- Include enough context to debug, not enough to identify users +- Behind sampling config (start at 10%) + +## Events to Add + +### Event 1: component_error +- **Purpose:** Track error rate and categorization +- **Trigger:** Caught exception in [Component] +- **Fields:** + | Name | Type | Example | PII Risk | + |------|------|---------|----------| + | error_type | string | "NetworkTimeout" | None | + | error_hash | string | "a1b2c3d4" | None - hash only | + | component | string | "AuthService" | None | + | operation | string | "tokenRefresh" | None | + | http_status | int | 503 | None | + +- **Explicitly excluded (PII risk):** + - error_message (may contain user data) + - stack_trace (may contain file paths with usernames) + - request_url (may contain tokens) + - response_body (may contain PII) + +## Acceptance Criteria +- [ ] Error categorization is useful for debugging +- [ ] No PII in logged fields (verified with examples) +- [ ] Sampling configured to prevent flood +- [ ] Can query by error_type and component +- [ ] Local validation shows events firing + +## Output Format +Event schema → Exclusion list (what NOT to log) → Implementation → Validation +``` + +## PII Reference: What NOT to Log + +Always verify against this list: + +| Field Type | Risk | Alternative | +|------------|------|-------------| +| Email | PII | Don't log, or hash | +| Phone number | PII | Don't log | +| Username / Display name | PII | Don't log | +| Device ID | Tracking | Don't log, or hash | +| IP address | Location/Identity | Don't log | +| Full stack trace | May contain PII | Use error hash | +| Request/Response body | May contain credentials | Log operation name only | +| File paths | May contain username | Use relative paths | +| Tokens / Credentials | Security | Never log | +| Account ID | Semi-PII | Hash if needed | + 
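The "hash, don't log" rule above can be made concrete with a small sketch. This is a minimal Python illustration only (the helper names are hypothetical; a real implementation would emit through the repo's existing telemetry infrastructure rather than build a raw dict):

```python
import hashlib

def error_hash(error_type: str, operation: str) -> str:
    """Group similar errors without logging the raw message, which may contain PII."""
    # Hash only stable, non-PII identifiers; never the message, stack trace, or URL.
    return hashlib.sha256(f"{error_type}:{operation}".encode("utf-8")).hexdigest()[:8]

def build_component_error_event(error_type: str, component: str,
                                operation: str, http_status: int) -> dict:
    # Mirrors the component_error schema above: every field is either an
    # enum-like category or a short hash, so nothing user-specific is emitted.
    return {
        "event": "component_error",
        "error_type": error_type,
        "error_hash": error_hash(error_type, operation),
        "component": component,
        "operation": operation,
        "http_status": http_status,
    }

event = build_component_error_event("NetworkTimeout", "AuthService", "tokenRefresh", 503)
print(event["error_hash"])  # deterministic short hash, safe to group and query by
```

Because the hash is derived only from the error category and operation, identical failures collapse into one bucket for querying, while the excluded fields (message, stack trace, body) never leave the device.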
+## Key Constraints for Telemetry + +Always include explicit PII prohibition: + +```markdown +- **No PII**: Do not log email, phone, username, device ID, IP address, or tokens +- Use existing telemetry infrastructure (SharedCoreLibrary logging) +- Behind feature flag or sampling configuration +- Each event answers a specific question (no "log everything") +- Include local validation steps +``` diff --git a/.github/skills/skill-creator/LICENSE.txt b/.github/skills/skill-creator/LICENSE.txt new file mode 100644 index 00000000..7a4a3ea2 --- /dev/null +++ b/.github/skills/skill-creator/LICENSE.txt @@ -0,0 +1,202 @@ + + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. 
+ + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." 
+ + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. 
You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. 
+ + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. 
In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. 
+ + Copyright [yyyy] [name of copyright owner] + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. \ No newline at end of file diff --git a/.github/skills/skill-creator/SKILL.md b/.github/skills/skill-creator/SKILL.md new file mode 100644 index 00000000..b7f86598 --- /dev/null +++ b/.github/skills/skill-creator/SKILL.md @@ -0,0 +1,356 @@ +--- +name: skill-creator +description: Guide for creating effective skills. This skill should be used when users want to create a new skill (or update an existing skill) that extends Claude's capabilities with specialized knowledge, workflows, or tool integrations. +license: Complete terms in LICENSE.txt +--- + +# Skill Creator + +This skill provides guidance for creating effective skills. + +## About Skills + +Skills are modular, self-contained packages that extend Claude's capabilities by providing +specialized knowledge, workflows, and tools. Think of them as "onboarding guides" for specific +domains or tasks—they transform Claude from a general-purpose agent into a specialized agent +equipped with procedural knowledge that no model can fully possess. + +### What Skills Provide + +1. Specialized workflows - Multi-step procedures for specific domains +2. Tool integrations - Instructions for working with specific file formats or APIs +3. Domain expertise - Company-specific knowledge, schemas, business logic +4. 
Bundled resources - Scripts, references, and assets for complex and repetitive tasks + +## Core Principles + +### Concise is Key + +The context window is a public good. Skills share the context window with everything else Claude needs: system prompt, conversation history, other Skills' metadata, and the actual user request. + +**Default assumption: Claude is already very smart.** Only add context Claude doesn't already have. Challenge each piece of information: "Does Claude really need this explanation?" and "Does this paragraph justify its token cost?" + +Prefer concise examples over verbose explanations. + +### Set Appropriate Degrees of Freedom + +Match the level of specificity to the task's fragility and variability: + +**High freedom (text-based instructions)**: Use when multiple approaches are valid, decisions depend on context, or heuristics guide the approach. + +**Medium freedom (pseudocode or scripts with parameters)**: Use when a preferred pattern exists, some variation is acceptable, or configuration affects behavior. + +**Low freedom (specific scripts, few parameters)**: Use when operations are fragile and error-prone, consistency is critical, or a specific sequence must be followed. + +Think of Claude as exploring a path: a narrow bridge with cliffs needs specific guardrails (low freedom), while an open field allows many routes (high freedom). + +### Anatomy of a Skill + +Every skill consists of a required SKILL.md file and optional bundled resources: + +``` +skill-name/ +├── SKILL.md (required) +│ ├── YAML frontmatter metadata (required) +│ │ ├── name: (required) +│ │ └── description: (required) +│ └── Markdown instructions (required) +└── Bundled Resources (optional) + ├── scripts/ - Executable code (Python/Bash/etc.) + ├── references/ - Documentation intended to be loaded into context as needed + └── assets/ - Files used in output (templates, icons, fonts, etc.) 
+```
+
+#### SKILL.md (required)
+
+Every SKILL.md consists of:
+
+- **Frontmatter** (YAML): Contains `name` and `description` fields. These are the only fields that Claude reads to determine when the skill gets used, so it is very important to be clear and comprehensive in describing what the skill is and when it should be used.
+- **Body** (Markdown): Instructions and guidance for using the skill. Only loaded AFTER the skill triggers (if at all).
+
+#### Bundled Resources (optional)
+
+##### Scripts (`scripts/`)
+
+Executable code (Python/Bash/etc.) for tasks that require deterministic reliability or are repeatedly rewritten.
+
+- **When to include**: When the same code is being rewritten repeatedly or deterministic reliability is needed
+- **Example**: `scripts/rotate_pdf.py` for PDF rotation tasks
+- **Benefits**: Token efficient, deterministic, may be executed without loading into context
+- **Note**: Scripts may still need to be read by Claude for patching or environment-specific adjustments
+
+##### References (`references/`)
+
+Documentation and reference material intended to be loaded as needed into context to inform Claude's process and thinking.
+
+- **When to include**: For documentation that Claude should reference while working
+- **Examples**: `references/finance.md` for financial schemas, `references/mnda.md` for company NDA template, `references/policies.md` for company policies, `references/api_docs.md` for API specifications
+- **Use cases**: Database schemas, API documentation, domain knowledge, company policies, detailed workflow guides
+- **Benefits**: Keeps SKILL.md lean, loaded only when Claude determines it's needed
+- **Best practice**: If files are large (>10k words), include grep search patterns in SKILL.md
+- **Avoid duplication**: Information should live in either SKILL.md or references files, not both.
Prefer references files for detailed information unless it's truly core to the skill—this keeps SKILL.md lean while making information discoverable without hogging the context window. Keep only essential procedural instructions and workflow guidance in SKILL.md; move detailed reference material, schemas, and examples to references files.
+
+##### Assets (`assets/`)
+
+Files not intended to be loaded into context, but rather used within the output Claude produces.
+
+- **When to include**: When the skill needs files that will be used in the final output
+- **Examples**: `assets/logo.png` for brand assets, `assets/slides.pptx` for PowerPoint templates, `assets/frontend-template/` for HTML/React boilerplate, `assets/font.ttf` for typography
+- **Use cases**: Templates, images, icons, boilerplate code, fonts, sample documents that get copied or modified
+- **Benefits**: Separates output resources from documentation, enables Claude to use files without loading them into context
+
+#### What Not to Include in a Skill
+
+A skill should only contain essential files that directly support its functionality. Do NOT create extraneous documentation or auxiliary files, including:
+
+- README.md
+- INSTALLATION_GUIDE.md
+- QUICK_REFERENCE.md
+- CHANGELOG.md
+- etc.
+
+The skill should only contain the information needed for an AI agent to do the job at hand. It should not contain auxiliary context about the process that went into creating it, setup and testing procedures, user-facing documentation, etc. Creating additional documentation files just adds clutter and confusion.
+
+### Progressive Disclosure Design Principle
+
+Skills use a three-level loading system to manage context efficiently:
+
+1. **Metadata (name + description)** - Always in context (~100 words)
+2. **SKILL.md body** - When skill triggers (<5k words)
+3.
**Bundled resources** - As needed by Claude (unlimited, since scripts can be executed without being read into the context window)
+
+#### Progressive Disclosure Patterns
+
+Keep the SKILL.md body to the essentials and under 500 lines to minimize context bloat. Split content into separate files when approaching this limit. When splitting out content into other files, it is very important to reference them from SKILL.md and describe clearly when to read them, to ensure the reader of the skill knows they exist and when to use them.
+
+**Key principle:** When a skill supports multiple variations, frameworks, or options, keep only the core workflow and selection guidance in SKILL.md. Move variant-specific details (patterns, examples, configuration) into separate reference files.
+
+**Pattern 1: High-level guide with references**
+
+```markdown
+# PDF Processing
+
+## Quick start
+
+Extract text with pdfplumber:
+[code example]
+
+## Advanced features
+
+- **Form filling**: See [FORMS.md](FORMS.md) for complete guide
+- **API reference**: See [REFERENCE.md](REFERENCE.md) for all methods
+- **Examples**: See [EXAMPLES.md](EXAMPLES.md) for common patterns
+```
+
+Claude loads FORMS.md, REFERENCE.md, or EXAMPLES.md only when needed.
+
+**Pattern 2: Domain-specific organization**
+
+For Skills with multiple domains, organize content by domain to avoid loading irrelevant context:
+
+```
+bigquery-skill/
+├── SKILL.md (overview and navigation)
+└── reference/
+    ├── finance.md (revenue, billing metrics)
+    ├── sales.md (opportunities, pipeline)
+    ├── product.md (API usage, features)
+    └── marketing.md (campaigns, attribution)
+```
+
+When a user asks about sales metrics, Claude only reads sales.md.
+ +Similarly, for skills supporting multiple frameworks or variants, organize by variant: + +``` +cloud-deploy/ +├── SKILL.md (workflow + provider selection) +└── references/ + ├── aws.md (AWS deployment patterns) + ├── gcp.md (GCP deployment patterns) + └── azure.md (Azure deployment patterns) +``` + +When the user chooses AWS, Claude only reads aws.md. + +**Pattern 3: Conditional details** + +Show basic content, link to advanced content: + +```markdown +# DOCX Processing + +## Creating documents + +Use docx-js for new documents. See [DOCX-JS.md](DOCX-JS.md). + +## Editing documents + +For simple edits, modify the XML directly. + +**For tracked changes**: See [REDLINING.md](REDLINING.md) +**For OOXML details**: See [OOXML.md](OOXML.md) +``` + +Claude reads REDLINING.md or OOXML.md only when the user needs those features. + +**Important guidelines:** + +- **Avoid deeply nested references** - Keep references one level deep from SKILL.md. All reference files should link directly from SKILL.md. +- **Structure longer reference files** - For files longer than 100 lines, include a table of contents at the top so Claude can see the full scope when previewing. + +## Skill Creation Process + +Skill creation involves these steps: + +1. Understand the skill with concrete examples +2. Plan reusable skill contents (scripts, references, assets) +3. Initialize the skill (run init_skill.py) +4. Edit the skill (implement resources and write SKILL.md) +5. Package the skill (run package_skill.py) +6. Iterate based on real usage + +Follow these steps in order, skipping only if there is a clear reason why they are not applicable. + +### Step 1: Understanding the Skill with Concrete Examples + +Skip this step only when the skill's usage patterns are already clearly understood. It remains valuable even when working with an existing skill. + +To create an effective skill, clearly understand concrete examples of how the skill will be used. 
This understanding can come from either direct user examples or generated examples that are validated with user feedback.
+
+For example, when building an image-editor skill, relevant questions include:
+
+- "What functionality should the image-editor skill support? Editing, rotating, anything else?"
+- "Can you give some examples of how this skill would be used?"
+- "I can imagine users asking for things like 'Remove the red-eye from this image' or 'Rotate this image'. Are there other ways you imagine this skill being used?"
+- "What would a user say that should trigger this skill?"
+
+To avoid overwhelming users, don't ask too many questions in a single message. Start with the most important questions and follow up as needed.
+
+Conclude this step when there is a clear sense of the functionality the skill should support.
+
+### Step 2: Planning the Reusable Skill Contents
+
+To turn concrete examples into an effective skill, analyze each example by:
+
+1. Considering how to execute on the example from scratch
+2. Identifying what scripts, references, and assets would be helpful when executing these workflows repeatedly
+
+Example: When building a `pdf-editor` skill to handle queries like "Help me rotate this PDF," the analysis shows:
+
+1. Rotating a PDF requires re-writing the same code each time
+2. A `scripts/rotate_pdf.py` script would be helpful to store in the skill
+
+Example: When designing a `frontend-webapp-builder` skill for queries like "Build me a todo app" or "Build me a dashboard to track my steps," the analysis shows:
+
+1. Writing a frontend webapp requires the same boilerplate HTML/React each time
+2. An `assets/hello-world/` template containing the boilerplate HTML/React project files would be helpful to store in the skill
+
+Example: When building a `big-query` skill to handle queries like "How many users have logged in today?" the analysis shows:
+
+1.
Querying BigQuery requires re-discovering the table schemas and relationships each time
+2. A `references/schema.md` file documenting the table schemas would be helpful to store in the skill
+
+To establish the skill's contents, analyze each concrete example to create a list of the reusable resources to include: scripts, references, and assets.
+
+### Step 3: Initializing the Skill
+
+At this point, it is time to actually create the skill.
+
+Skip this step only if the skill being developed already exists, and iteration or packaging is needed. In this case, continue to the next step.
+
+When creating a new skill from scratch, always run the `init_skill.py` script. The script conveniently generates a new template skill directory that automatically includes everything a skill requires, making the skill creation process much more efficient and reliable.
+
+Usage:
+
+```bash
+scripts/init_skill.py <skill-name> --path <output-directory>
+```
+
+The script:
+
+- Creates the skill directory at the specified path
+- Generates a SKILL.md template with proper frontmatter and TODO placeholders
+- Creates example resource directories: `scripts/`, `references/`, and `assets/`
+- Adds example files in each directory that can be customized or deleted
+
+After initialization, customize or remove the generated SKILL.md and example files as needed.
+
+### Step 4: Edit the Skill
+
+When editing the (newly-generated or existing) skill, remember that the skill is being created for another instance of Claude to use. Include information that would be beneficial and non-obvious to Claude. Consider what procedural knowledge, domain-specific details, or reusable assets would help another Claude instance execute these tasks more effectively.
+
+#### Learn Proven Design Patterns
+
+Consult these helpful guides based on the skill's needs:
+
+- **Multi-step processes**: See references/workflows.md for sequential workflows and conditional logic
+- **Specific output formats or quality standards**: See references/output-patterns.md for template and example patterns
+
+These files contain established best practices for effective skill design.
+
+#### Start with Reusable Skill Contents
+
+To begin implementation, start with the reusable resources identified above: `scripts/`, `references/`, and `assets/` files. Note that this step may require user input. For example, when implementing a `brand-guidelines` skill, the user may need to provide brand assets or templates to store in `assets/`, or documentation to store in `references/`.
+
+Added scripts must be tested by actually running them to ensure there are no bugs and that the output matches what is expected. If there are many similar scripts, only a representative sample needs to be tested to ensure confidence that they all work while balancing time to completion.
+
+Any example files and directories not needed for the skill should be deleted. The initialization script creates example files in `scripts/`, `references/`, and `assets/` to demonstrate structure, but most skills won't need all of them.
+
+#### Update SKILL.md
+
+**Writing Guidelines:** Always use imperative/infinitive form.
+
+##### Frontmatter
+
+Write the YAML frontmatter with `name` and `description`:
+
+- `name`: The skill name
+- `description`: This is the primary triggering mechanism for your skill, and helps Claude understand when to use the skill.
+  - Include both what the Skill does and specific triggers/contexts for when to use it.
+  - Include all "when to use" information here, not in the body. The body is only loaded after triggering, so "When to Use This Skill" sections in the body are not helpful to Claude.
+  - Example description for a `docx` skill: "Comprehensive document creation, editing, and analysis with support for tracked changes, comments, formatting preservation, and text extraction. Use when Claude needs to work with professional documents (.docx files) for: (1) Creating new documents, (2) Modifying or editing content, (3) Working with tracked changes, (4) Adding comments, or any other document tasks"
+
+Do not include any other fields in YAML frontmatter.
+
+##### Body
+
+Write instructions for using the skill and its bundled resources.
+
+### Step 5: Packaging a Skill
+
+Once development of the skill is complete, it must be packaged into a distributable .skill file that gets shared with the user. The packaging process automatically validates the skill first to ensure it meets all requirements:
+
+```bash
+scripts/package_skill.py <path/to/skill>
+```
+
+Optional output directory specification:
+
+```bash
+scripts/package_skill.py <path/to/skill> ./dist
+```
+
+The packaging script will:
+
+1. **Validate** the skill automatically, checking:
+
+   - YAML frontmatter format and required fields
+   - Skill naming conventions and directory structure
+   - Description completeness and quality
+   - File organization and resource references
+
+2. **Package** the skill if validation passes, creating a .skill file named after the skill (e.g., `my-skill.skill`) that includes all files and maintains the proper directory structure for distribution. The .skill file is a zip file with a .skill extension.
+
+If validation fails, the script will report the errors and exit without creating a package. Fix any validation errors and run the packaging command again.
+
+### Step 6: Iterate
+
+After testing the skill, users may request improvements. Often this happens right after using the skill, with fresh context of how the skill performed.
+
+**Iteration workflow:**
+
+1. Use the skill on real tasks
+2. Notice struggles or inefficiencies
+3. Identify how SKILL.md or bundled resources should be updated
+4.
Implement changes and test again diff --git a/.github/skills/skill-creator/references/output-patterns.md b/.github/skills/skill-creator/references/output-patterns.md new file mode 100644 index 00000000..073ddda5 --- /dev/null +++ b/.github/skills/skill-creator/references/output-patterns.md @@ -0,0 +1,82 @@ +# Output Patterns + +Use these patterns when skills need to produce consistent, high-quality output. + +## Template Pattern + +Provide templates for output format. Match the level of strictness to your needs. + +**For strict requirements (like API responses or data formats):** + +```markdown +## Report structure + +ALWAYS use this exact template structure: + +# [Analysis Title] + +## Executive summary +[One-paragraph overview of key findings] + +## Key findings +- Finding 1 with supporting data +- Finding 2 with supporting data +- Finding 3 with supporting data + +## Recommendations +1. Specific actionable recommendation +2. Specific actionable recommendation +``` + +**For flexible guidance (when adaptation is useful):** + +```markdown +## Report structure + +Here is a sensible default format, but use your best judgment: + +# [Analysis Title] + +## Executive summary +[Overview] + +## Key findings +[Adapt sections based on what you discover] + +## Recommendations +[Tailor to the specific context] + +Adjust sections as needed for the specific analysis type. 
+``` + +## Examples Pattern + +For skills where output quality depends on seeing examples, provide input/output pairs: + +````markdown +## Commit message format + +Generate commit messages following these examples: + +**Example 1:** +Input: Added user authentication with JWT tokens +Output: +``` +feat(auth): implement JWT-based authentication + +Add login endpoint and token validation middleware +``` + +**Example 2:** +Input: Fixed bug where dates displayed incorrectly in reports +Output: +``` +fix(reports): correct date formatting in timezone conversion + +Use UTC timestamps consistently across report generation +``` + +Follow this style: type(scope): brief description, then detailed explanation. +```` + +Examples help Claude understand the desired style and level of detail more clearly than descriptions alone. diff --git a/.github/skills/skill-creator/references/workflows.md b/.github/skills/skill-creator/references/workflows.md new file mode 100644 index 00000000..a350c3cc --- /dev/null +++ b/.github/skills/skill-creator/references/workflows.md @@ -0,0 +1,28 @@ +# Workflow Patterns + +## Sequential Workflows + +For complex tasks, break operations into clear, sequential steps. It is often helpful to give Claude an overview of the process towards the beginning of SKILL.md: + +```markdown +Filling a PDF form involves these steps: + +1. Analyze the form (run analyze_form.py) +2. Create field mapping (edit fields.json) +3. Validate mapping (run validate_fields.py) +4. Fill the form (run fill_form.py) +5. Verify output (run verify_output.py) +``` + +## Conditional Workflows + +For tasks with branching logic, guide Claude through decision points: + +```markdown +1. Determine the modification type: + **Creating new content?** → Follow "Creation workflow" below + **Editing existing content?** → Follow "Editing workflow" below + +2. Creation workflow: [steps] +3.
Editing workflow: [steps] +``` \ No newline at end of file diff --git a/.github/skills/skill-creator/scripts/init_skill.py b/.github/skills/skill-creator/scripts/init_skill.py new file mode 100644 index 00000000..329ad4e5 --- /dev/null +++ b/.github/skills/skill-creator/scripts/init_skill.py @@ -0,0 +1,303 @@ +#!/usr/bin/env python3 +""" +Skill Initializer - Creates a new skill from template + +Usage: + init_skill.py <skill-name> --path <path> + +Examples: + init_skill.py my-new-skill --path skills/public + init_skill.py my-api-helper --path skills/private + init_skill.py custom-skill --path /custom/location +""" + +import sys +from pathlib import Path + + +SKILL_TEMPLATE = """--- +name: {skill_name} +description: [TODO: Complete and informative explanation of what the skill does and when to use it. Include WHEN to use this skill - specific scenarios, file types, or tasks that trigger it.] +--- + +# {skill_title} + +## Overview + +[TODO: 1-2 sentences explaining what this skill enables] + +## Structuring This Skill + +[TODO: Choose the structure that best fits this skill's purpose. Common patterns: + +**1. Workflow-Based** (best for sequential processes) +- Works well when there are clear step-by-step procedures +- Example: DOCX skill with "Workflow Decision Tree" → "Reading" → "Creating" → "Editing" +- Structure: ## Overview → ## Workflow Decision Tree → ## Step 1 → ## Step 2... + +**2. Task-Based** (best for tool collections) +- Works well when the skill offers different operations/capabilities +- Example: PDF skill with "Quick Start" → "Merge PDFs" → "Split PDFs" → "Extract Text" +- Structure: ## Overview → ## Quick Start → ## Task Category 1 → ## Task Category 2... + +**3. Reference/Guidelines** (best for standards or specifications) +- Works well for brand guidelines, coding standards, or requirements +- Example: Brand styling with "Brand Guidelines" → "Colors" → "Typography" → "Features" +- Structure: ## Overview → ## Guidelines → ## Specifications → ## Usage... + +**4.
Capabilities-Based** (best for integrated systems) +- Works well when the skill provides multiple interrelated features +- Example: Product Management with "Core Capabilities" → numbered capability list +- Structure: ## Overview → ## Core Capabilities → ### 1. Feature → ### 2. Feature... + +Patterns can be mixed and matched as needed. Most skills combine patterns (e.g., start with task-based, add workflow for complex operations). + +Delete this entire "Structuring This Skill" section when done - it's just guidance.] + +## [TODO: Replace with the first main section based on chosen structure] + +[TODO: Add content here. See examples in existing skills: +- Code samples for technical skills +- Decision trees for complex workflows +- Concrete examples with realistic user requests +- References to scripts/templates/references as needed] + +## Resources + +This skill includes example resource directories that demonstrate how to organize different types of bundled resources: + +### scripts/ +Executable code (Python/Bash/etc.) that can be run directly to perform specific operations. + +**Examples from other skills:** +- PDF skill: `fill_fillable_fields.py`, `extract_form_field_info.py` - utilities for PDF manipulation +- DOCX skill: `document.py`, `utilities.py` - Python modules for document processing + +**Appropriate for:** Python scripts, shell scripts, or any executable code that performs automation, data processing, or specific operations. + +**Note:** Scripts may be executed without loading into context, but can still be read by Claude for patching or environment adjustments. + +### references/ +Documentation and reference material intended to be loaded into context to inform Claude's process and thinking. 
+ +**Examples from other skills:** +- Product management: `communication.md`, `context_building.md` - detailed workflow guides +- BigQuery: API reference documentation and query examples +- Finance: Schema documentation, company policies + +**Appropriate for:** In-depth documentation, API references, database schemas, comprehensive guides, or any detailed information that Claude should reference while working. + +### assets/ +Files not intended to be loaded into context, but rather used within the output Claude produces. + +**Examples from other skills:** +- Brand styling: PowerPoint template files (.pptx), logo files +- Frontend builder: HTML/React boilerplate project directories +- Typography: Font files (.ttf, .woff2) + +**Appropriate for:** Templates, boilerplate code, document templates, images, icons, fonts, or any files meant to be copied or used in the final output. + +--- + +**Any unneeded directories can be deleted.** Not every skill requires all three types of resources. +""" + +EXAMPLE_SCRIPT = '''#!/usr/bin/env python3 +""" +Example helper script for {skill_name} + +This is a placeholder script that can be executed directly. +Replace with actual implementation or delete if not needed. + +Example real scripts from other skills: +- pdf/scripts/fill_fillable_fields.py - Fills PDF form fields +- pdf/scripts/convert_pdf_to_images.py - Converts PDF pages to images +""" + +def main(): + print("This is an example script for {skill_name}") + # TODO: Add actual script logic here + # This could be data processing, file conversion, API calls, etc. + +if __name__ == "__main__": + main() +''' + +EXAMPLE_REFERENCE = """# Reference Documentation for {skill_title} + +This is a placeholder for detailed reference documentation. +Replace with actual reference content or delete if not needed. 
+ +Example real reference docs from other skills: +- product-management/references/communication.md - Comprehensive guide for status updates +- product-management/references/context_building.md - Deep-dive on gathering context +- bigquery/references/ - API references and query examples + +## When Reference Docs Are Useful + +Reference docs are ideal for: +- Comprehensive API documentation +- Detailed workflow guides +- Complex multi-step processes +- Information too lengthy for main SKILL.md +- Content that's only needed for specific use cases + +## Structure Suggestions + +### API Reference Example +- Overview +- Authentication +- Endpoints with examples +- Error codes +- Rate limits + +### Workflow Guide Example +- Prerequisites +- Step-by-step instructions +- Common patterns +- Troubleshooting +- Best practices +""" + +EXAMPLE_ASSET = """# Example Asset File + +This placeholder represents where asset files would be stored. +Replace with actual asset files (templates, images, fonts, etc.) or delete if not needed. + +Asset files are NOT intended to be loaded into context, but rather used within +the output Claude produces. + +Example asset files from other skills: +- Brand guidelines: logo.png, slides_template.pptx +- Frontend builder: hello-world/ directory with HTML/React boilerplate +- Typography: custom-font.ttf, font-family.woff2 +- Data: sample_data.csv, test_dataset.json + +## Common Asset Types + +- Templates: .pptx, .docx, boilerplate directories +- Images: .png, .jpg, .svg, .gif +- Fonts: .ttf, .otf, .woff, .woff2 +- Boilerplate code: Project directories, starter files +- Icons: .ico, .svg +- Data files: .csv, .json, .xml, .yaml + +Note: This is a text placeholder. Actual assets can be any file type. 
+""" + + +def title_case_skill_name(skill_name): + """Convert hyphenated skill name to Title Case for display.""" + return ' '.join(word.capitalize() for word in skill_name.split('-')) + + +def init_skill(skill_name, path): + """ + Initialize a new skill directory with template SKILL.md. + + Args: + skill_name: Name of the skill + path: Path where the skill directory should be created + + Returns: + Path to created skill directory, or None if error + """ + # Determine skill directory path + skill_dir = Path(path).resolve() / skill_name + + # Check if directory already exists + if skill_dir.exists(): + print(f"❌ Error: Skill directory already exists: {skill_dir}") + return None + + # Create skill directory + try: + skill_dir.mkdir(parents=True, exist_ok=False) + print(f"✅ Created skill directory: {skill_dir}") + except Exception as e: + print(f"❌ Error creating directory: {e}") + return None + + # Create SKILL.md from template + skill_title = title_case_skill_name(skill_name) + skill_content = SKILL_TEMPLATE.format( + skill_name=skill_name, + skill_title=skill_title + ) + + skill_md_path = skill_dir / 'SKILL.md' + try: + skill_md_path.write_text(skill_content) + print("✅ Created SKILL.md") + except Exception as e: + print(f"❌ Error creating SKILL.md: {e}") + return None + + # Create resource directories with example files + try: + # Create scripts/ directory with example script + scripts_dir = skill_dir / 'scripts' + scripts_dir.mkdir(exist_ok=True) + example_script = scripts_dir / 'example.py' + example_script.write_text(EXAMPLE_SCRIPT.format(skill_name=skill_name)) + example_script.chmod(0o755) + print("✅ Created scripts/example.py") + + # Create references/ directory with example reference doc + references_dir = skill_dir / 'references' + references_dir.mkdir(exist_ok=True) + example_reference = references_dir / 'api_reference.md' + example_reference.write_text(EXAMPLE_REFERENCE.format(skill_title=skill_title)) + print("✅ Created references/api_reference.md") + + 
# Create assets/ directory with example asset placeholder + assets_dir = skill_dir / 'assets' + assets_dir.mkdir(exist_ok=True) + example_asset = assets_dir / 'example_asset.txt' + example_asset.write_text(EXAMPLE_ASSET) + print("✅ Created assets/example_asset.txt") + except Exception as e: + print(f"❌ Error creating resource directories: {e}") + return None + + # Print next steps + print(f"\n✅ Skill '{skill_name}' initialized successfully at {skill_dir}") + print("\nNext steps:") + print("1. Edit SKILL.md to complete the TODO items and update the description") + print("2. Customize or delete the example files in scripts/, references/, and assets/") + print("3. Run the validator when ready to check the skill structure") + + return skill_dir + + +def main(): + if len(sys.argv) < 4 or sys.argv[2] != '--path': + print("Usage: init_skill.py <skill-name> --path <path>") + print("\nSkill name requirements:") + print(" - Hyphen-case identifier (e.g., 'data-analyzer')") + print(" - Lowercase letters, digits, and hyphens only") + print(" - Max 64 characters") + print(" - Must match directory name exactly") + print("\nExamples:") + print(" init_skill.py my-new-skill --path skills/public") + print(" init_skill.py my-api-helper --path skills/private") + print(" init_skill.py custom-skill --path /custom/location") + sys.exit(1) + + skill_name = sys.argv[1] + path = sys.argv[3] + + print(f"🚀 Initializing skill: {skill_name}") + print(f" Location: {path}") + print() + + result = init_skill(skill_name, path) + + if result: + sys.exit(0) + else: + sys.exit(1) + + +if __name__ == "__main__": + main() diff --git a/.github/skills/skill-creator/scripts/package_skill.py b/.github/skills/skill-creator/scripts/package_skill.py new file mode 100644 index 00000000..5cd36cb1 --- /dev/null +++ b/.github/skills/skill-creator/scripts/package_skill.py @@ -0,0 +1,110 @@ +#!/usr/bin/env python3 +""" +Skill Packager - Creates a distributable .skill file of a skill folder + +Usage: + python scripts/package_skill.py
<skill-path> [output-directory] + +Example: + python scripts/package_skill.py skills/public/my-skill + python scripts/package_skill.py skills/public/my-skill ./dist +""" + +import sys +import zipfile +from pathlib import Path +from quick_validate import validate_skill + + +def package_skill(skill_path, output_dir=None): + """ + Package a skill folder into a .skill file. + + Args: + skill_path: Path to the skill folder + output_dir: Optional output directory for the .skill file (defaults to current directory) + + Returns: + Path to the created .skill file, or None if error + """ + skill_path = Path(skill_path).resolve() + + # Validate skill folder exists + if not skill_path.exists(): + print(f"❌ Error: Skill folder not found: {skill_path}") + return None + + if not skill_path.is_dir(): + print(f"❌ Error: Path is not a directory: {skill_path}") + return None + + # Validate SKILL.md exists + skill_md = skill_path / "SKILL.md" + if not skill_md.exists(): + print(f"❌ Error: SKILL.md not found in {skill_path}") + return None + + # Run validation before packaging + print("🔍 Validating skill...") + valid, message = validate_skill(skill_path) + if not valid: + print(f"❌ Validation failed: {message}") + print(" Please fix the validation errors before packaging.") + return None + print(f"✅ {message}\n") + + # Determine output location + skill_name = skill_path.name + if output_dir: + output_path = Path(output_dir).resolve() + output_path.mkdir(parents=True, exist_ok=True) + else: + output_path = Path.cwd() + + skill_filename = output_path / f"{skill_name}.skill" + + # Create the .skill file (zip format) + try: + with zipfile.ZipFile(skill_filename, 'w', zipfile.ZIP_DEFLATED) as zipf: + # Walk through the skill directory + for file_path in skill_path.rglob('*'): + if file_path.is_file(): + # Calculate the relative path within the zip + arcname = file_path.relative_to(skill_path.parent) + zipf.write(file_path, arcname) + print(f" Added: {arcname}") + + print(f"\n✅ Successfully packaged skill
to: {skill_filename}") + return skill_filename + + except Exception as e: + print(f"❌ Error creating .skill file: {e}") + return None + + +def main(): + if len(sys.argv) < 2: + print("Usage: python utils/package_skill.py [output-directory]") + print("\nExample:") + print(" python utils/package_skill.py skills/public/my-skill") + print(" python utils/package_skill.py skills/public/my-skill ./dist") + sys.exit(1) + + skill_path = sys.argv[1] + output_dir = sys.argv[2] if len(sys.argv) > 2 else None + + print(f"📦 Packaging skill: {skill_path}") + if output_dir: + print(f" Output directory: {output_dir}") + print() + + result = package_skill(skill_path, output_dir) + + if result: + sys.exit(0) + else: + sys.exit(1) + + +if __name__ == "__main__": + main() diff --git a/.github/skills/skill-creator/scripts/quick_validate.py b/.github/skills/skill-creator/scripts/quick_validate.py new file mode 100644 index 00000000..d9fbeb75 --- /dev/null +++ b/.github/skills/skill-creator/scripts/quick_validate.py @@ -0,0 +1,95 @@ +#!/usr/bin/env python3 +""" +Quick validation script for skills - minimal version +""" + +import sys +import os +import re +import yaml +from pathlib import Path + +def validate_skill(skill_path): + """Basic validation of a skill""" + skill_path = Path(skill_path) + + # Check SKILL.md exists + skill_md = skill_path / 'SKILL.md' + if not skill_md.exists(): + return False, "SKILL.md not found" + + # Read and validate frontmatter + content = skill_md.read_text() + if not content.startswith('---'): + return False, "No YAML frontmatter found" + + # Extract frontmatter + match = re.match(r'^---\n(.*?)\n---', content, re.DOTALL) + if not match: + return False, "Invalid frontmatter format" + + frontmatter_text = match.group(1) + + # Parse YAML frontmatter + try: + frontmatter = yaml.safe_load(frontmatter_text) + if not isinstance(frontmatter, dict): + return False, "Frontmatter must be a YAML dictionary" + except yaml.YAMLError as e: + return False, f"Invalid YAML in 
frontmatter: {e}" + + # Define allowed properties + ALLOWED_PROPERTIES = {'name', 'description', 'license', 'allowed-tools', 'metadata'} + + # Check for unexpected properties (excluding nested keys under metadata) + unexpected_keys = set(frontmatter.keys()) - ALLOWED_PROPERTIES + if unexpected_keys: + return False, ( + f"Unexpected key(s) in SKILL.md frontmatter: {', '.join(sorted(unexpected_keys))}. " + f"Allowed properties are: {', '.join(sorted(ALLOWED_PROPERTIES))}" + ) + + # Check required fields + if 'name' not in frontmatter: + return False, "Missing 'name' in frontmatter" + if 'description' not in frontmatter: + return False, "Missing 'description' in frontmatter" + + # Extract name for validation + name = frontmatter.get('name', '') + if not isinstance(name, str): + return False, f"Name must be a string, got {type(name).__name__}" + name = name.strip() + if name: + # Check naming convention (hyphen-case: lowercase with hyphens) + if not re.match(r'^[a-z0-9-]+$', name): + return False, f"Name '{name}' should be hyphen-case (lowercase letters, digits, and hyphens only)" + if name.startswith('-') or name.endswith('-') or '--' in name: + return False, f"Name '{name}' cannot start/end with hyphen or contain consecutive hyphens" + # Check name length (max 64 characters per spec) + if len(name) > 64: + return False, f"Name is too long ({len(name)} characters). Maximum is 64 characters." + + # Extract and validate description + description = frontmatter.get('description', '') + if not isinstance(description, str): + return False, f"Description must be a string, got {type(description).__name__}" + description = description.strip() + if description: + # Check for angle brackets + if '<' in description or '>' in description: + return False, "Description cannot contain angle brackets (< or >)" + # Check description length (max 1024 characters per spec) + if len(description) > 1024: + return False, f"Description is too long ({len(description)} characters). 
Maximum is 1024 characters." + + return True, "Skill is valid!" + +if __name__ == "__main__": + if len(sys.argv) != 2: + print("Usage: python quick_validate.py <skill-path>") + sys.exit(1) + + valid, message = validate_skill(sys.argv[1]) + print(message) + sys.exit(0 if valid else 1) \ No newline at end of file
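For reviewers who want to sanity-check the frontmatter handling added in `quick_validate.py` above, the extraction logic can be exercised standalone. This is a minimal sketch, not part of the diff; it assumes PyYAML is installed (the same dependency `quick_validate.py` imports), and `parse_frontmatter` is an illustrative helper name, not a function in the patch:

```python
import re
import yaml  # PyYAML, as imported by quick_validate.py


def parse_frontmatter(content):
    """Return the YAML frontmatter of a SKILL.md body as a dict, or None.

    Mirrors quick_validate.py: the file must start with '---', the block is
    captured with the same regex, and non-dict YAML is rejected.
    """
    if not content.startswith('---'):
        return None
    match = re.match(r'^---\n(.*?)\n---', content, re.DOTALL)
    if not match:
        return None
    data = yaml.safe_load(match.group(1))
    return data if isinstance(data, dict) else None


sample = """---
name: my-skill
description: Example skill used to illustrate frontmatter parsing.
---

# My Skill
"""

print(parse_frontmatter(sample)["name"])  # my-skill
```

Note the regex requires a literal `\n---` terminator, so a file whose frontmatter fence is never closed parses as `None` rather than raising, which is why the validator can return a clean error message instead of a traceback.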