Commit 4ce4e8a: Incident reviews 20251119 (#25)
1 parent aefe260

docs/releases/status.md (63 additions, 2 deletions)
Lambda Feedback is a cloud-native application that is available with full service.

This page contains information about known incidents where service was interrupted. The page began in November 2024 following a significant incident. Its purpose is to be informative and transparent, and to ensure that lessons are learned so that service improves over time.
The Severity of an incident is the product of:

- number of users affected (N; for 100 users, N = 1),
- magnitude of the effect (scale 1-5, from workable to no service),
- duration (in hours).

Severity thresholds:

- x < 1 is LOW
- 1 < x < 100 is SIGNIFICANT
- x > 100 is HIGH

The severity is used to decide how much we invest in preventative measures, detection, mitigation plans, and rehearsals.
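As a worked illustration, the formula can be sketched in Python. The function names are ours, not part of the Lambda Feedback codebase:

```python
def severity(users_affected: int, effect: int, duration_hours: float) -> float:
    """Severity = N * effect * duration, where N = users_affected / 100."""
    n = users_affected / 100
    return n * effect * duration_hours

def classify(score: float) -> str:
    # Thresholds from the text above; the incident entries on this page
    # label a score of exactly 1 as LOW, so the boundary is included here.
    if score <= 1:
        return "LOW"
    if score < 100:
        return "SIGNIFICANT"
    return "HIGH"
```

For example, the November 10th incident below (N = 3, effect = 5, duration = 0.15 h) scores 3 × 5 × 0.15 = 2.25, i.e. SIGNIFICANT.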
## 2025 November 18th: Some evaluation functions failing (Severity: LOW)

Some evaluation functions returned errors.

### Timeline (UK / GMT)

The application was otherwise fully available during this time period.

2025/11/18 21:18 GMT: some but not all evaluation functions (external microservices) failed. Investigation initiated and a message added to the home page.

2025/11/18 21:39 GMT: home page updated to inform users that the cause had been identified.

2025/11/18 21:45 GMT: issue resolved. Home page updated.
### Analysis

Some of our evaluation functions still use an old version of our baselayer, which calls GitHub to retrieve a schema and validate inputs. GitHub's git services were down (https://www.githubstatus.com/incidents/5q7nmlxz30sk), which meant that those of our functions that call GitHub could not validate their schemas and therefore failed. Other evaluation functions had previously been updated to remove the need to call GitHub and were therefore not affected by the issue.

The same root cause meant that we could not push code updates during the incident, because code is deployed via GitHub. GitHub had announced they were resolving the issue, and when it was resolved our services returned to normal.

### Recommended action

Update all evaluation function baselayers to remove the dependency on external calls when validating.
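One way to remove the external call is to ship the schema inside the deployment artifact and validate locally at request time. A minimal sketch, with hypothetical names (`load_bundled_schema`, `validate_params`) standing in for the actual baselayer API:

```python
import json
from pathlib import Path

def load_bundled_schema(path: Path) -> dict:
    # The schema ships inside the deployment artifact, so no network call
    # (and no GitHub dependency) is needed when handling a request.
    return json.loads(path.read_text())

def validate_params(params: dict, schema: dict) -> list:
    # Minimal required-key check standing in for full JSON Schema validation.
    errors = []
    for key in schema.get("required", []):
        if key not in params:
            errors.append("missing required field: " + key)
    return errors
```

A function that validates against a bundled schema degrades gracefully: a GitHub outage can still block deployments, but running functions keep serving requests.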
N = 1, effect = 2, duration = 0.5. Severity = 1 (LOW)
## 2025 November 10th: Service unresponsive (Severity: SIGNIFICANT)

The application was unresponsive.

### Timeline (UK / GMT)

2025/11/10 14:21 Service became unresponsive, e.g. pages not loading. Reports from users through various channels. Developers began investigating; a message was sent to Teachers.

2025/11/10 14:28 Service returned to normal. Home page message displayed to inform users.
### Analysis

During the period of unresponsiveness, the key symptom within the system was CPU overload on the servers. Error logging and alerts successfully detected the downtime and alerted the developer team, who responded. Although developers were investigating and tried to increase resources to resolve the problem, the autoscaling in fact resolved it on its own.

The underlying cause was high usage leading to CPU overload. This type of scenario is normal and correctly triggered autoscaling. The issue in this case was that autoscaling should happen seamlessly, without service interruptions in the intervening period.
### Action taken

- Decrease the CPU and memory usage level at which scaling is triggered. This increases overall costs but decreases the chance of service interruptions.
- Enhance system logs so that more information is available if a similar event occurs.
- Investigate CPU and memory usage to identify opportunities for improvement (outcome: usage is typical for Node.js applications; no further action).
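The first action can be illustrated with a toy model (the threshold values and CPU samples below are invented for illustration; this is not our actual scaling configuration). A lower trigger threshold starts scale-out earlier in a usage spike, leaving headroom before the servers saturate:

```python
def first_trigger(cpu_samples, threshold_percent):
    # Index of the first sample at which scale-out would be triggered,
    # or None if the threshold is never reached.
    for i, cpu in enumerate(cpu_samples):
        if cpu >= threshold_percent:
            return i
    return None

# CPU climbing toward saturation during a usage spike (invented samples).
spike = [40, 55, 70, 85, 95]
```

With a threshold of 80% the first trigger comes at index 3; lowering it to 60% triggers at index 2, one sampling interval earlier, at the cost of running more capacity on average.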
N = 3, effect = 5, duration = 0.15. Severity = 2.25 (SIGNIFICANT)

## 2025 October 17th: Handwriting input temporarily unavailable (Severity: SIGNIFICANT)

Handwriting in response areas (but not in the canvas) did not return a preview and could not be submitted. Users received an error in a toast saying that the service would not work. All other services remained operational.

### Timeline (UK / BST)

2025/10/17 08:24 Handwriting inputs ceased to return previews to the user due to a deployed code change that removed redundant code, but also, as it transpired, code that was still required.