Commit 0c4c5f5

Analysis and best practices for whitelisting system logs (#2014)
Signed-off-by: Waleed Malik <ahmedwaleedmalik@gmail.com>
1 parent 7bcdfbe commit 0c4c5f5

2 files changed

+269
-0
lines changed

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
1+
+++
2+
title = "Personally Identifiable Information Analysis: Kubernetes and KubeOne System Logs"
3+
date = 2024-03-06T12:00:00+02:00
4+
weight = 10
5+
+++
6+
7+
This document provides a comprehensive analysis of potential Personally Identifiable Information (PII) and personal data (indirect identifiers) that may be present in system logs from Kubernetes clusters deployed using KubeOne.
8+
9+
**Target Audience**: Platform operators, security teams, compliance officers
10+
11+
**Prerequisites**: Basic understanding of Kubernetes and KubeOne
12+
13+
KubeOne tries to avoid logging PII, but in some cases it is unavoidable and outside the control of the platform operator, for example when the data is logged by a component that KubeOne ships or by an underlying Kubernetes component.
14+
15+
## PII Categories (GDPR-Aligned)
16+
17+
System logs from Kubernetes clusters may contain the following types of PII:
18+
19+
### Direct Identifiers
20+
21+
* **Usernames**: Kubernetes usernames, system usernames, service account names
22+
* **Email addresses**: From TLS certificate subjects (CN, O, OU), OIDC claims, audit logs, or user labels
23+
* **IP addresses**: Client IPs
24+
25+
### Indirect Identifiers
26+
27+
* **Resource names**: Pod names, namespace names, deployment names containing user/org identifiers
28+
* Example: `webapp-john-deployment`, `john-doe-dev` namespace
29+
* **Hostnames**: Node hostnames with user or organizational patterns
30+
* Example: `worker-john-prod-01.company.com`
31+
* **Labels and annotations**: Custom metadata that may include user data
32+
* Example: `owner=john.doe@company.com`
33+
* **Volume paths**: Mount paths revealing directory structures with usernames
34+
* Example: `/home/john/data:/data`
35+
36+
### Cloud Provider Identifiers
37+
38+
* **Account IDs**: AWS account IDs, Azure subscription IDs, GCP project IDs
39+
* **Resource IDs**: Instance IDs, VPC IDs, volume IDs, subnet IDs, security group IDs
40+
* **DNS names**: Load balancer DNS, instance DNS names
41+
* **Geographic data**: Availability zones, regions
42+
43+
### Operational Data That May Reveal Personal Data
44+
45+
* **DNS queries**: Service/pod names in DNS lookups
46+
* **HTTP/gRPC metadata**: URLs, headers, cookies (if Layer 7 visibility enabled in CNI)
47+
* **Error messages**: Often contain detailed context with resource IDs and user identifiers
48+
* **Audit logs**: Comprehensive request/response data including full user context
49+
50+
## Risk Assessment Matrix
51+
52+
| Component | User Identity | IP Addresses | Credentials | Cloud IDs | Risk Level |
|-----------|---------------|--------------|-------------|-----------|------------|
| kube-apiserver | ✅ High | ✅ High | ✅ High | ❌ No | 🔴 **HIGH** |
| kubelet | ⚠️ Medium | ✅ High | ✅ High | ❌ No | 🔴 **HIGH** |
| etcd | ✅ High | ⚠️ Medium | ✅ High | ❌ No | 🔴 **HIGH** |
| Cloud Controller Managers | ❌ No | ✅ High | ✅ High | ✅ High | 🔴 **HIGH** |
| CSI Drivers | ❌ No | ⚠️ Medium | ✅ High | ✅ High | 🔴 **HIGH** |
| Secrets Store CSI | ❌ No | ❌ No | ✅ High | ⚠️ Low | 🔴 **HIGH** |
| Cilium | ⚠️ Medium | ✅ High | ❌ No | ❌ No | 🟡 **MEDIUM-HIGH** |
| kube-controller-manager | ⚠️ Low | ⚠️ Medium | ⚠️ Medium | ⚠️ Medium | 🟡 **MEDIUM** |
| kube-scheduler | ⚠️ Low | ❌ No | ❌ No | ❌ No | 🟡 **MEDIUM** |
| kube-proxy | ❌ No | ✅ High | ❌ No | ❌ No | 🟡 **MEDIUM** |
| CoreDNS | ⚠️ Low | ⚠️ Medium | ❌ No | ❌ No | 🟡 **MEDIUM** |
| Canal | ❌ No | ✅ High | ❌ No | ❌ No | 🟡 **MEDIUM** |
| WeaveNet | ❌ No | ✅ High | ⚠️ Low | ❌ No | 🟡 **MEDIUM** |
| cluster-autoscaler | ⚠️ Low | ⚠️ Low | ⚠️ Low | ✅ High | 🟡 **MEDIUM** |
| NodeLocalDNS | ⚠️ Low | ⚠️ Medium | ❌ No | ❌ No | 🟡 **MEDIUM** |
| metrics-server | ⚠️ Low | ❌ No | ❌ No | ❌ No | 🟢 **LOW-MEDIUM** |
| machine-controller | ⚠️ Low | ❌ No | ⚠️ Low | ✅ High | 🟢 **LOW** |
| operating-system-manager | ⚠️ Low | ❌ No | ❌ No | ⚠️ Low | 🟢 **LOW** |
72+
73+
**Legend**:
74+
75+
* ✅ High: Frequent and detailed PII exposure
76+
* ⚠️ Medium / Low: Moderate or limited PII exposure
77+
* ❌ No: Minimal or no PII exposure
78+
79+
### Understanding Risk Context
80+
81+
The risk matrix gives a helpful overview of potential PII exposure, but a component's rating does not always reflect its actual exposure in a given deployment. For example, a component rated low-risk can still expose sensitive data when it logs content that originates from a high-risk component.
82+
83+
One example is a component that logs a full Kubernetes resource when validation fails. The log message may not refer to any field that holds personal data, but because the whole resource is written out, any PII it contains ends up in the logs. Always review and sanitize logs before sharing them anywhere.
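
To illustrate the case described here, the following is a minimal sketch of masking user-supplied metadata in a logged resource before the log line leaves the cluster. It assumes the resource was logged as a single JSON object; the function name `scrub_resource` and the list of fields to mask are hypothetical and would need to be adapted to the components in use.

```python
import json

# Metadata fields that commonly carry user-supplied values (assumed list, extend as needed).
SENSITIVE_METADATA_FIELDS = ("labels", "annotations")


def scrub_resource(raw: str) -> str:
    """Return the resource JSON with label and annotation values masked."""
    resource = json.loads(raw)
    metadata = resource.get("metadata") or {}
    for field in SENSITIVE_METADATA_FIELDS:
        values = metadata.get(field)
        if isinstance(values, dict):
            # Keep the keys so the structure stays debuggable, mask only the values.
            metadata[field] = {key: "[REDACTED]" for key in values}
    return json.dumps(resource, sort_keys=True)


logged = '{"kind": "Pod", "metadata": {"name": "web", "labels": {"owner": "john.doe@company.com"}}}'
print(scrub_resource(logged))
```

The sketch leaves resource names untouched, so names that embed user or organization identifiers still need separate handling.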
84+
85+
## Log Filtering and Sanitization
86+
87+
### Automated PII Filtering
88+
89+
Implement automated filtering in your log aggregation pipeline to remove PII and personal data from the logs.
90+
91+
#### Use external tools for PII Redaction
92+
93+
* [Presidio](https://microsoft.github.io/presidio/) - An open-source toolkit from Microsoft for detecting and anonymizing PII in text
94+
* [Microsoft Purview](https://learn.microsoft.com/en-us/purview/information-protection) (formerly Azure Purview) - A cloud-based data governance service that helps you manage and protect your sensitive data
95+
96+
### Manual PII Filtering - Common patterns to filter
97+
98+
```regex
# Email addresses
[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}

# IPv4 addresses
\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b

# Basic Auth in URLs (classes exclude whitespace so a match cannot run past the URL)
https?://[^\s:/@]+:[^\s@/]+@
```
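
As a rough illustration of how these patterns could be applied, the sketch below runs them over a single log line with Python's `re` module. The function name `redact_line` and the placeholder tokens are hypothetical. The URL pattern used here excludes whitespace from the user and password parts, which is an assumption made to keep a match from spanning unrelated text.

```python
import re

# (pattern, replacement) pairs. Order matters: URL credentials are removed first
# so the email pattern never sees "user:password@host".
PATTERNS = [
    (re.compile(r"(https?://)[^\s:/@]+:[^\s@/]+@"), r"\1[REDACTED_CREDENTIALS]@"),
    (re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"), "[REDACTED_EMAIL]"),
    (re.compile(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b"), "[REDACTED_IP]"),
]


def redact_line(line: str) -> str:
    """Apply every pattern in turn and return the redacted line."""
    for pattern, replacement in PATTERNS:
        line = pattern.sub(replacement, line)
    return line


print(redact_line("user john.doe@company.com from 10.0.0.12"))
print(redact_line("pull https://admin:s3cret@registry.example.com/v2/"))
```

The IPv4 pattern also matches dotted strings that are not valid addresses, such as four-part version numbers. For redaction this over-matching is usually acceptable, but it can hide useful debugging detail.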
108+
109+
## Best Practices
110+
111+
### Before sharing logs with Kubermatic Support
112+
113+
1. Identify the time range needed (minimize data exposure)
114+
2. Export only relevant namespaces/components
115+
3. Run a PII redaction tool or script
116+
4. Manually review the first 100 lines to verify redaction
117+
5. Obtain approval from the data protection officer (if required)
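
The review of the first 100 lines can be supported by a small check that flags anything the redaction step missed. The following is a sketch only: the function name `find_residual_pii` is hypothetical, it looks for just two kinds of identifier, and it supplements a human review rather than replacing it.

```python
import re
from itertools import islice
from typing import Iterable

# Patterns to look for in supposedly redacted output (assumed subset).
RESIDUAL_PATTERNS = {
    "email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "ipv4": re.compile(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b"),
}


def find_residual_pii(lines: Iterable[str], limit: int = 100) -> list[tuple[int, str, str]]:
    """Return (line number, kind, matched text) for leftovers in the first `limit` lines."""
    findings = []
    for number, line in enumerate(islice(lines, limit), start=1):
        for kind, pattern in RESIDUAL_PATTERNS.items():
            for match in pattern.finditer(line):
                findings.append((number, kind, match.group(0)))
    return findings


for finding in find_residual_pii(["request ok", "user a@b.io at 10.1.2.3"]):
    print(finding)
```

An empty result only means that none of the listed patterns matched. Context-specific identifiers such as resource names still need a manual look.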
118+
119+
## Conclusion
120+
121+
### Key Points
122+
123+
1. Kubernetes logs contain significant PII, especially from kube-apiserver, kubelet, etcd, and all cloud provider components
124+
2. Higher log verbosity (for example `--v=4` or `--v=5`) significantly increases PII exposure
125+
3. Cloud provider account identifiers are prevalent in Cloud Controller Managers (CCMs) and CSI drivers
126+
4. Automated filtering tools are essential for safe log sharing at scale
127+
5. Manual review is still necessary to catch context-specific PII
128+
129+
### Best Practice for Support
130+
131+
## Additional Resources
132+
133+
### GDPR and Privacy
134+
135+
* [GDPR Official Text](https://gdpr-info.eu/)
136+
* [Article 29 Working Party Opinion on Personal Data](https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/index_en.htm)
Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,133 @@
1+
+++
2+
title = "Personally Identifiable Information Analysis: Kubernetes and KKP System Logs"
3+
date = 2024-03-06T12:00:00+02:00
4+
weight = 10
5+
+++
6+
7+
This document provides a comprehensive analysis of potential Personally Identifiable Information (PII) and personal data (indirect identifiers) that may be present in system logs from Kubernetes clusters deployed using Kubermatic Kubernetes Platform (KKP).
8+
9+
**Target Audience**: Platform operators, security teams, compliance officers
10+
11+
**Prerequisites**: Basic understanding of Kubernetes and KKP
12+
13+
KKP tries to avoid logging PII, but in some cases it is unavoidable and outside the control of the platform operator, for example when the data is logged by a component that KKP ships or by an underlying Kubernetes component.
14+
15+
## PII Categories (GDPR-Aligned)
16+
17+
System logs from Kubernetes clusters may contain the following types of PII:
18+
19+
### Direct Identifiers
20+
21+
* **Usernames**: Kubernetes usernames, system usernames, service account names
22+
* **Email addresses**: From TLS certificate subjects (CN, O, OU), OIDC claims, audit logs, or user labels
23+
* **IP addresses**: Client IPs
24+
25+
### Indirect Identifiers
26+
27+
* **Resource names**: Pod names, namespace names, deployment names containing user/org identifiers
28+
* Example: `webapp-john-deployment`, `john-doe-dev` namespace
29+
* **Hostnames**: Node hostnames with user or organizational patterns
30+
* Example: `worker-john-prod-01.company.com`
31+
* **Labels and annotations**: Custom metadata that may include user data
32+
* Example: `owner=john.doe@company.com`
33+
* **Volume paths**: Mount paths revealing directory structures with usernames
34+
* Example: `/home/john/data:/data`
35+
36+
### Cloud Provider Identifiers
37+
38+
* **Account IDs**: AWS account IDs, Azure subscription IDs, GCP project IDs
39+
* **Resource IDs**: Instance IDs, VPC IDs, volume IDs, subnet IDs, security group IDs
40+
* **DNS names**: Load balancer DNS, instance DNS names
41+
* **Geographic data**: Availability zones, regions
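
As an illustration, identifiers of this kind can often be masked with a few patterns. The sketch below is not exhaustive and rests on assumptions about how the identifiers appear in logs: AWS resource IDs as a known prefix followed by hexadecimal characters, Azure subscription IDs as a GUID after `/subscriptions/`, and GCP project IDs after `projects/`. The function name `mask_cloud_ids` and the placeholder tokens are hypothetical.

```python
import re

# (pattern, replacement) pairs; the formats are assumptions about typical log output.
CLOUD_ID_PATTERNS = [
    # AWS: instance, VPC, volume, subnet and security group IDs.
    (re.compile(r"\b(?:i|vpc|vol|subnet|sg)-[0-9a-f]{8,17}\b"), "[AWS_RESOURCE_ID]"),
    # Azure: subscription GUID inside a resource path.
    (re.compile(r"(/subscriptions/)[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"),
     r"\1[AZURE_SUBSCRIPTION_ID]"),
    # GCP: project ID inside a resource path.
    (re.compile(r"(\bprojects/)[a-z][a-z0-9-]{4,28}[a-z0-9]"), r"\1[GCP_PROJECT_ID]"),
]


def mask_cloud_ids(line: str) -> str:
    """Replace cloud provider identifiers with placeholder tokens."""
    for pattern, replacement in CLOUD_ID_PATTERNS:
        line = pattern.sub(replacement, line)
    return line


print(mask_cloud_ids("attach vol-0123456789abcdef0 to i-0abc1234def567890"))
print(mask_cloud_ids("projects/my-project-123/zones/europe-west3-a"))
```

Regions and availability zones are left in place here because they are frequently needed for troubleshooting. Whether to mask them is a policy decision.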
42+
43+
### Operational Data That May Reveal Personal Data
44+
45+
* **DNS queries**: Service/pod names in DNS lookups
46+
* **HTTP/gRPC metadata**: URLs, headers, cookies (if Layer 7 visibility enabled in CNI)
47+
* **Error messages**: Often contain detailed context with resource IDs and user identifiers
48+
* **Audit logs**: Comprehensive request/response data including full user context
49+
50+
## Risk Assessment Matrix
51+
52+
| Component | User Identity | IP Addresses | Credentials | Cloud IDs | Risk Level |
|-----------|---------------|--------------|-------------|-----------|------------|
| kube-apiserver | ✅ High | ✅ High | ✅ High | ❌ No | 🔴 **HIGH** |
| kubelet | ⚠️ Medium | ✅ High | ✅ High | ❌ No | 🔴 **HIGH** |
| etcd | ✅ High | ⚠️ Medium | ✅ High | ❌ No | 🔴 **HIGH** |
| Cloud Controller Managers | ❌ No | ✅ High | ✅ High | ✅ High | 🔴 **HIGH** |
| CSI Drivers | ❌ No | ⚠️ Medium | ✅ High | ✅ High | 🔴 **HIGH** |
| Cilium | ⚠️ Medium | ✅ High | ❌ No | ❌ No | 🟡 **MEDIUM-HIGH** |
| kube-controller-manager | ⚠️ Low | ⚠️ Medium | ⚠️ Medium | ⚠️ Medium | 🟡 **MEDIUM** |
| kube-scheduler | ⚠️ Low | ❌ No | ❌ No | ❌ No | 🟡 **MEDIUM** |
| kube-proxy | ❌ No | ✅ High | ❌ No | ❌ No | 🟡 **MEDIUM** |
| CoreDNS | ⚠️ Low | ⚠️ Medium | ❌ No | ❌ No | 🟡 **MEDIUM** |
| cluster-autoscaler | ⚠️ Low | ⚠️ Low | ⚠️ Low | ✅ High | 🟡 **MEDIUM** |
| NodeLocalDNS | ⚠️ Low | ⚠️ Medium | ❌ No | ❌ No | 🟡 **MEDIUM** |
| metrics-server | ⚠️ Low | ❌ No | ❌ No | ❌ No | 🟢 **LOW-MEDIUM** |
| machine-controller | ⚠️ Low | ❌ No | ⚠️ Low | ✅ High | 🟢 **LOW** |
| operating-system-manager | ⚠️ Low | ❌ No | ❌ No | ⚠️ Low | 🟢 **LOW** |
69+
70+
**Legend**:
71+
72+
* ✅ High: Frequent and detailed PII exposure
73+
* ⚠️ Medium / Low: Moderate or limited PII exposure
74+
* ❌ No: Minimal or no PII exposure
75+
76+
### Understanding Risk Context
77+
78+
The risk matrix gives a helpful overview of potential PII exposure, but a component's rating does not always reflect its actual exposure in a given deployment. For example, a component rated low-risk can still expose sensitive data when it logs content that originates from a high-risk component.
79+
80+
One example is a component that logs a full Kubernetes resource when validation fails. The log message may not refer to any field that holds personal data, but because the whole resource is written out, any PII it contains ends up in the logs. Always review and sanitize logs before sharing them anywhere.
81+
82+
## Log Filtering and Sanitization
83+
84+
### Automated PII Filtering
85+
86+
Implement automated filtering in your log aggregation pipeline to remove PII and personal data from the logs.
87+
88+
#### Use external tools for PII Redaction
89+
90+
* [Presidio](https://microsoft.github.io/presidio/) - An open-source toolkit from Microsoft for detecting and anonymizing PII in text
91+
* [Microsoft Purview](https://learn.microsoft.com/en-us/purview/information-protection) (formerly Azure Purview) - A cloud-based data governance service that helps you manage and protect your sensitive data
92+
93+
### Manual PII Filtering - Common patterns to filter
94+
95+
```regex
# Email addresses
[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}

# IPv4 addresses
\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b

# Basic Auth in URLs (classes exclude whitespace so a match cannot run past the URL)
https?://[^\s:/@]+:[^\s@/]+@
```
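
Plain masking removes the ability to tell whether two log lines refer to the same user, which support staff often need. One possible alternative, sketched below, is to replace each match with a keyed hash so that identical values map to identical tokens. This is an illustration using the email pattern only; the function name `pseudonymize_emails`, the token format and the key handling are all assumptions.

```python
import hashlib
import hmac
import re

EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def pseudonymize_emails(line: str, key: bytes) -> str:
    """Replace each email address with a stable token derived from a secret key."""

    def token(match: re.Match) -> str:
        # Lower-case first so differently cased spellings map to the same token.
        value = match.group(0).lower().encode("utf-8")
        digest = hmac.new(key, value, hashlib.sha256).hexdigest()
        return f"user-{digest[:12]}"

    return EMAIL.sub(token, line)


key = b"example-key-keep-this-out-of-the-shared-logs"
print(pseudonymize_emails("login by john.doe@company.com", key))
print(pseudonymize_emails("denied for John.Doe@company.com", key))
```

Pseudonymized values are still personal data under GDPR for as long as the key exists, so this approach reduces exposure but does not anonymize the logs.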
105+
106+
## Best Practices
107+
108+
### Before sharing logs with Kubermatic Support
109+
110+
1. Identify the time range needed (minimize data exposure)
111+
2. Export only relevant namespaces/components
112+
3. Run a PII redaction tool or script
113+
4. Manually review the first 100 lines to verify redaction
114+
5. Obtain approval from the data protection officer (if required)
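
The first step, limiting the export to the relevant time range, can be sketched as a small filter. It assumes every log line starts with an RFC 3339 timestamp in UTC, as produced for example by `kubectl logs --timestamps`. The function name `within_time_range` is hypothetical.

```python
from datetime import datetime
from typing import Iterable, Iterator


def within_time_range(lines: Iterable[str], start: datetime, end: datetime) -> Iterator[str]:
    """Yield only lines whose leading timestamp falls inside [start, end]."""
    for line in lines:
        try:
            # Only second precision is parsed; fractional seconds and the zone suffix are ignored.
            stamp = datetime.strptime(line[:19], "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            # Lines without a parsable timestamp are dropped to keep the export minimal.
            continue
        if start <= stamp <= end:
            yield line


sample = [
    "2024-03-06T11:59:58.120000000Z before the incident",
    "2024-03-06T12:00:05.000000000Z validation failed",
    "continuation line without timestamp",
    "2024-03-06T12:10:00.000000000Z after the incident",
]
window = list(within_time_range(sample, datetime(2024, 3, 6, 12, 0, 0), datetime(2024, 3, 6, 12, 5, 0)))
print(window)
```

Dropping lines without a timestamp also removes continuation lines such as stack traces. If those are needed, the filter would have to carry over the timestamp of the preceding line instead.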
115+
116+
## Conclusion
117+
118+
### Key Points
119+
120+
1. Kubernetes logs contain significant PII, especially from kube-apiserver, kubelet, etcd, and all cloud provider components
121+
2. Higher log verbosity (for example `--v=4` or `--v=5`) significantly increases PII exposure
122+
3. Cloud provider account identifiers are prevalent in Cloud Controller Managers (CCMs) and CSI drivers
123+
4. Automated filtering tools are essential for safe log sharing at scale
124+
5. Manual review is still necessary to catch context-specific PII
125+
126+
### Best Practice for Support
127+
128+
## Additional Resources
129+
130+
### GDPR and Privacy
131+
132+
* [GDPR Official Text](https://gdpr-info.eu/)
133+
* [Article 29 Working Party Opinion on Personal Data](https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/index_en.htm)
