Commit 6305ff2

Enhance NaN handling in probability calculations and clustering algorithms. Added early detection for NaN values in COneOfNPrior and CXMeansOnline1d, preventing propagation and ensuring graceful failure. Updated .gitignore to exclude venv directories. Improved CMake configuration to set CPP_PLATFORM_HOME based on environment variables or source directory.
1 parent 49d85d0 commit 6305ff2
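
The early NaN detection described in the commit message can be illustrated with a minimal sketch. This is not the code added to `COneOfNPrior` or `CXMeansOnline1d`; the helper names `containsNaN` and `checkedMean` are hypothetical and only show the general pattern of rejecting NaN inputs before they reach a calculation.

```cpp
#include <cmath>
#include <optional>
#include <vector>

// Hypothetical helper: true if any sample is NaN.
bool containsNaN(const std::vector<double>& values) {
    for (double v : values) {
        if (std::isnan(v)) {
            return true;
        }
    }
    return false;
}

// Hypothetical calculation: returns the mean of the samples, or std::nullopt
// if the input is empty or contains NaN, so the caller can fail gracefully
// instead of propagating NaN into later computations.
std::optional<double> checkedMean(const std::vector<double>& samples) {
    if (samples.empty() || containsNaN(samples)) {
        return std::nullopt;
    }
    double sum = 0.0;
    for (double s : samples) {
        sum += s;
    }
    return sum / static_cast<double>(samples.size());
}
```

Returning an empty `std::optional` forces the caller to handle the failure explicitly, which is one way to stop a single bad value from silently contaminating every downstream result.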

41 files changed (+26473, -6 lines)

.env

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
SONARQUBE_TOKEN=squ_850690da2c61b8473ac4ebf8dddfbbfa89d10c50
SONARQUBE_URL="https://sonar.elastic.dev"

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -29,6 +29,7 @@ generated-resources/
 
 # python environment stuff
 **/env/*
+**/venv/*
 *.pyc
 
 # testing stuff

.scannerwork/.sonar_lock

Whitespace-only changes.

.scannerwork/report-task.txt

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
projectKey=elastic_ml-cpp_271ade36-31fc-4c6b-966e-80245560ad14
serverUrl=https://sonar.elastic.dev
serverVersion=10.4.1.88267
dashboardUrl=https://sonar.elastic.dev/dashboard?id=elastic_ml-cpp_271ade36-31fc-4c6b-966e-80245560ad14
ceTaskId=8a6dc50c-b755-468d-8ab1-321c74008419
ceTaskUrl=https://sonar.elastic.dev/api/ce/task?id=8a6dc50c-b755-468d-8ab1-321c74008419

.serena/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
/cache

.serena/project.yml

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
# language of the project (csharp, python, rust, java, typescript, go, cpp, or ruby)
# * For C, use cpp
# * For JavaScript, use typescript
# Special requirements:
# * csharp: Requires the presence of a .sln file in the project folder.
language: cpp

# whether to use the project's gitignore file to ignore files
# Added on 2025-04-07
ignore_all_files_in_gitignore: true
# list of additional paths to ignore
# same syntax as gitignore, so you can use * and **
# Was previously called `ignored_dirs`, please update your config if you are using that.
# Added (renamed) on 2025-04-07
ignored_paths: []

# whether the project is in read-only mode
# If set to true, all editing tools will be disabled and attempts to use them will result in an error
# Added on 2025-04-18
read_only: false

# list of tool names to exclude. We recommend not excluding any tools, see the readme for more details.
# Below is the complete list of tools for convenience.
# To make sure you have the latest list of tools, and to view their descriptions,
# execute `uv run scripts/print_tool_overview.py`.
#
# * `activate_project`: Activates a project by name.
# * `check_onboarding_performed`: Checks whether project onboarding was already performed.
# * `create_text_file`: Creates/overwrites a file in the project directory.
# * `delete_lines`: Deletes a range of lines within a file.
# * `delete_memory`: Deletes a memory from Serena's project-specific memory store.
# * `execute_shell_command`: Executes a shell command.
# * `find_referencing_code_snippets`: Finds code snippets in which the symbol at the given location is referenced.
# * `find_referencing_symbols`: Finds symbols that reference the symbol at the given location (optionally filtered by type).
# * `find_symbol`: Performs a global (or local) search for symbols with/containing a given name/substring (optionally filtered by type).
# * `get_current_config`: Prints the current configuration of the agent, including the active and available projects, tools, contexts, and modes.
# * `get_symbols_overview`: Gets an overview of the top-level symbols defined in a given file.
# * `initial_instructions`: Gets the initial instructions for the current project.
# Should only be used in settings where the system prompt cannot be set,
# e.g. in clients you have no control over, like Claude Desktop.
# * `insert_after_symbol`: Inserts content after the end of the definition of a given symbol.
# * `insert_at_line`: Inserts content at a given line in a file.
# * `insert_before_symbol`: Inserts content before the beginning of the definition of a given symbol.
# * `list_dir`: Lists files and directories in the given directory (optionally with recursion).
# * `list_memories`: Lists memories in Serena's project-specific memory store.
# * `onboarding`: Performs onboarding (identifying the project structure and essential tasks, e.g. for testing or building).
# * `prepare_for_new_conversation`: Provides instructions for preparing for a new conversation (in order to continue with the necessary context).
# * `read_file`: Reads a file within the project directory.
# * `read_memory`: Reads the memory with the given name from Serena's project-specific memory store.
# * `remove_project`: Removes a project from the Serena configuration.
# * `replace_lines`: Replaces a range of lines within a file with new content.
# * `replace_symbol_body`: Replaces the full definition of a symbol.
# * `restart_language_server`: Restarts the language server, may be necessary when edits not through Serena happen.
# * `search_for_pattern`: Performs a search for a pattern in the project.
# * `summarize_changes`: Provides instructions for summarizing the changes made to the codebase.
# * `switch_modes`: Activates modes by providing a list of their names
# * `think_about_collected_information`: Thinking tool for pondering the completeness of collected information.
# * `think_about_task_adherence`: Thinking tool for determining whether the agent is still on track with the current task.
# * `think_about_whether_you_are_done`: Thinking tool for determining whether the task is truly completed.
# * `write_memory`: Writes a named memory (for future reference) to Serena's project-specific memory store.
excluded_tools: []

# initial prompt for the project. It will always be given to the LLM upon activating the project
# (contrary to the memories, which are loaded on demand).
initial_prompt: ""

project_name: "ml-cpp"

01-introduction.md

Lines changed: 198 additions & 0 deletions
@@ -0,0 +1,198 @@
# ML-CPP: Elastic Machine Learning Core

## Overview

The ML-CPP repository contains the C++ core implementation of Elastic's Machine Learning capabilities, providing high-performance analytics for anomaly detection, data frame analytics, and PyTorch model inference within the Elastic Stack.

## Purpose and Scope

This codebase implements the computational engine for:

- **Time Series Anomaly Detection**: Real-time detection of anomalies in time series data using statistical models
- **Data Frame Analytics**: Supervised learning (classification/regression) and unsupervised learning (outlier detection) on structured data
- **PyTorch Model Inference**: High-performance inference for custom PyTorch models
- **Data Categorization**: Automatic categorization of log messages and text data

## High-Level Architecture

The system follows a layered architecture with clear separation of concerns:

```mermaid
graph TB
subgraph "Executables (bin/)"
A[autodetect] --> B[controller]
C[data_frame_analyzer] --> B
D[pytorch_inference] --> B
E[categorize] --> B
F[normalize] --> B
end

subgraph "API Layer (lib/api/)"
G[CAnomalyJob] --> H[CDataFrameAnalyzer]
I[CIoManager] --> J[CPersistenceManager]
end

subgraph "Model Layer (lib/model/)"
K[CAnomalyDetector] --> L[CDataGatherer]
M[CModelFactory] --> N[CResourceMonitor]
end

subgraph "Mathematics (lib/maths/)"
O[Time Series] --> P[Analytics]
Q[Common] --> R[Linear Algebra]
end

subgraph "Core (lib/core/)"
S[CLogger] --> T[CDataFrame]
U[CMemoryUsage] --> V[CStatePersistInserter]
end

A --> G
C --> H
G --> K
H --> M
K --> O
M --> Q
O --> S
Q --> S
```

## Key Design Principles

### 1. Memory-Conscious Design
- **Resource Monitoring**: Continuous tracking of memory usage with configurable limits
- **Memory Circuit Breakers**: Automatic process termination when memory limits are exceeded
- **Efficient Data Structures**: Specialized containers for time series and sparse data
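
The memory-limit idea in the list above can be sketched with a small budget tracker. This is an illustration only and not the interface of `CResourceMonitor`; the class name `CMemoryBudgetSketch` and its methods are hypothetical.

```cpp
#include <cstddef>

// Hypothetical tracker; not the interface of CResourceMonitor.
class CMemoryBudgetSketch {
public:
    explicit CMemoryBudgetSketch(std::size_t limitBytes) : m_Limit{limitBytes} {}

    // Record a reservation if it fits within the limit. Otherwise refuse it
    // and latch the "tripped" flag so the caller can stop or shed work.
    bool tryReserve(std::size_t bytes) {
        if (bytes > m_Limit - m_Used) {
            m_Tripped = true;
            return false;
        }
        m_Used += bytes;
        return true;
    }

    // Return previously reserved bytes to the budget.
    void release(std::size_t bytes) {
        m_Used = bytes > m_Used ? 0 : m_Used - bytes;
    }

    std::size_t used() const { return m_Used; }
    bool tripped() const { return m_Tripped; }

private:
    std::size_t m_Limit;
    std::size_t m_Used{0};
    bool m_Tripped{false};
};
```

The flag stays set after the first refused reservation, so a caller polling `tripped()` can decide what to do (log, stop creating models, or exit) separately from the code path that hit the limit.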

### 2. State Management
- **Persistence**: Complete model state can be saved and restored
- **Incremental Updates**: Models update incrementally as new data arrives
- **Fault Tolerance**: Robust handling of state corruption and version mismatches
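
As a rough sketch of how incremental updates and persistence fit together, the following keeps a running mean and variance (Welford's algorithm) and can save and restore its full state as text. The class `CRunningMomentsSketch` and its text format are assumptions made for illustration; the repository's own persistence classes are not shown here.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical model: running mean and variance via Welford's algorithm.
class CRunningMomentsSketch {
public:
    // Fold one new observation into the state in constant time.
    void add(double x) {
        ++m_Count;
        double delta = x - m_Mean;
        m_Mean += delta / static_cast<double>(m_Count);
        m_M2 += delta * (x - m_Mean);
    }

    double mean() const { return m_Mean; }

    // Unbiased sample variance; zero until two values have been seen.
    double variance() const {
        return m_Count > 1 ? m_M2 / static_cast<double>(m_Count - 1) : 0.0;
    }

    // Serialise the complete state as whitespace-separated text.
    std::string persist() const {
        std::ostringstream out;
        out.precision(17); // enough digits to round-trip a double
        out << m_Count << ' ' << m_Mean << ' ' << m_M2;
        return out.str();
    }

    // Restore state from text. Returns false and leaves the object
    // unchanged if the text cannot be parsed.
    bool restore(const std::string& state) {
        std::istringstream in{state};
        std::size_t count = 0;
        double mean = 0.0;
        double m2 = 0.0;
        if (!(in >> count >> mean >> m2)) {
            return false;
        }
        m_Count = count;
        m_Mean = mean;
        m_M2 = m2;
        return true;
    }

private:
    std::size_t m_Count{0};
    double m_Mean{0.0};
    double m_M2{0.0};
};
```

Because `restore` only commits after all fields parse, a corrupt state string leaves the existing model untouched, which is the simplest form of the fault tolerance mentioned above.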

### 3. Performance Optimization
- **Parallel Processing**: Multi-threaded execution where beneficial
- **SIMD Operations**: Vectorized mathematical operations
- **Memory Pooling**: Efficient memory allocation patterns
- **Caching**: Strategic caching of expensive computations

### 4. Extensibility
- **Plugin Architecture**: Modular design for different model types
- **Factory Pattern**: Dynamic model creation based on configuration
- **Interface-Based Design**: Clear abstractions for different components
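
The factory pattern named above can be sketched as a registry that maps a configuration string to a creator function. All names here (`CModelSketch`, `CMeanModelSketch`, `CModelFactorySketch`) are hypothetical and do not reflect the actual `CModelFactory` interface.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical abstract interface implemented by every model type.
class CModelSketch {
public:
    virtual ~CModelSketch() = default;
    virtual std::string name() const = 0;
};

// Hypothetical concrete model.
class CMeanModelSketch : public CModelSketch {
public:
    std::string name() const override { return "mean"; }
};

// Hypothetical factory: creates models by the type name found in configuration.
class CModelFactorySketch {
public:
    using TCreator = std::function<std::unique_ptr<CModelSketch>()>;

    // Associate a type name with a function that builds that model.
    void registerType(const std::string& type, TCreator creator) {
        m_Creators[type] = std::move(creator);
    }

    // Build a model for the given type name, or return nullptr if unknown.
    std::unique_ptr<CModelSketch> create(const std::string& type) const {
        auto it = m_Creators.find(type);
        if (it == m_Creators.end()) {
            return nullptr;
        }
        return it->second();
    }

private:
    std::map<std::string, TCreator> m_Creators;
};
```

A new model type is added by registering one more creator, for example `factory.registerType("mean", [] { return std::make_unique<CMeanModelSketch>(); });`, without touching the code that calls `create`.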

## Core Components

### Executables (`bin/`)

| Executable | Purpose | Key Features |
|------------|---------|--------------|
| `autodetect` | Time series anomaly detection | Real-time processing, multiple detector types |
| `controller` | Process management | Spawns and manages other ML processes |
| `data_frame_analyzer` | Supervised/unsupervised learning | Boosted trees, outlier detection |
| `pytorch_inference` | PyTorch model inference | Custom model support, batch processing |
| `categorize` | Text categorization | Tokenization, pattern matching |
| `normalize` | Data normalization | Feature scaling, outlier handling |

### Core Libraries

#### `lib/core/` - Fundamental Utilities
- **Logging**: Multi-level logging with named pipe support
- **I/O Management**: Efficient data streaming and parsing
- **Memory Management**: Usage tracking and circuit breakers
- **State Persistence**: Serialization and restoration
- **Concurrency**: Thread-safe operations and synchronization

#### `lib/maths/` - Mathematical Foundation
- **Common**: Statistical functions, linear algebra, probability distributions
- **Time Series**: Seasonal decomposition, trend analysis, forecasting
- **Analytics**: Boosted trees, clustering, feature importance

#### `lib/model/` - Anomaly Detection Models
- **Detectors**: Individual and population-based anomaly detection
- **Data Gatherers**: Time series data collection and bucketing
- **Model Factory**: Dynamic model creation and management
- **Resource Monitoring**: Memory and CPU usage tracking

#### `lib/api/` - High-Level API
- **Job Management**: Configuration and lifecycle management
- **Data Processing**: Input parsing and output formatting
- **Persistence**: State management and restoration
- **I/O Coordination**: Stream management and error handling

## Data Flow Overview

```mermaid
sequenceDiagram
participant ES as Elasticsearch
participant C as Controller
participant A as Autodetect
participant M as Model
participant O as Output

ES->>C: Start job
C->>A: Spawn process
A->>M: Initialize model
ES->>A: Stream data
A->>M: Process records
M->>M: Update model
M->>A: Generate results
A->>O: Write output
O->>ES: Return results
A->>A: Persist state
```

## Key Algorithms

### Time Series Anomaly Detection
- **Statistical Models**: Normal, Poisson, and Gamma distributions
- **Seasonal Decomposition**: Automatic detection of seasonal patterns
- **Change Point Detection**: Identification of regime changes
- **Population Analysis**: Multi-dimensional anomaly detection
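
To make the statistical-model idea concrete, here is a minimal sketch that scores a value by its two-sided tail probability under a Normal distribution. It is a textbook calculation, not the repository's scoring code; the function names and the threshold parameter are hypothetical, and a positive standard deviation is assumed.

```cpp
#include <cmath>

// Two-sided tail probability of x under Normal(mean, stddev^2):
// P(|Z| >= |z|) = erfc(|z| / sqrt(2)), where z = (x - mean) / stddev.
// Assumes stddev > 0.
double normalTailProbability(double x, double mean, double stddev) {
    double z = (x - mean) / stddev;
    return std::erfc(std::fabs(z) / std::sqrt(2.0));
}

// Hypothetical decision rule: flag values whose tail probability falls
// below the given threshold.
bool isUnusual(double x, double mean, double stddev, double threshold) {
    return normalTailProbability(x, mean, stddev) < threshold;
}
```

A value equal to the mean has tail probability 1, while a value five standard deviations away has a probability well below 0.001, so smaller probabilities correspond to more unusual observations.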

### Data Frame Analytics
- **Boosted Trees**: Gradient boosting for classification and regression
- **Outlier Detection**: Distance-based and density-based methods
- **Feature Engineering**: Automatic feature selection and encoding
- **Cross-Validation**: Model validation and hyperparameter tuning

### PyTorch Integration
- **Model Loading**: TorchScript model deserialization
- **Inference Pipeline**: Batch processing and result formatting
- **Memory Management**: Efficient tensor operations
- **Security**: Sandboxed execution environment

## Performance Characteristics

- **Memory Efficiency**: Sub-linear memory growth with data size
- **CPU Optimization**: SIMD operations and parallel processing
- **I/O Efficiency**: Streaming data processing with minimal buffering
- **Scalability**: Horizontal scaling through process spawning

## Development Philosophy

The codebase emphasizes:

1. **Correctness**: Extensive testing and validation
2. **Performance**: Optimized for production workloads
3. **Maintainability**: Clear interfaces and documentation
4. **Reliability**: Robust error handling and recovery
5. **Security**: Sandboxed execution and input validation

## Getting Started

For developers new to the codebase:

1. **Start with Core**: Understand `lib/core/` utilities and abstractions
2. **Explore Models**: Study `lib/model/` for anomaly detection concepts
3. **Examine APIs**: Review `lib/api/` for high-level interfaces
4. **Run Examples**: Use the executables with sample data
5. **Read Tests**: Unit tests provide excellent usage examples

## Next Steps

- [Architecture Details](02-architecture.md) - Deep dive into system design
- [Core Libraries](03-core-libraries.md) - Fundamental utilities and abstractions
- [Mathematical Foundation](04-mathematics.md) - Algorithms and statistical methods
- [Model Layer](05-model-layer.md) - Anomaly detection implementation
