Optimise import and microsimulation init performance #408
Three changes that together reduce import + Microsimulation() time by ~40%:
1. Enum encoding: replace np.select (O(n*m)) with np.searchsorted (O(n log m)) plus cached lookup arrays
2. empty_clone: replace dynamic type creation with object.__new__()
3. Period/instant parsing: add lru_cache to avoid repeated strptime calls
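To illustrate the `empty_clone` change, here is a minimal sketch of the two approaches. The `Parameter` class, its `init_calls` counter, and `empty_clone_dynamic` are hypothetical stand-ins, and the dynamic version is an assumed reconstruction of what "dynamic type creation" looks like, not the project's actual code.

```python
class Parameter:
    """Hypothetical stand-in for the objects cloned during parameter loading."""

    init_calls = 0  # counts how often __init__ actually runs

    def __init__(self, name, value):
        Parameter.init_calls += 1
        self.name = name
        self.value = value


def empty_clone_dynamic(original):
    """Assumed 'before' shape: build a throwaway subclass on every call."""
    Dummy = type("Dummy", (type(original),), {"__init__": lambda self: None})
    new = Dummy()
    new.__class__ = type(original)  # swap back to the real class
    return new


def empty_clone(original):
    """'After' shape: allocate an instance directly, skipping __init__."""
    return object.__new__(type(original))


original = Parameter("basic_rate", 0.2)
slow_clone = empty_clone_dynamic(original)
clone = empty_clone(original)

# Both clones start empty; the caller copies over whatever state it needs.
clone.__dict__.update(original.__dict__)
```

Both functions return an uninitialised instance of the original's class, but the second avoids constructing a new type object per call, which is where the saving would come from when the function is called tens of thousands of times.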
Pull request overview
This PR reduces the initialization time of PolicyEngine microsimulations by approximately 40% (from 10.75s to 6.6s) through three changes: enum encoding with `np.searchsorted`, replacing dynamic type creation with `object.__new__()` in `empty_clone`, and caching period/instant string parsing.
- Replaced `np.select` with `np.searchsorted` for O(n log m) enum encoding performance
- Simplified `empty_clone()` to use `object.__new__()` instead of dynamic type creation
- Added `@lru_cache` to period and instant parsing functions to avoid repeated `strptime` calls
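The enum encoding idea can be sketched as follows. This is an illustration under assumptions, not the code in `enum.py`: the member names and the `encode` helper are hypothetical. Where `np.select` evaluates one boolean mask per enum member (m passes over n values), `np.searchsorted` does a binary search per value against a pre-sorted array of names.

```python
import numpy as np

# Hypothetical enum member names in definition order (index = encoded value).
names = np.array(["OWNER", "TENANT", "FREE_LODGER", "HOMELESS"])

# Precomputed once (the PR caches these lookup arrays):
# sorted names, plus the definition index of each sorted position.
order = np.argsort(names)
sorted_names = names[order]


def encode(values: np.ndarray) -> np.ndarray:
    """Map member-name strings to definition indices.

    Assumes every value is a valid member name.
    """
    positions = np.searchsorted(sorted_names, values)
    return order[positions]


values = np.array(["TENANT", "OWNER", "HOMELESS", "TENANT"])
encoded = encode(values)  # definition indices: 1, 0, 3, 1
```

This sketch deliberately assumes valid input only; values that are not member names need separate handling.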
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| policyengine_core/enums/enum.py | Implements searchsorted-based enum encoding with cached sorted lookup arrays, but contains a critical bug where invalid enum values can cause IndexError or incorrect results |
| policyengine_core/commons/misc.py | Simplifies empty_clone to use object.__new__() for 33x performance improvement |
| policyengine_core/periods/helpers.py | Adds LRU caching to instant and period string parsing functions to avoid repeated strptime calls |
| changelog_entry.yaml | Documents the three optimization changes |
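The bug flagged for `enum.py` in the table above follows from how `np.searchsorted` behaves: it returns an insertion position, not a match. A small demonstration with hypothetical member names shows both failure modes the review mentions:

```python
import numpy as np

sorted_names = np.array(["A", "B", "C"])  # hypothetical sorted member names

# Case 1: a value that sorts after every name gets position len(sorted_names),
# so indexing with it raises IndexError.
pos_high = np.searchsorted(sorted_names, np.array(["ZZZ"]))
try:
    sorted_names[pos_high]
    raised = False
except IndexError:
    raised = True

# Case 2: a value that sorts between two names gets a valid position,
# which silently points at the wrong member.
pos_mid = np.searchsorted(sorted_names, np.array(["AB"]))
wrong_match = sorted_names[pos_mid][0]  # "B", although "AB" is not a member
```

An encoder built on `np.searchsorted` therefore has to compare the names found at the returned positions against the input before trusting the result.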
- Log warning when encoding invalid enum string values (they default to 0)
- Add tests for invalid enum value warning
- Document in changelog that random() now produces different sequences

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
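A minimal sketch of the behaviour this commit describes (warn on invalid strings, encode them as 0) might look like the following. The function name, logger, message text, and member names are assumptions for illustration; the actual implementation in `enum.py` may differ.

```python
import logging

import numpy as np

logger = logging.getLogger(__name__)


def encode_with_warning(names: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Encode strings as definition indices; invalid strings become 0."""
    order = np.argsort(names)
    sorted_names = names[order]
    # Clip so that values sorting past the last name cannot index out of range.
    positions = np.minimum(
        np.searchsorted(sorted_names, values), len(sorted_names) - 1
    )
    # A position is only a real match if the name found there equals the input.
    invalid = sorted_names[positions] != values
    if invalid.any():
        logger.warning(
            "Invalid enum values %s; encoding them as 0",
            np.unique(values[invalid]).tolist(),
        )
    return np.where(invalid, 0, order[positions])


names = np.array(["OWNER", "TENANT", "HOMELESS"])  # hypothetical members
result = encode_with_warning(names, np.array(["TENANT", "BOGUS", "OWNER"]))
```

Here `"BOGUS"` is encoded as 0, the same index as the first member `"OWNER"`, which is why a logged warning matters: the fallback is otherwise indistinguishable from a legitimate value.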
Review Summary
Great performance optimization PR! The 40% speedup (10.75s → 6.6s) is excellent.
Changes Reviewed ✅
Follow-up Commit Added
I pushed a commit with two small improvements:
All tests pass locally (455 passed).
MaxGhenis
left a comment
All CI checks pass. Great performance improvements!
Three changes that together reduce `from policyengine_uk import Microsimulation; sim = Microsimulation()` time by ~40% (10.75s → 6.6s):
1. Enum encoding - replaced `np.select` with `np.searchsorted` for O(n log m) lookup instead of O(n*m), with cached sorted lookup arrays via `@lru_cache`
2. empty_clone - replaced dynamic type creation with `object.__new__()` which is ~33x faster for the 32k calls during parameter cloning
3. Period/instant parsing - added `@lru_cache` to `_instant_from_string` and `_period_from_string` to avoid repeated `strptime` calls for the same period strings (called ~20k times during fiscal year parameter conversion)

Remaining bottlenecks are mostly inherent to data loading (`astype` at 0.80s, HuggingFace dataset at 0.82s) rather than algorithmic inefficiencies.
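The caching of period and instant parsing can be illustrated with a small sketch. `parse_instant` is a hypothetical stand-in for `_instant_from_string`; the date format, return type, and unbounded cache size are assumptions, not details taken from the PR.

```python
import datetime
from functools import lru_cache


@lru_cache(maxsize=None)
def parse_instant(value: str) -> tuple:
    """Parse 'YYYY-MM-DD' into (year, month, day), memoised per input string."""
    parsed = datetime.datetime.strptime(value, "%Y-%m-%d")
    return (parsed.year, parsed.month, parsed.day)


# Simulate many lookups of the same period string, as happens when
# parameters are converted for a single fiscal year.
for _ in range(20_000):
    parse_instant("2025-01-01")

info = parse_instant.cache_info()  # one miss; every later call is a hit
```

Because cached results are shared between callers, this pattern is only safe when the returned object is immutable (a tuple here); a mutable return value could be modified by one caller and corrupt the result for every other.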