Optimise import and microsimulation init performance by nikhilwoodruff · Pull Request #408 · PolicyEngine/policyengine-core

nikhilwoodruff · 2025-11-30T21:43:10Z

Three changes that together reduce from policyengine_uk import Microsimulation; sim = Microsimulation() time by ~40% (10.75s → 6.6s):

Enum encoding - replaced np.select with np.searchsorted for O(n log m) lookup instead of O(n*m), with cached sorted lookup arrays via @lru_cache

empty_clone - replaced dynamic type creation with object.__new__() which is ~33x faster for the 32k calls during parameter cloning

Period/instant parsing - added @lru_cache to _instant_from_string and _period_from_string to avoid repeated strptime calls for the same period strings (called ~20k times during fiscal year parameter conversion)

Remaining bottlenecks are mostly inherent to data loading (astype at 0.80s, HuggingFace dataset at 0.82s) rather than algorithmic inefficiencies.

Three changes that together reduce import + Microsimulation() time by ~40%: 1. Enum encoding: replace np.select (O(n*m)) with np.searchsorted (O(n log m)) plus cached lookup arrays 2. empty_clone: replace dynamic type creation with object.__new__() 3. Period/instant parsing: add lru_cache to avoid repeated strptime calls

Copilot

Pull request overview

This PR optimizes the initialization performance of PolicyEngine microsimulations by approximately 40% (from 10.75s to 6.6s) through three key performance improvements: enum encoding optimization using np.searchsorted, replacing dynamic type creation with object.__new__() in empty_clone, and caching period/instant string parsing.

Replaced np.select with np.searchsorted for O(n log m) enum encoding performance
Simplified empty_clone() to use object.__new__() instead of dynamic type creation
Added @lru_cache to period and instant parsing functions to avoid repeated strptime calls

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
policyengine_core/enums/enum.py	Implements searchsorted-based enum encoding with cached sorted lookup arrays, but contains a critical bug where invalid enum values can cause IndexError or incorrect results
policyengine_core/commons/misc.py	Simplifies empty_clone to use object.new() for 33x performance improvement
policyengine_core/periods/helpers.py	Adds LRU caching to instant and period string parsing functions to avoid repeated strptime calls
changelog_entry.yaml	Documents the three optimization changes

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.

policyengine_core/enums/enum.py

- Log warning when encoding invalid enum string values (they default to 0) - Add tests for invalid enum value warning - Document in changelog that random() now produces different sequences 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

MaxGhenis · 2025-12-01T15:13:04Z

Review Summary

Great performance optimization PR! The 40% speedup (10.75s → 6.6s) is excellent.

Changes Reviewed ✅

Enum encoding - O(n log m) with searchsorted vs O(n×m) with np.select - solid algorithmic improvement
empty_clone - object.__new__() is the canonical way to create uninitialized instances
Period/instant caching - LRU cache avoids ~20k repeated strptime calls
Vectorized random() - PCG hash is well-established for quality pseudo-random generation

Follow-up Commit Added

I pushed a commit with two small improvements:

Warning for invalid enum values: The encode method now logs a warning when it encounters invalid string values (which silently default to index 0). This helps catch bugs without breaking backward compatibility.
Documented random() behavioral change: The changelog now notes that random() produces different (but still deterministic) sequences. This is important for users who may rely on specific random sequences for reproducibility.

All tests pass locally (455 passed).

MaxGhenis

All CI checks pass. Great performance improvements!

MaxGhenis

All CI checks pass. Great performance improvements!

MaxGhenis

All CI checks pass. Great performance improvements!

MaxGhenis

LGTM

MaxGhenis

LGTM

MaxGhenis

LGTM

nikhilwoodruff added 2 commits November 30, 2025 21:42

Add changelog entry

f0e8f66

nikhilwoodruff requested review from anth-volk and Copilot November 30, 2025 21:50

Copilot started reviewing on behalf of nikhilwoodruff November 30, 2025 21:50 View session

Copilot finished reviewing on behalf of nikhilwoodruff November 30, 2025 21:53

Copilot AI reviewed Nov 30, 2025

View reviewed changes

policyengine_core/enums/enum.py Show resolved Hide resolved

policyengine_core/enums/enum.py Outdated Show resolved Hide resolved

nikhilwoodruff and others added 4 commits November 30, 2025 22:01

Vectorise random() function with PCG hash

1b58bdd

Fix enum encoding edge cases per review comments

24858df

Return 0 for invalid enum values to match old np.select behaviour

77878fb

MaxGhenis approved these changes Dec 1, 2025

View reviewed changes

MaxGhenis merged commit 8b9663c into master Dec 1, 2025
14 checks passed

MaxGhenis deleted the perf/import-optimisations branch December 1, 2025 15:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimise import and microsimulation init performance#408

Optimise import and microsimulation init performance#408
MaxGhenis merged 6 commits intomasterfrom
perf/import-optimisations

nikhilwoodruff commented Nov 30, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

MaxGhenis commented Dec 1, 2025

Uh oh!

MaxGhenis left a comment

Uh oh!

MaxGhenis left a comment

Uh oh!

MaxGhenis left a comment

Uh oh!

MaxGhenis left a comment

Uh oh!

MaxGhenis left a comment

Uh oh!

MaxGhenis left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

nikhilwoodruff commented Nov 30, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

MaxGhenis commented Dec 1, 2025

Review Summary

Changes Reviewed ✅

Follow-up Commit Added

Uh oh!

MaxGhenis left a comment

Choose a reason for hiding this comment

Uh oh!

MaxGhenis left a comment

Choose a reason for hiding this comment

Uh oh!

MaxGhenis left a comment

Choose a reason for hiding this comment

Uh oh!

MaxGhenis left a comment

Choose a reason for hiding this comment

Uh oh!

MaxGhenis left a comment

Choose a reason for hiding this comment

Uh oh!

MaxGhenis left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants