Skip to content

Conversation

@anthony-swirldslabs
Copy link
Contributor

@anthony-swirldslabs anthony-swirldslabs commented Oct 1, 2025

Description:
This is the last part of changes extracted from @jasperpotts 's draft PoC #612 . We store string fields as UTF-8 byte arrays internally in models. We allow codecs to parse original UTF-8 and implant them directly into the model w/o performing the UTF-8 decoding (which internally would convert the text into UTF-16 (or Latin-1) because that's how Java stores strings.) The public API of the models is unchanged, and whenever anyone calls a getter for a string field, the model will encode it into a Java string on the fly.

The idea behind this optimization is that we rarely read or use strings in our business logic. So we could as well skip the decoding part when parsing models.

Three caveats come with the fix:

  1. New public constructors are introduced that accept raw byte[] for strings and also have a fake unused argument so as to resolve a generic erasure clash. This looks a bit ugly. Also, this is not very safe because it allows a malicious code to create a mutable model instance (by retaining references to their byte[]) and then pass the object to a code that expects the models to be immutable. We could make the constructors private, but we'd have to move all the codecs into the same package with their models, which could be a breaking change.
  2. If we end up reading the string multiple times, either directly or via log/toString(), either now or in the future, then this optimization will be defeated as we'll be encoding the bytes multiple times. We could work-around this by adding caching for the encoded strings, same as we do with hashCode/protobufSize already. Also, we'll be running performance tests once these changes are released.
  3. The fuzz test thresholds had to be relaxed because PBJ no longer decodes UTF-8 strings when parsing models, so it now fails less often than Protoc. Also, this means that encoding errors may now happen inside our business logic code rather than at deserialization points as before. There isn't a simple solution for this caveat.

Related issue(s):

Fixes #620

Notes for reviewer:
All tests should pass.

Checklist

  • Documented (Code comments, README, etc.)
  • Tested (unit, integration, etc.)

Signed-off-by: Anthony Petrov <anthony@swirldslabs.com>
@github-actions
Copy link

github-actions bot commented Oct 1, 2025

JUnit Test Report

   77 files  ±0     77 suites  ±0   6m 18s ⏱️ + 3m 34s
1 328 tests ±0  1 324 ✅ ±0   4 💤 ±0  0 ❌ ±0 
7 185 runs  ±0  7 165 ✅ ±0  20 💤 ±0  0 ❌ ±0 

Results for commit 67e2746. ± Comparison against base commit 756dcbb.

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Oct 1, 2025

Integration Test Report

    408 files  + 3      408 suites  +3   14m 17s ⏱️ - 2m 35s
114 854 tests +17  114 854 ✅ +17  0 💤 ±0  0 ❌ ±0 
115 095 runs  +17  115 095 ✅ +17  0 💤 ±0  0 ❌ ±0 

Results for commit 67e2746. ± Comparison against base commit 756dcbb.

♻️ This comment has been updated with latest results.

imalygin
imalygin previously approved these changes Oct 3, 2025
Signed-off-by: Anthony Petrov <anthony@swirldslabs.com>
Signed-off-by: Anthony Petrov <anthony@swirldslabs.com>
@jasperpotts
Copy link
Member

For concern(1) I have one idea but don't love it.

You could make the byte[] constructor protected then make model objects sealed. In codec you could have a sub-class of model object that allows use of protected constructor. The model object could be sealed to only allow that single sub-class. No idea if it would have performance consequences, I hate it on a ugly point of view but maybe less than exposing byte[] not sure.

On concern (2) we should get a performance run with this branch by Alex to see how it effects CN performance before merging. If the gain is tiny or non-existent then lets not merge.

@anthony-swirldslabs
Copy link
Contributor Author

Closing the PR per #620 (comment) .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Revisit String performance

4 participants