@EslaMx7 commented Oct 15, 2024

Summary

This is a minimal first pass that keeps the unit tests passing after the switch to char[]. It is meant to highlight the areas of the code that rely heavily on string. In some of these areas we will need to reconstruct or allocate new strings; in others we may have to rethink the logic to accommodate char[].

Open Questions:

  1. Should we convert the string properties of all other classes to char[], similar to Node.Text?
  2. What methods or tools should we use to evaluate the performance gains (speed, memory allocation, etc.) from this refactoring?

"If you can’t measure it, you can’t improve it." – Peter Drucker

Currently, I can track the total duration of all unit tests, though it's not a precise metric, as each tokenization test case runs for less than one second.
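
One possible way to get a finer-grained number than the total test duration is to time a single operation over many iterations and record the bytes allocated on the current thread. The sketch below uses only `Stopwatch` and `GC.GetAllocatedBytesForCurrentThread()` from the base class library; the `QuickMeasure` class and its `Run` method are hypothetical names introduced for illustration and are not part of this PR.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not part of this PR: a rough measurement sketch.
public static class QuickMeasure
{
    // Runs `action` `iterations` times and returns the elapsed time and the
    // number of bytes allocated on the current thread during the loop.
    public static (TimeSpan Elapsed, long AllocatedBytes) Run(Action action, int iterations)
    {
        // Warm-up call so that JIT compilation is not included in the timing.
        action();

        long allocatedBefore = GC.GetAllocatedBytesForCurrentThread();
        var stopwatch = Stopwatch.StartNew();

        for (int i = 0; i < iterations; i++)
        {
            action();
        }

        stopwatch.Stop();
        long allocatedAfter = GC.GetAllocatedBytesForCurrentThread();

        return (stopwatch.Elapsed, allocatedAfter - allocatedBefore);
    }
}
```

Running the same tokenization call through `QuickMeasure.Run` before and after the refactoring would give a comparable pair of time and allocation figures. For more rigorous results, a dedicated harness such as BenchmarkDotNet with its `[MemoryDiagnoser]` attribute may be a better fit.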


Changes

Refactor Token class to use char[] instead of string

  1. Updated Token.Text from string to char[], requiring extensive changes across the codebase. Methods now handle char[] and ReadOnlySpan<char> for text manipulation.
  2. Updated constructors, methods, and utilities to support the new type.
  3. Added extension methods for SpanRuneEnumerator to List<Rune> conversion.
  4. Ensured all text processing functions in TokenizationUtils, BaseTokenizer, XLMRobertaTokenizer, and SentencePieceModel are compatible with char[].
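
The SpanRuneEnumerator to List<Rune> conversion mentioned above could look roughly like the following. This is a minimal sketch, not the code from this PR: the class name `SpanRuneEnumeratorExtensions` is an assumption, while `SpanRuneEnumerator`, `Rune`, and `EnumerateRunes()` are the standard `System.Text` and `System` APIs.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical class name; the actual extension in the PR may differ.
public static class SpanRuneEnumeratorExtensions
{
    // Collects every rune produced by the enumerator into a List<Rune>.
    // The enumerator is a struct passed by value, so the caller's copy
    // is not advanced by this call.
    public static List<Rune> ToList(this SpanRuneEnumerator enumerator)
    {
        var runes = new List<Rune>();
        while (enumerator.MoveNext())
        {
            runes.Add(enumerator.Current);
        }
        return runes;
    }
}
```

With this in place, a `char[]` can be turned into runes without first building a string, for example `text.AsSpan().EnumerateRunes().ToList()`.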

Refactored various methods and classes to use `char[]` instead of `string` for improved performance and memory efficiency. Updated method signatures, parameter types, and internal logic accordingly.

- `BaseTokenizer.cs`: Updated `SplitOnSpecialTokens` and `SplitOnSubstr` to use `Func<char[], (int, int, Mask)>`.
- `TokenizationUtils`: Removed and reintroduced `ToList` extension for `SpanRuneEnumerator`. Refactored methods like `SubstringRunes`, `GetUtf8BytesCount`, and `SubstringByByteOffset` to work with `char[]`.
- `SentencePieceUnigramModel.cs`: Modified token text processing to use `char[]`, including `SubstringByByteOffset` method calls.
- `TokenizationUtilsTests`: Updated test cases to convert `string` to `char[]`.
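
To illustrate what working on `char[]` by UTF-8 byte offset can involve, here is a hedged sketch of two helpers with the same names as the methods listed above. The signatures, the `CharArrayUtf8Sketch` class, and the requirement that offsets fall on rune boundaries are all assumptions made for this example; the actual implementations in `TokenizationUtils` may differ.

```csharp
using System;
using System.Text;

// Hypothetical sketch; signatures are assumed, not taken from the PR.
public static class CharArrayUtf8Sketch
{
    // Number of bytes the text would occupy when encoded as UTF-8.
    public static int GetUtf8BytesCount(ReadOnlySpan<char> text)
        => Encoding.UTF8.GetByteCount(text);

    // Returns the chars covering the UTF-8 byte range [byteStart, byteEnd).
    // Both offsets are assumed to fall on rune boundaries.
    public static char[] SubstringByByteOffset(ReadOnlySpan<char> text, int byteStart, int byteEnd)
    {
        int bytePos = 0;   // current position measured in UTF-8 bytes
        int charPos = 0;   // current position measured in UTF-16 chars
        int charStart = -1;
        int charEnd = -1;

        foreach (Rune rune in text.EnumerateRunes())
        {
            if (bytePos == byteStart) charStart = charPos;
            if (bytePos == byteEnd) { charEnd = charPos; break; }
            bytePos += rune.Utf8SequenceLength;
            charPos += rune.Utf16SequenceLength;
        }

        // Offsets that point exactly at the end of the text.
        if (charStart < 0 && bytePos == byteStart) charStart = charPos;
        if (charEnd < 0 && bytePos == byteEnd) charEnd = charPos;

        if (charStart < 0 || charEnd < charStart)
        {
            throw new ArgumentOutOfRangeException(nameof(byteStart),
                "Byte offsets must lie on rune boundaries within the text.");
        }

        return text.Slice(charStart, charEnd - charStart).ToArray();
    }
}
```

Accepting `ReadOnlySpan<char>` lets callers pass a `char[]` directly through the implicit conversion, and the only allocation is the returned array.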

@EslaMx7 added the enhancement (New feature or request) label on Oct 15, 2024
@EslaMx7 self-assigned this on Oct 15, 2024