
perf: use decodeURIComponent for UTF-8 extended parameter decoding#115

Merged
blakeembrey merged 4 commits into jshttp:master from Phillip9587:decodefield-utf8 on Feb 23, 2026

Conversation

Phillip9587 (Contributor) commented Feb 10, 2026

Improves the performance of UTF-8 extended parameter decoding by 9-21x using the native decodeURIComponent(), with a graceful fallback for edge cases.

Changes

  • UTF-8 path: Use decodeURIComponent() for faster native UTF-8 handling. If decodeURIComponent() throws, fall back to manual decoding for backward compatibility with malformed percent sequences
  • ISO-8859-1 path: Preserve existing implementation
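
Sketched out, the new UTF-8 path looks roughly like this (decodeUtf8Field and the inline fallback are illustrative, not the library's exact code):

```javascript
// Illustrative sketch of the fast path + fallback (hypothetical names).
function decodeUtf8Field (encoded) {
  try {
    // Fast path: native percent-decoding for well-formed UTF-8 input.
    return decodeURIComponent(encoded)
  } catch (err) {
    // decodeURIComponent throws on invalid UTF-8 byte sequences
    // (e.g. a lone %E4), so percent-decode to raw bytes and let the
    // UTF-8 decoder substitute U+FFFD for invalid sequences.
    const bytes = []
    for (let idx = 0; idx < encoded.length; idx++) {
      if (encoded[idx] === '%') {
        bytes.push(parseInt(encoded.slice(idx + 1, idx + 3), 16))
        idx += 2
      } else {
        bytes.push(encoded.charCodeAt(idx))
      }
    }
    return Buffer.from(bytes).toString('utf8')
  }
}
```

The fallback only runs when the fast path throws, so the common case pays a single native call.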

Benchmarks

=== Environment ===

┌──────────────┬──────────────────────────────────────────────┐
│ node         │ 'v25.6.1'                                    │
│ platform     │ 'linux'                                      │
│ arch         │ 'x64'                                        │
│ os           │ 'Linux 6.17.0-14-generic'                    │
│ memory.total │ 96483323904                                  │
│ memory.free  │ 82700402688                                  │
│ cpu.model    │ 'AMD Ryzen 9 8945HS w/ Radeon 780M Graphics' │
│ cpu.cores    │ 16                                           │
└──────────────┴──────────────────────────────────────────────┘

=== Results ===

┌─────────┬───────────────────────────────────────┬───────────────┬─────────┬──────────────┬───────────────┐
│ (index) │ Test                                  │ Avg Time (µs) │ Ops/sec │ Min (µs)     │ Max (µs)      │
├─────────┼───────────────────────────────────────┼───────────────┼─────────┼──────────────┼───────────────┤
│ 0       │ 'decodeWithRegex - multiple short'    │ '4764.80'     │ 213161  │ '4318.00'    │ '1938768.00'  │
│ 1       │ 'decodeHexEscapes - multiple short'   │ '1583.74'     │ 649343  │ '1362.00'    │ '723676.00'   │
│ 2       │ 'decodeURIComponent - multiple short' │ '480.58'      │ 2119663 │ '430.00'     │ '1485443.00'  │
│ 3       │ 'decodeWithRegex - long'              │ '4486.74'     │ 237722  │ '3867.00'    │ '25382214.00' │
│ 4       │ 'decodeHexEscapes - long'             │ '1270.93'     │ 805380  │ '1173.00'    │ '6660762.00'  │
│ 5       │ 'decodeURIComponent - long'           │ '298.78'      │ 3456286 │ '260.00'     │ '9897425.00'  │
│ 6       │ 'decodeWithRegex - very long'         │ '5058387.97'  │ 209     │ '4337660.00' │ '39812382.00' │
│ 7       │ 'decodeHexEscapes - very long'        │ '1395493.43'  │ 732     │ '1256357.00' │ '3560765.00'  │
│ 8       │ 'decodeURIComponent - very long'      │ '239006.61'   │ 4496    │ '185345.00'  │ '7202521.00'  │
└─────────┴───────────────────────────────────────┴───────────────┴─────────┴──────────────┴───────────────┘

=== Comparison ===

multiple short (vs decodeWithRegex):
  decodeHexEscapes is 3.01x faster
  decodeURIComponent is 9.91x faster

long (vs decodeWithRegex):
  decodeHexEscapes is 3.53x faster
  decodeURIComponent is 15.02x faster

very long (vs decodeWithRegex):
  decodeHexEscapes is 3.62x faster
  decodeURIComponent is 21.16x faster

This can be seen as an alternative or an addition to #112.


closes #112 closes #114

Copilot AI left a comment

Pull request overview

This PR optimizes RFC 5987 extended parameter decoding by switching the UTF-8 decoding path to native decodeURIComponent() (with a manual fallback), while keeping the ISO-8859-1 behavior equivalent.

Changes:

  • Use decodeURIComponent() for UTF-8 extended parameter decoding, with fallback to manual %xx decoding + Buffer UTF-8 decoding on failures.
  • Replace regex-based %xx detection/decoding with helper functions (hasHexEscape, decodeHexEscapes, isHexDigit).
  • Minor doc comment formatting adjustments.
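
The helper functions the review mentions can be pictured roughly like this (assumed shapes inferred from their names; the merged code may differ):

```javascript
// Illustrative sketches of the %xx helpers (assumed, not verbatim).
function isHexDigit (char) {
  return (char >= '0' && char <= '9') ||
    (char >= 'a' && char <= 'f') ||
    (char >= 'A' && char <= 'F')
}

function hasHexEscape (str) {
  // True if the string contains at least one complete %xx escape.
  for (let idx = 0; idx + 2 < str.length; idx++) {
    if (str[idx] === '%' && isHexDigit(str[idx + 1]) && isHexDigit(str[idx + 2])) {
      return true
    }
  }
  return false
}
```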


Phillip9587 (Contributor, Author) commented:

@blakeembrey The Copilot review was right:

  The catch-block comment says this fallback is for "malformed percent sequences", but EXT_VALUE_REGEXP already rejects malformed % escapes; in practice decodeURIComponent() will mainly throw on invalid UTF-8 byte sequences (e.g. %E4). Suggest updating the comment (and consider removing the inline TODO, or tracking it via an issue/changelog) to avoid misleading future maintainers about when this path runs.

EXT_VALUE_REGEXP already checks for valid hex escapes. For the fallback when decodeURIComponent() fails, I used TextDecoder as in #112, and it passes all tests, including

assert.deepEqual(contentDisposition.parse('attachment; filename*=UTF-8\'\'%E4%20rates.pdf'), {
  type: 'attachment',
  parameters: { filename: '\ufffd rates.pdf' }
})

which covers the invalid byte sequence %E4 and replaces it with the Unicode replacement character.

So I would definitely prefer this PR over #112. The last thing we need to decide is: #115 (comment)

Phillip9587 (Contributor, Author) commented:

This PR is ready now 🚀

@blakeembrey blakeembrey merged commit dffa489 into jshttp:master Feb 23, 2026
14 checks passed
@Phillip9587 Phillip9587 deleted the decodefield-utf8 branch February 23, 2026 19:08
ChALkeR (Contributor) commented Feb 23, 2026

Hi there from nodejs/node#61041 which you linked to in #112.

  1. You are using TextDecoder a bit wrong; at the least you probably want consistent BOM handling, so either set ignoreBOM or strip it on all platforms
  2. @exodus/bytes has a fast fallback impl (upd: I checked the code, you likely don't need a fallback impl)
  3. You are double-converting hex -> bytes -> string -> bytes -> u8arr

blakeembrey (Member) commented:

You are using TextDecoder a bit wrong; at the least you probably want consistent BOM handling, so either set ignoreBOM or strip it on all platforms

It's only used as a fallback when decodeURIComponent fails, but this is a good point that we should have a test for the BOM.
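
The inconsistency is easy to demonstrate: decodeURIComponent keeps an encoded BOM, while a default TextDecoder strips a leading BOM from the byte stream (small sketch):

```javascript
// A leading UTF-8 BOM (EF BB BF) survives decodeURIComponent but is
// stripped by TextDecoder unless ignoreBOM is set.
const bomBytes = Uint8Array.of(0xef, 0xbb, 0xbf, 0x41) // BOM + 'A'

const viaUri = decodeURIComponent('%EF%BB%BFA')
const viaDecoder = new TextDecoder('utf-8').decode(bomBytes)
const viaIgnoreBOM = new TextDecoder('utf-8', { ignoreBOM: true }).decode(bomBytes)

// viaUri === '\ufeffA', viaDecoder === 'A', viaIgnoreBOM === '\ufeffA'
```

So a test exercising a percent-encoded BOM would pin down which behavior the package guarantees.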

You are double-converting hex -> bytes -> string -> bytes -> u8arr

In the fallback? That's a good point, it could be simpler to keep it as code points. @Phillip9587 Do you want to run some benchmarks for this in a new PR?

@exodus/bytes has a fast fallback impl

For the most part this is fallback code when decodeURIComponent fails, so I think it's not a huge trade-off. We also could drop usage of TextDecoder entirely in the next major. We need to keep in mind package size for browser environments but it's a good point that we could allow decoders to be injected into the package so someone can swap this into the library instead of the default TextDecoder.

Phillip9587 (Contributor, Author) commented:

In the fallback? That's a good point, it could be simpler to keep it as code points. @Phillip9587 Do you want to run some benchmarks for this in a new PR?

I explored the direct hex -> bytes -> u8arr approach to avoid the double conversion.

function decodeHexEscapesToBytes (str) {
  // Over-allocate: the output can never be longer than the input string
  const bytes = new Uint8Array(str.length)
  let offset = 0

  for (let idx = 0; idx < str.length; idx++) {
    if (
      str[idx] === '%' &&
      idx + 2 < str.length &&
      isHexDigit(str[idx + 1]) &&
      isHexDigit(str[idx + 2])
    ) {
      // Complete %xx escape: decode the two hex digits into one byte
      bytes[offset++] = Number.parseInt(str[idx + 1] + str[idx + 2], 16)
      idx += 2
    } else {
      bytes[offset++] = str.charCodeAt(idx)
    }
  }

  // Trim to the bytes actually written
  return bytes.slice(0, offset)
}

As the string may contain percent escapes, the Uint8Array would be over-allocated. However, the overhead is negligible given HTTP header size limits, with the worst case being ~2x allocation.

It would also require us to keep the existing decodeHexEscapes() in order to keep the ISO-8859-1 path in decodeField() fast. I explored using the bytes conversion and a manual loop for latin1:

const bytes = decodeHexEscapesToBytes(encoded)
let string = ''
for (let idx = 0; idx < bytes.length; idx++) {
  // Filter to printable Latin-1: 0x20-0x7E (printable ASCII) or 0xA0-0xFF (high Latin-1)
  if (
    (bytes[idx] >= 0x20 && bytes[idx] <= 0x7e) ||
    (bytes[idx] >= 0xa0 && bytes[idx] <= 0xff)
  ) {
    string += String.fromCharCode(bytes[idx])
  } else {
    string += '?'
  }
}
return string
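
For comparison, the existing string-based path can be sketched like this (regex definitions approximated here, not copied from the repo):

```javascript
// Approximate sketch of the string + regex Latin-1 path.
const HEX_ESCAPE_REGEXP = /%([0-9a-fA-F]{2})/g
const NON_LATIN1_REGEXP = /[^\x20-\x7e\xa0-\xff]/g

function decodeLatin1 (encoded) {
  // Replace each %xx escape with its Latin-1 code point...
  const decoded = encoded.replace(HEX_ESCAPE_REGEXP, (_, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  )
  // ...then substitute '?' for anything outside printable Latin-1.
  return decoded.replace(NON_LATIN1_REGEXP, '?')
}
```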

I benchmarked both approaches (bytes -> manual loop, and string -> regex; details below):

  • ASCII-only input: current string + NON_LATIN1_REGEXP approach is ~3-4x faster
  • Latin-1 / light encoding: performance is slightly better or roughly equal
  • Heavy percent-encoding: the byte-loop approach is somewhat slower
Benchmark Details:
"ASCII only (20 chars)":              "my_document_file.pdf",
"ASCII with spaces (30 chars)":       "my%20document%20with%20spaces.txt",
"Latin-1 with accents (25 chars)":    "caf%E9_r%E9sum%E9_naive.pdf",
"Mixed content (50 chars)":           "my_file_%E9_test_%20with_%C3%A4.txt",
"Long percent-encoded (100 chars)":   "doc_%E9_%E0_%E8_%2B_%2D_%2F_%E9_%E9_%E9_%E9_%E9_%E9_%E9_%E9_%E9_%E9_%E9_%E9_%E9_%E9.txt",
"Heavy percent-encoding (100 chars)": "%20%21%22%23%24%25%26%27%28%29%2A%2B%2C%2D%2E%2F0123456789%3A%3B%3C%3D%3E%3F%40ABCDEFGHIJ",
┌─────────┬────────────────────────────────────────────┬──────────────────┬──────────────────┬────────────────────────┬────────────────────────┬─────────┐
│ (index) │ Task name                                  │ Latency avg (ns) │ Latency med (ns) │ Throughput avg (ops/s) │ Throughput med (ops/s) │ Samples │
├─────────┼────────────────────────────────────────────┼──────────────────┼──────────────────┼────────────────────────┼────────────────────────┼─────────┤
│ 0       │ 'old - ASCII only (20 chars)'              │ '55.29 ± 0.24%'  │ '50.00 ± 0.00'   │ '18960208 ± 0.02%'     │ '20000000 ± 0'         │ 1808563 │
│ 1       │ 'new - ASCII only (20 chars)'              │ '201.12 ± 1.00%' │ '181.00 ± 1.00'  │ '5355849 ± 0.03%'      │ '5524862 ± 30694'      │ 497205  │
│ 2       │ 'old - ASCII with spaces (30 chars)'       │ '349.37 ± 2.05%' │ '330.00 ± 9.00'  │ '3038513 ± 0.03%'      │ '3030303 ± 84962'      │ 286226  │
│ 3       │ 'new - ASCII with spaces (30 chars)'       │ '319.42 ± 2.49%' │ '291.00 ± 9.00'  │ '3355795 ± 0.04%'      │ '3436426 ± 103093'     │ 313067  │
│ 4       │ 'old - Latin-1 with accents (25 chars)'    │ '302.69 ± 1.82%' │ '281.00 ± 10.00' │ '3498993 ± 0.03%'      │ '3558719 ± 122293'     │ 330373  │
│ 5       │ 'new - Latin-1 with accents (25 chars)'    │ '283.39 ± 2.55%' │ '261.00 ± 9.00'  │ '3743698 ± 0.03%'      │ '3831418 ± 127714'     │ 352876  │
│ 6       │ 'old - Mixed content (50 chars)'           │ '341.98 ± 1.80%' │ '330.00 ± 10.00' │ '3025454 ± 0.03%'      │ '3030303 ± 89127'      │ 292414  │
│ 7       │ 'new - Mixed content (50 chars)'           │ '359.48 ± 9.12%' │ '311.00 ± 9.00'  │ '3095145 ± 0.04%'      │ '3215434 ± 90434'      │ 278182  │
│ 8       │ 'old - Long percent-encoded (100 chars)'   │ '922.50 ± 1.65%' │ '862.00 ± 10.00' │ '1139369 ± 0.05%'      │ '1160093 ± 13616'      │ 108402  │
│ 9       │ 'new - Long percent-encoded (100 chars)'   │ '1075.8 ± 0.74%' │ '982.00 ± 30.00' │ '981104 ± 0.08%'       │ '1018330 ± 32090'      │ 92957   │
│ 10      │ 'old - Heavy percent-encoding (100 chars)' │ '962.59 ± 1.29%' │ '902.00 ± 10.00' │ '1088949 ± 0.05%'      │ '1108647 ± 12429'      │ 103887  │
│ 11      │ 'new - Heavy percent-encoding (100 chars)' │ '1137.8 ± 1.34%' │ '1032.0 ± 40.00' │ '932541 ± 0.08%'       │ '968992 ± 36156'       │ 87887   │
└─────────┴────────────────────────────────────────────┴──────────────────┴──────────────────┴────────────────────────┴────────────────────────┴─────────┘

For the most part this is fallback code when decodeURIComponent fails, so I think it's not a huge trade-off. We also could drop usage of TextDecoder entirely in the next major. We need to keep in mind package size for browser environments but it's a good point that we could allow decoders to be injected into the package so someone can swap this into the library instead of the default TextDecoder.

I agree, I don’t think it’s worth optimizing a fallback path we’re planning to remove in the next major. Since it would require duplicating logic, the added complexity doesn’t seem justified. What do you think, @blakeembrey?


Development

Successfully merging this pull request may close these issues.

Compatibility with non-Node.js environments

4 participants