perf: use decodeURIComponent for UTF-8 extended parameter decoding #115

blakeembrey merged 4 commits into jshttp:master from
Conversation
Pull request overview

This PR optimizes RFC 5987 extended parameter decoding by switching the UTF-8 decoding path to native `decodeURIComponent()` (with a manual fallback), while keeping the ISO-8859-1 behavior equivalent.

Changes:
- Use `decodeURIComponent()` for UTF-8 extended parameter decoding, with fallback to manual `%xx` decoding + `Buffer` UTF-8 decoding on failures.
- Replace regex-based `%xx` detection/decoding with helper functions (`hasHexEscape`, `decodeHexEscapes`, `isHexDigit`).
- Minor doc comment formatting adjustments.
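The native-first decode with manual fallback described above can be sketched roughly as follows; `decodeUtf8` and the inline fallback loop are illustrative names and shapes, not the PR's actual code:

```js
// Illustrative sketch of the native-first decode path.
// decodeUtf8 is a hypothetical name, not the PR's identifier.
function decodeUtf8 (encoded) {
  try {
    // Fast path: native UTF-8 percent-decoding.
    return decodeURIComponent(encoded)
  } catch (err) {
    // Fallback for malformed sequences: decode %xx escapes to bytes,
    // then let Buffer's UTF-8 decoder substitute U+FFFD for invalid bytes.
    const bytes = []
    for (let i = 0; i < encoded.length; i++) {
      if (encoded[i] === '%' && /^[0-9A-Fa-f]{2}$/.test(encoded.slice(i + 1, i + 3))) {
        bytes.push(parseInt(encoded.slice(i + 1, i + 3), 16))
        i += 2
      } else {
        bytes.push(encoded.charCodeAt(i))
      }
    }
    return Buffer.from(bytes).toString('utf8')
  }
}

console.log(decodeUtf8('%E4%BD%A0')) // "你" via the native path
console.log(decodeUtf8('%E4')) // "\uFFFD" via the fallback
```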
@blakeembrey The Copilot review was right. There is a test (content-disposition/test/test.js, lines 404 to 407 in 1a4f3bf) which checks that the invalid byte sequence %E4 is replaced with the Unicode replacement character. So I would definitely prefer this PR over #112. The last thing we would need to decide is: #115 (comment)
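For reference, here is a small Node.js demonstration (a sketch, not the project's actual test) of the two behaviors in question: `decodeURIComponent()` throws on the lone `%E4` escape, while `Buffer`'s UTF-8 decoder substitutes U+FFFD:

```js
// %E4 starts a multi-byte UTF-8 sequence but has no continuation bytes,
// so on its own it is an invalid sequence.

// Native decoding throws URIError on the malformed escape:
let threw = false
try {
  decodeURIComponent('%E4')
} catch (err) {
  threw = err instanceof URIError
}
console.log(threw) // true

// Buffer's UTF-8 decoder replaces the invalid byte with U+FFFD instead:
const decoded = Buffer.from([0xe4]).toString('utf8')
console.log(decoded === '\ufffd') // true
```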
This PR is ready now 🚀 |
Hi there from nodejs/node#61041 which you linked to in #112.
It's only used as a fallback when `decodeURIComponent()` throws.
In the fallback? That's a good point, it could be simpler to keep it as code points. @Phillip9587 Do you want to run some benchmarks for this in a new PR?
For the most part this is fallback code for when `decodeURIComponent()` throws.
I explored the direct approach:

```js
function decodeHexEscapesToBytes (str) {
  const bytes = new Uint8Array(str.length)
  let offset = 0
  for (let idx = 0; idx < str.length; idx++) {
    if (
      str[idx] === '%' &&
      idx + 2 < str.length &&
      isHexDigit(str[idx + 1]) &&
      isHexDigit(str[idx + 2])
    ) {
      bytes[offset++] = Number.parseInt(str[idx + 1] + str[idx + 2], 16)
      idx += 2
    } else {
      bytes[offset++] = str.charCodeAt(idx)
    }
  }
  return bytes.slice(0, offset)
}
```

As the string may contain percent escapes, the Uint8Array would be over-allocated. However, the overhead is negligible given HTTP header size limits, with the worst case being ~2x allocation. It also would require us to keep the existing Latin-1 filtering:

```js
const bytes = decodeHexEscapesToBytes(encoded);
let string = "";
for (let idx = 0; idx < bytes.length; idx++) {
  // Filter to printable Latin-1: 0x20-0x7E (printable ASCII) or 0xA0-0xFF (high Latin-1)
  if (
    (bytes[idx] >= 0x20 && bytes[idx] <= 0x7e) ||
    (bytes[idx] >= 0xa0 && bytes[idx] <= 0xff)
  ) {
    string += String.fromCharCode(bytes[idx]);
  } else {
    string += "?";
  }
}
return string;
```

I benchmarked both approaches (bytes -> manual loop and string -> regex) (details below):
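Note that the snippets above depend on an `isHexDigit` helper. A minimal sketch of such a check (an assumed implementation, not necessarily the PR's) might look like:

```js
// Assumed implementation of the isHexDigit helper referenced above:
// true only for a single character in [0-9A-Fa-f].
function isHexDigit (ch) {
  return ch.length === 1 && /[0-9A-Fa-f]/.test(ch)
}

console.log(isHexDigit('A')) // true
console.log(isHexDigit('g')) // false
```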
Benchmark Details:
I agree, I don’t think it’s worth optimizing a fallback path we’re planning to remove in the next major. Since it would require duplicating logic, the added complexity doesn’t seem justified. What do you think, @blakeembrey?
Improves the performance of UTF-8 extended parameter decoding by 9-21x using native `decodeURIComponent()`, with graceful fallback for edge cases.

Changes

- Use `decodeURIComponent()` for faster native UTF-8 handling. If `decodeURIComponent()` throws, fall back to manual decoding for backward compatibility with malformed percent sequences.

Benchmarks
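A minimal way to reproduce this kind of comparison in Node.js (a hypothetical harness, not the author's benchmark script; `manualDecode` is an illustrative stand-in for the manual path):

```js
// Hypothetical micro-benchmark sketch comparing native decodeURIComponent()
// with a Buffer-based manual %xx decode. Not the PR's actual benchmark.
const input = '%E4%BD%A0%E5%A5%BD'.repeat(100) // UTF-8 escapes for "你好"

function manualDecode (str) {
  // Collect each %xx escape as a byte, then decode the bytes as UTF-8.
  const bytes = []
  for (let i = 0; i < str.length; i++) {
    if (str[i] === '%') {
      bytes.push(parseInt(str.slice(i + 1, i + 3), 16))
      i += 2
    } else {
      bytes.push(str.charCodeAt(i))
    }
  }
  return Buffer.from(bytes).toString('utf8')
}

console.time('native decodeURIComponent')
for (let i = 0; i < 1e4; i++) decodeURIComponent(input)
console.timeEnd('native decodeURIComponent')

console.time('manual %xx decode')
for (let i = 0; i < 1e4; i++) manualDecode(input)
console.timeEnd('manual %xx decode')
```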
This can be seen as an alternative or addition to #112.
closes #112 closes #114