Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
36 changes: 26 additions & 10 deletions src/Address.ts
Original file line number Diff line number Diff line change
Expand Up @@ -5,18 +5,34 @@ import * as path from "path";
import * as url from "url";

export function sanitizeUrl(urlRemote: string) {
var urlParts = url.parse(urlRemote, false, true);
var origin = urlParts.host || "";
try {
const urlObj = new URL(urlRemote);
var origin = urlObj.host || "";

if ((urlParts.pathname || "").charAt(0) != "/") origin += "/";
if ((urlObj.pathname || "").charAt(0) != "/") origin += "/";

origin += urlParts.pathname;
return [
urlParts.protocol || "http:",
"//",
url.resolve("", origin),
urlParts.search || ""
].join("");
origin += urlObj.pathname;
return [
urlObj.protocol || "http:",
"//",
url.resolve("", origin),
urlObj.search || ""
].join("");
} catch (e) {
// If URL is invalid, try to construct it with a base
const urlObj = new URL(urlRemote, "http://dummy");
var origin = urlObj.host || "";

if ((urlObj.pathname || "").charAt(0) != "/") origin += "/";

origin += urlObj.pathname;
return [
urlObj.protocol || "http:",
"//",
url.resolve("", origin),
urlObj.search || ""
].join("");
Comment on lines +23 to +34

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 sanitizeUrl catch branch leaks dummy host into output for protocol-less URLs

When sanitizeUrl receives a URL without a protocol (e.g., example.com/path), new URL() throws on line 9, and the catch branch on line 23 uses new URL(urlRemote, "http://dummy"). This resolves the input relative to the dummy base, causing urlObj.host to be "dummy" and the pathname to be /example.com/path.

Root Cause and Impact

The old code used url.parse(urlRemote, false, true) where the third argument (true) enabled slashesDenotingHost mode. For input example.com/path, this correctly parsed it as host: null, pathname: 'example.com/path', producing http:///example.com/path.

The new catch branch with new URL('example.com/path', 'http://dummy') instead produces:

  • host: 'dummy'
  • pathname: '/example.com/path'
  • Final result: http://dummy/example.com/path

This means any protocol-less URL passed to sanitizeUrl will have the fake host dummy injected into the output. While the test suite primarily tests full URLs with protocols (so existing tests pass), this is a behavioral regression for any real-world usage with bare hostnames or protocol-relative URLs like //example.com/path where the host is correctly handled but the protocol is inherited from the dummy base.

Impact: Protocol-less URLs (if any are used in practice) will be incorrectly resolved against a fake dummy host, producing invalid URLs.

Prompt for agents
In src/Address.ts, the catch branch of sanitizeUrl (lines 21-35) uses `new URL(urlRemote, 'http://dummy')` as a fallback, but this causes the dummy host to leak into the output for protocol-less inputs like 'example.com/path'. To replicate the old url.parse(urlRemote, false, true) behavior, you need to handle protocol-less URLs differently. One approach: manually prepend 'http://' to the input and re-parse with `new URL('http://' + urlRemote)` if the input doesn't start with '//', or use `new URL('http:' + urlRemote)` for protocol-relative URLs starting with '//'. This preserves the original host parsing semantics without injecting a dummy host.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

}
Comment on lines +8 to +35
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "Address.ts" -o -name "address.ts" | head -20

Repository: loanlink-nl/cget

Length of output: 76


🏁 Script executed:

cat -n src/Address.ts | head -50

Repository: loanlink-nl/cget

Length of output: 1741


🏁 Script executed:

rg -t ts -t js "sanitizeUrl" --max-count=20 -B 2 -A 2

Repository: loanlink-nl/cget

Length of output: 1104


🏁 Script executed:

fd -e test.ts -e test.js -e spec.ts -e spec.js | head -20

Repository: loanlink-nl/cget

Length of output: 97


🏁 Script executed:

cat -n src/Address.test.ts | head -100

Repository: loanlink-nl/cget

Length of output: 3253


🏁 Script executed:

cat -n src/Address.test.ts | tail -100

Repository: loanlink-nl/cget

Length of output: 3535


🏁 Script executed:

rg "new URL" -B 3 -A 3 -t ts

Repository: loanlink-nl/cget

Length of output: 1430


🏁 Script executed:

sed -n '38,150p' src/Address.ts

Repository: loanlink-nl/cget

Length of output: 2349


Fallback path injects dummy host into sanitized URLs, corrupting URL identity.

Lines 23-35 use new URL(urlRemote, "http://dummy") as a fallback for invalid URLs. When triggered (e.g., scheme-less input like "www.example.org"), the URL object's host becomes "dummy" and the input becomes a pathname. This corrupts the URL identity, breaking cache keys and remote fetch operations. The block also redeclares origin at line 24 using function-scoped var.

Suggested fix
 export function sanitizeUrl(urlRemote: string) {
+  const toSanitized = (urlObj: URL, stripDummyHost = false) => {
+    let origin =
+      stripDummyHost && urlObj.host === "dummy" ? "" : urlObj.host || "";
+
+    if ((urlObj.pathname || "").charAt(0) != "/") origin += "/";
+    origin += urlObj.pathname;
+
+    return [
+      urlObj.protocol || "http:",
+      "//",
+      url.resolve("", origin),
+      urlObj.search || ""
+    ].join("");
+  };
+
   try {
-    const urlObj = new URL(urlRemote);
-    var origin = urlObj.host || "";
-
-    if ((urlObj.pathname || "").charAt(0) != "/") origin += "/";
-
-    origin += urlObj.pathname;
-    return [
-      urlObj.protocol || "http:",
-      "//",
-      url.resolve("", origin),
-      urlObj.search || ""
-    ].join("");
+    return toSanitized(new URL(urlRemote));
   } catch (e) {
-    // If URL is invalid, try to construct it with a base
-    const urlObj = new URL(urlRemote, "http://dummy");
-    var origin = urlObj.host || "";
-
-    if ((urlObj.pathname || "").charAt(0) != "/") origin += "/";
-
-    origin += urlObj.pathname;
-    return [
-      urlObj.protocol || "http:",
-      "//",
-      url.resolve("", origin),
-      urlObj.search || ""
-    ].join("");
+    // If URL is invalid, try to construct it with a base.
+    return toSanitized(new URL(urlRemote, "http://dummy"), true);
   }
 }
🧰 Tools
🪛 Biome (2.4.4)

[error] 24-24: Shouldn't redeclare 'origin'. Consider to delete it or rename it.

(lint/suspicious/noRedeclare)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Address.ts` around lines 8 - 35, The fallback path currently uses new
URL(urlRemote, "http://dummy") which injects a dummy host and declares origin
with var; instead detect scheme-less inputs and normalize before parsing: if
urlRemote does not start with "http://", "https://", or "//" prepend "http://"
(or the desired default scheme) and then construct new URL(normalized); also
replace function-scoped var origin with let/const and keep the same
origin-building logic (use urlObj.host and urlObj.pathname) so the host is not
replaced by a dummy value and cache keys stay stable.

}

/** Last line of defence to filter malicious paths. */
Expand Down
21 changes: 11 additions & 10 deletions src/Cache.ts
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,6 @@
import * as fs from "fs";
import * as path from "path";

import * as url from "url";
import * as http from "http";
import * as stream from "stream";
import got from "got";
Expand Down Expand Up @@ -565,27 +564,29 @@ export class Cache {
private static forceRedirect(urlRemote: string, options: FetchOptions) {
if (!options.forceHost && !options.forcePort) return urlRemote;

var urlParts = url.parse(urlRemote);
const urlObj = new URL(urlRemote);
var changed = false;

if (!urlParts.hostname) return urlRemote;
if (!urlObj.hostname) return urlRemote;

if (options.forceHost && urlParts.hostname != options.forceHost) {
urlParts.hostname = options.forceHost;
if (options.forceHost && urlObj.hostname != options.forceHost) {
urlObj.hostname = options.forceHost;
changed = true;
}

if (options.forcePort && urlParts.port != "" + options.forcePort) {
urlParts.port = "" + options.forcePort;
if (options.forcePort && urlObj.port != "" + options.forcePort) {
urlObj.port = "" + options.forcePort;
changed = true;
}

if (!changed) return urlRemote;

urlParts.search = "?host=" + encodeURIComponent(urlParts.host || "");
urlParts.host = null as any;
const originalHost = urlObj.host;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: originalHost is captured from urlObj.host after urlObj.hostname and urlObj.port are modified, resulting in the new host being captured instead of the original one.
Severity: HIGH

Suggested Fix

Capture the original host into a variable immediately after the URL object is created with new URL(urlRemote), before any properties like hostname or port are modified.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: src/Cache.ts#L584

Potential issue: In the `forceRedirect` function, `urlObj.hostname` and `urlObj.port`
are modified based on `options`. Due to the dynamic nature of the WHATWG `URL` API, the
`urlObj.host` property automatically reflects these changes. The code captures
`urlObj.host` into the `originalHost` variable *after* these modifications have
occurred. This causes `originalHost` to contain the new, modified host instead of the
original one. This incorrect value is then used in the `?host=` query parameter,
breaking downstream logic that depends on receiving the original host.

Did we get this right? 👍 / 👎 to inform future reviews.

urlObj.search = "?host=" + encodeURIComponent(originalHost || "");
Comment on lines +584 to +585

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 originalHost captured after hostname/port mutation, preserving the new values instead of the original

In forceRedirect, the originalHost variable is captured on line 584 after urlObj.hostname and urlObj.port have already been mutated (lines 573, 578). This means originalHost contains the modified host (e.g., localhost:12345) instead of the original host (e.g., example.com). The ?host= query parameter is then set to the wrong value.

Root Cause and Impact

With url.parse, the old code saved the original host before setting host = null:

urlParts.search = "?host=" + encodeURIComponent(urlParts.host || ""); // original host
urlParts.host = null;

In the new code, the WHATWG URL object updates .host reactively when .hostname or .port are set. So by the time line 584 executes:

const originalHost = urlObj.host; // ALREADY 'localhost:12345', not 'example.com'

This breaks the test infrastructure's forceHost/forcePort mechanism, which relies on passing the original host in the ?host= query parameter so the test server knows which virtual host was originally requested.

Impact: The ?host= parameter will always contain the forced (new) host instead of the original host, breaking all redirect-based test scenarios.

Prompt for agents
In src/Cache.ts, in the forceRedirect method, the original host must be captured BEFORE mutating hostname and port. Move the line `const originalHost = urlObj.host;` to before the hostname/port modification block (i.e., before line 572). Specifically, insert `const originalHost = urlObj.host;` right after line 568 (`var changed = false;`), and remove line 584 (`const originalHost = urlObj.host;`).
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

// Remove the host to use the modified hostname/port
Comment on lines +572 to +586
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n src/Cache.ts | sed -n '560,600p'

Repository: loanlink-nl/cget

Length of output: 1463


🏁 Script executed:

cat > /tmp/test_url.js << 'EOF'
// Test WHATWG URL API behavior
const url = new URL("http://original.example.com:8080/path");
console.log("Original host:", url.host);
console.log("Original hostname:", url.hostname);
console.log("Original port:", url.port);

// Mutate hostname
url.hostname = "forced.example.com";
console.log("\nAfter hostname mutation:");
console.log("host:", url.host);
console.log("hostname:", url.hostname);
console.log("port:", url.port);

// Mutate port
url.port = "9090";
console.log("\nAfter port mutation:");
console.log("host:", url.host);
console.log("hostname:", url.hostname);
console.log("port:", url.port);
EOF
node /tmp/test_url.js

Repository: loanlink-nl/cget

Length of output: 348


Capture the original host before rewriting hostname/port.

At line 584, originalHost is captured after mutations at lines 572-580, so ?host= contains the forced host instead of the original remote host. Move the capture to before the mutations (after line 567).

Additionally, line 587's manual host reconstruction is redundant—the WHATWG URL API automatically updates the host property when hostname or port change. Remove it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Cache.ts` around lines 572 - 586, The code captures originalHost after
mutating urlObj.hostname/port, causing the ?host query to contain the forced
host; move the capture of originalHost to immediately after urlObj is created
(before any changes to urlObj.hostname or urlObj.port) so it holds the true
original remote host, and remove the redundant manual host reconstruction that
follows (relying on the WHATWG URL API to keep urlObj.host consistent when
hostname/port are changed); update references around urlObj, originalHost,
options.forceHost and options.forcePort accordingly.

urlObj.host = urlObj.hostname + (urlObj.port ? ":" + urlObj.port : "");

return url.format(urlParts);
return urlObj.href;
}

/** Queue for limiting parallel downloads. */
Expand Down
14 changes: 6 additions & 8 deletions test/serve.ts
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,6 @@ console.log(
import { assignIn } from "lodash";
import * as fs from "fs";
import { fsa } from "../dist/mkdirp";
import * as url from "url";
import * as http from "http";

import { Address, Cache } from "../dist/cget";
Expand Down Expand Up @@ -86,8 +85,8 @@ function checkRemoteLink(cachePath: string) {

var app = http.createServer(
(req: http.IncomingMessage, res: http.ServerResponse) => {
var urlParts = url.parse(req.url);
var args = parseArgs(urlParts.query);
const urlObj = new URL(req.url || "/", `http://${req.headers.host || "localhost"}`);
var args = parseArgs(urlObj.search.substring(1));
Comment on lines +88 to +89
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n test/serve.ts | sed -n '80,120p'

Repository: loanlink-nl/cget

Length of output: 1486


🏁 Script executed:

grep -n "reportError" test/serve.ts

Repository: loanlink-nl/cget

Length of output: 198


🏁 Script executed:

grep -n "function.*reportError\|const reportError\|var reportError" test/serve.ts

Repository: loanlink-nl/cget

Length of output: 84


🏁 Script executed:

cat -n test/serve.ts | sed -n '47,60p'

Repository: loanlink-nl/cget

Length of output: 454


Guard URL construction to prevent unhandled exceptions from crashing the test server.

At line 88, new URL() can throw on malformed req.url or req.headers.host values. Wrap the call in a try-catch block and return a 400 response via reportError().

Proposed fix
-    const urlObj = new URL(req.url || "/", `http://${req.headers.host || "localhost"}`);
-    var args = parseArgs(urlObj.search.substring(1));
+    let urlObj: URL;
+    try {
+      urlObj = new URL(req.url || "/", `http://${req.headers.host || "localhost"}`);
+    } catch {
+      reportError(res, 400);
+      return;
+    }
+    var args = parseArgs(urlObj.search.substring(1));
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const urlObj = new URL(req.url || "/", `http://${req.headers.host || "localhost"}`);
var args = parseArgs(urlObj.search.substring(1));
let urlObj: URL;
try {
urlObj = new URL(req.url || "/", `http://${req.headers.host || "localhost"}`);
} catch {
reportError(res, 400);
return;
}
var args = parseArgs(urlObj.search.substring(1));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/serve.ts` around lines 88 - 89, Wrap the URL construction that creates
urlObj (const urlObj = new URL(req.url || "/", `http://${req.headers.host ||
"localhost"}`)) in a try-catch to guard against malformed req.url or
req.headers.host; on error call reportError with a 400 response and a brief
message and return immediately so parseArgs is never called on an invalid
urlObj. Ensure the catch captures the thrown error (e) and includes enough
context in the reportError call to identify the bad URL/host and stops further
request handling.

var host = args["host"];

// TODO does localhost not count as a host?
Expand All @@ -105,13 +104,12 @@ var app = http.createServer(
return;
}

urlParts.protocol = "http";
urlParts.search = null;
urlParts.query = null;
urlParts.host = host;
urlObj.protocol = "http";
urlObj.search = "";
urlObj.host = host;
Comment on lines +107 to +109
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n test/serve.ts | sed -n '100,120p'

Repository: loanlink-nl/cget

Length of output: 710


🏁 Script executed:

cat -n test/serve.ts | sed -n '70,110p'

Repository: loanlink-nl/cget

Length of output: 1534


🌐 Web query:

WHATWG URL host property undefined assignment behavior

💡 Result:

Assigning undefined to URL.prototype.host does not clear the host.

  • URL.host is an IDL USVString, so the JS value undefined is first converted via Web IDL string conversion (i.e., effectively ToString(undefined)"undefined"), then to a scalar-value string. [2]
  • The host setter then runs Basic URL parse on that resulting string, using the current URL record and the host state. [1]
    • So url.host = undefined behaves the same as url.host = "undefined" (it attempts to set the host to the domain "undefined"). [1][2]
  • If the URL has an opaque path (e.g., mailto:), the host setter just returns without changing anything. [1]

Assigning undefined to URL.host produces "undefined" as the literal host string.

When host is missing but the guard allows localhost connections, assigning undefined to urlObj.host at line 109 causes WHATWG URL to convert it to the string "undefined" (as per IDL USVString conversion rules). This corrupts the URL used in the cache lookup at line 112.

     urlObj.protocol = "http";
     urlObj.search = "";
-    urlObj.host = host;
+    urlObj.host = host || "";

Use host || "" to prevent cache key corruption.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
urlObj.protocol = "http";
urlObj.search = "";
urlObj.host = host;
urlObj.protocol = "http";
urlObj.search = "";
urlObj.host = host || "";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/serve.ts` around lines 107 - 109, Assigning undefined to URL.host
(variable urlObj.host) makes the WHATWG URL stringify it as "undefined" and
corrupts the cache lookup; change the assignment to use a safe empty-string
fallback (use host || "" when setting urlObj.host) so missing host won't produce
the literal "undefined" and will keep the cache key consistent for the
subsequent cache lookup that uses urlObj.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Setting urlObj.host = undefined converts host to string "undefined" in serve.ts

In serve.ts line 109, urlObj.host = host is executed where host comes from args["host"] which may be undefined (when there's no ?host= query parameter). With the WHATWG URL API, setting .host to undefined coerces it to the string "undefined", producing URLs like http://undefined:12345/path.

Root Cause and Impact

The old code used urlParts.host = host with url.parse's result object. When host was undefined or null, url.format() would fall back to hostname and port properties, effectively keeping the original host.

With the WHATWG URL object, the .host setter converts any value to a string. So urlObj.host = undefined sets the host to literally "undefined". This happens in the localhost testing flow where no ?host= parameter is present but the request passes the localhost kludge check (lines 97-105).

The resulting URL http://undefined:12345/example.invalid/index.html is passed to new Address(...) which will fail to resolve to any valid cache path.

Impact: The test server will fail to serve any requests from localhost when no ?host= query parameter is provided.

Suggested change
urlObj.host = host;
if (host) urlObj.host = host;
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.


cache
.getCachePath(new Address(url.format(urlParts)))
.getCachePath(new Address(urlObj.href))
.then((cachePath: string) =>
checkRemoteLink(cachePath)
.then((urlRemote: string) => {
Expand Down
Loading