93 changes: 84 additions & 9 deletions scripts/generate-md-exports.mjs
@@ -1,11 +1,12 @@
#!/usr/bin/env node
// testing cache
/* eslint-disable no-console */
import {ListObjectsV2Command, PutObjectCommand, S3Client} from '@aws-sdk/client-s3';
import imgLinks from '@pondorasti/remark-img-links';
import {selectAll} from 'hast-util-select';
import {createHash} from 'node:crypto';
import {createReadStream, createWriteStream, existsSync} from 'node:fs';
import {mkdir, opendir, readFile, rm, writeFile} from 'node:fs/promises';
import {mkdir, opendir, readdir, readFile, rm, writeFile} from 'node:fs/promises';
import {cpus} from 'node:os';
import * as path from 'node:path';
import {compose, Readable} from 'node:stream';
@@ -91,12 +92,15 @@
console.log(`💰 Cache directory: ${CACHE_DIR}`);
const noCache = !existsSync(CACHE_DIR);
if (noCache) {
console.log(`ℹ️ No cache directory found, this will take a while...`);
console.log(`ℹ️ No cache directory found, creating fresh cache...`);
await mkdir(CACHE_DIR, {recursive: true});
} else {
console.log(`✅ Cache directory exists, will attempt to use cached files`);
}

// On a 16-core machine, 8 workers were optimal (and slightly faster than 16)
const numWorkers = Math.max(Math.floor(cpus().length / 2), 2);
// Use 75% of CPU cores for optimal performance
const numWorkers = Math.max(Math.floor(cpus().length * 0.75), 2);
console.log(`⚙️ Using ${numWorkers} workers for ${cpus().length} CPU cores`);
const workerTasks = new Array(numWorkers).fill(null).map(() => []);

let existingFilesOnR2 = null;
@@ -196,13 +200,60 @@

const md5 = data => createHash('md5').update(data).digest('hex');

// Initialize debug counter
genMDFromHTML.debugCount = 0;

async function genMDFromHTML(source, target, {cacheDir, noCache}) {
const leanHTML = (await readFile(source, {encoding: 'utf8'}))
// Remove all script tags, as they are not needed in markdown
// and they are not stable across builds, causing cache misses
.replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '');
const rawHTML = await readFile(source, {encoding: 'utf8'});

// Debug: Log first 3 files to understand what's being removed
const shouldDebug = genMDFromHTML.debugCount < 3;
if (shouldDebug) {
genMDFromHTML.debugCount++;
const fileName = path.basename(source);
console.log(`\n🔍 DEBUG: Processing ${fileName}`);
console.log(`📏 Raw HTML length: ${rawHTML.length} chars`);

// Extract what we're removing to see if it's stable
const scripts = rawHTML.match(/<script[^>]*src="[^"]*"/gi);
const links = rawHTML.match(/<link[^>]*>/gi);

console.log(`📦 Found ${scripts?.length || 0} script tags with src`);
if (scripts && scripts.length > 0) {
console.log(` First 3: ${scripts.slice(0, 3).join(', ')}`);
}
console.log(`🔗 Found ${links?.length || 0} link tags`);
if (links && links.length > 0) {
console.log(` First 3: ${links.slice(0, 3).join(', ')}`);
}
}

// Normalize HTML to make cache keys deterministic across builds
// Remove elements that change between builds but don't affect markdown output
const leanHTML = rawHTML
// Remove all script tags (build IDs, chunk hashes, Vercel injections)
.replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
Comment on lines +233 to +235

Check failure: Code scanning / CodeQL
Incomplete multi-character sanitization (High)

This string may still contain <script, which may cause an HTML element injection vulnerability.
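To illustrate the first alert, a minimal reproduction (hypothetical payload, using the regex from this diff): the removal is a single pass, so nested markers can reassemble into a live tag.

const re = /<script[^>]*>[\s\S]*?<\/script>/gi;

// Hypothetical payload: both inner '<script></script>' pairs are removed,
// and the leftover fragments reassemble into a complete script tag.
const nested = '<scr<script></script>ipt>alert(1)</scr<script></script>ipt>';
console.log(nested.replace(re, ''));
// -> '<script>alert(1)</script>'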

Check failure: Code scanning / CodeQL
Bad HTML filtering regexp (High)

This regular expression does not match script end tags like </script >.
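And to illustrate the second alert (again a hypothetical input): whitespace before the closing '>' is valid HTML for an end tag, but the pattern requires a literal '</script>', so the payload survives untouched.

const re = /<script[^>]*>[\s\S]*?<\/script>/gi;

const spaced = '<script>alert(1)</script >';
console.log(spaced.replace(re, ''));
// -> '<script>alert(1)</script >' (unchanged; a browser would still execute it)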

Copilot Autofix (AI, 2 days ago)

The best way to fix this problem is to use a proper HTML parser to remove unwanted tags (such as <script>, <link>, and <meta>), rather than relying on regular expressions. This provides more robust handling of HTML's intricacies, such as extra whitespace, unusual attribute formatting, and invalid but tolerated browser syntax. Since the script already imports rehype-parse (for parsing HTML to a syntax tree) and other tools from the unified/rehype ecosystem, the fix can use these existing libraries.

Specifically, instead of using .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '') (and similar regex for <link> and <meta>), we should parse the HTML into an AST, programmatically remove the unwanted nodes, and then serialize the AST back to HTML for further processing. This fix should be applied within the genMDFromHTML function, replacing the leanHTML construction (lines 233–242) with parser-based routines.

No new dependencies are needed since rehype-parse, unist-util-remove, and related packages are already imported. We'll need to use unified().use(rehypeParse, {fragment: true}) to parse the HTML, use remove(tree, test) from unist-util-remove to strip undesired nodes, and a rehype serializer (e.g., rehype-stringify) to convert the AST back to HTML. If not already available, we should add a rehype-stringify import.


Suggested changeset 1: scripts/generate-md-exports.mjs

Autofix patch. Run the following command in your local git repository to apply this patch:
cat << 'EOF' | git apply
diff --git a/scripts/generate-md-exports.mjs b/scripts/generate-md-exports.mjs
--- a/scripts/generate-md-exports.mjs
+++ b/scripts/generate-md-exports.mjs
@@ -26,6 +26,7 @@
 import remarkStringify from 'remark-stringify';
 import {unified} from 'unified';
 import {remove} from 'unist-util-remove';
+import rehypeStringify from 'rehype-stringify';
 
 const DOCS_ORIGIN = 'https://docs.sentry.io';
 const CACHE_VERSION = 3;
@@ -230,17 +231,44 @@
 
   // Normalize HTML to make cache keys deterministic across builds
   // Remove elements that change between builds but don't affect markdown output
-  const leanHTML = rawHTML
-    // Remove all script tags (build IDs, chunk hashes, Vercel injections)
-    .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
-    // Remove link tags for stylesheets and preloads (chunk hashes change)
-    .replace(/<link[^>]*>/gi, '')
-    // Remove meta tags that might have build-specific content
-    .replace(/<meta name="next-size-adjust"[^>]*>/gi, '')
-    // Remove data attributes that Next.js/Vercel add (build IDs, etc.)
-    .replace(/\s+data-next-[a-z-]+="[^"]*"/gi, '')
-    .replace(/\s+data-nextjs-[a-z-]+="[^"]*"/gi, '');
+  // Remove all <script>, <link>, and next-size-adjust <meta> tags, as well as data-* attributes, using an HTML parser.
+  const parsedHtmlTree = unified()
+    .use(rehypeParse, {fragment: true})
+    .parse(rawHTML);
 
+  // Remove unwanted elements using unist-util-remove
+  // Remove <script> tags
+  remove(parsedHtmlTree, (node) => node.type === 'element' && node.tagName === 'script');
+  // Remove <link> tags
+  remove(parsedHtmlTree, (node) => node.type === 'element' && node.tagName === 'link');
+  // Remove <meta name="next-size-adjust" ...>
+  remove(parsedHtmlTree, (node) =>
+    node.type === 'element' &&
+    node.tagName === 'meta' &&
+    node.properties &&
+    node.properties.name === 'next-size-adjust'
+  );
+  // Remove data-next-* and data-nextjs-* attributes from all elements
+  function cleanseDataAttrs(node) {
+    if (node && node.type === 'element' && node.properties) {
+      Object.keys(node.properties).forEach((key) => {
+        if (/^data-next(-|js-)/.test(key)) {
+          delete node.properties[key];
+        }
+      });
+    }
+    if (node.children) {
+      node.children.forEach(cleanseDataAttrs);
+    }
+  }
+  cleanseDataAttrs(parsedHtmlTree);
+
+  // Convert AST back to HTML
+  const leanHTML = unified()
+    .use(() => (tree) => tree) // identity plugin since tree already processed
+    .use(rehypeStringify)
+    .stringify(parsedHtmlTree);
+
   if (shouldDebug) {
     console.log(
       `✂️  Lean HTML length: ${leanHTML.length} chars (removed ${rawHTML.length - leanHTML.length} chars)`
EOF
Unable to commit as this autofix suggestion is now outdated
// Remove link tags for stylesheets and preloads (chunk hashes change)
.replace(/<link[^>]*>/gi, '')
// Remove meta tags that might have build-specific content
.replace(/<meta name="next-size-adjust"[^>]*>/gi, '')
// Remove data attributes that Next.js/Vercel add (build IDs, etc.)
.replace(/\s+data-next-[a-z-]+="[^"]*"/gi, '')
.replace(/\s+data-nextjs-[a-z-]+="[^"]*"/gi, '');

if (shouldDebug) {
console.log(
`✂️ Lean HTML length: ${leanHTML.length} chars (removed ${rawHTML.length - leanHTML.length} chars)`
);
}

const cacheKey = `v${CACHE_VERSION}_${md5(leanHTML)}`;
const cacheFile = path.join(cacheDir, cacheKey);

if (shouldDebug) {
console.log(`🔑 Cache key: ${cacheKey}`);
}

if (!noCache) {
try {
const data = await text(
@@ -214,6 +265,17 @@
} catch (err) {
if (err.code !== 'ENOENT') {
console.warn(`Error using cache file ${cacheFile}:`, err);
} else if (shouldDebug) {
// Cache miss on debug file - show what's in cache
console.log(`❌ Cache miss! Looking for: ${cacheKey}`);
try {
const allCacheFiles = await readdir(cacheDir);
const v3Files = allCacheFiles.filter(f => f.startsWith('v3_')).slice(0, 3);
console.log(` Existing v3 files in cache:`);
v3Files.forEach(f => console.log(` - ${f}`));
} catch (e) {
console.log(` Could not read cache dir: ${e.message}`);
}
Bug: Debug Code and Logging Statements Left In

Debug code and logging statements have been accidentally committed. This includes:

  1. Debug counter initialization at line 203: genMDFromHTML.debugCount = 0;
  2. Debug logging for the first 3 files processed (lines 209-228): logs raw HTML length, script tags, link tags
  3. Debug logging showing HTML compression stats (lines 235-237)
  4. Debug logging showing cache key (lines 242-244)
  5. Debug logging showing cache directory contents on miss (lines 257-268): lists existing v3 cache files

These debug statements were likely left in during investigation of build cache behavior and should be removed before production deployment. They will produce unnecessary console output and logging overhead during normal builds.


}
}
}
@@ -311,8 +373,11 @@
const s3Client = getS3Client();
const failedTasks = [];
let cacheMisses = [];
let cacheHits = 0;
Bug: Stream Duplication Needed for Parallel Pipelines

The same readable stream (reader) is being used in two separate pipelines simultaneously. In Node.js, a readable stream can only be consumed once. When the first pipeline consumes the stream, it will end it, causing the second pipeline to fail with no data. This should be fixed by creating two separate readable streams: const reader1 = Readable.from(data); const reader2 = Readable.from(data); and using reader1 in the first pipeline and reader2 in the second.
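A minimal sketch of that fix (the function name and the two compressed destinations are assumptions, not the PR's actual code): materialize the payload once, then give each pipeline its own Readable.

import {createWriteStream} from 'node:fs';
import {Readable} from 'node:stream';
import {pipeline} from 'node:stream/promises';
import {createBrotliCompress, createGzip} from 'node:zlib';

// Sketch only: each pipeline consumes its own stream, so neither
// starves the other (a Readable can only be consumed once).
async function writeCompressedCopies(data, basePath) {
  const reader1 = Readable.from(data);
  const reader2 = Readable.from(data);
  await Promise.all([
    pipeline(reader1, createGzip(), createWriteStream(`${basePath}.gz`)),
    pipeline(reader2, createBrotliCompress(), createWriteStream(`${basePath}.br`)),
  ]);
}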


let r2CacheMisses = [];
console.log(`🤖 Worker[${id}]: Starting to process ${tasks.length} files...`);
console.log(
`🤖 Worker[${id}]: Starting to process ${tasks.length} files... (noCache=${noCache})`
);
for (const {sourcePath, targetPath, relativePath, r2Hash} of tasks) {
try {
const {data, cacheHit} = await genMDFromHTML(sourcePath, targetPath, {
@@ -321,6 +386,8 @@
});
if (!cacheHit) {
cacheMisses.push(relativePath);
} else {
cacheHits++;
}

if (r2Hash !== null) {
@@ -336,6 +403,14 @@
}
}
const success = tasks.length - failedTasks.length;

// Log cache statistics
const cacheHitRate = ((cacheHits / tasks.length) * 100).toFixed(1);
const cacheMissRate = ((cacheMisses.length / tasks.length) * 100).toFixed(1);
Bug: Zero Tasks Cause NaN Cache Rates

Division by zero when tasks.length is 0. If a worker is assigned zero tasks (which can happen when there are fewer files than workers), the cache hit rate and miss rate calculations will produce NaN values, because 0/0 evaluates to NaN in JavaScript. This would result in "NaN%" being logged to the console.
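A guard along these lines would avoid the NaN (a sketch, not the committed fix):

// Avoid 0/0 (NaN) when a worker was assigned no tasks
const total = tasks.length;
const cacheHitRate = total > 0 ? ((cacheHits / total) * 100).toFixed(1) : '0.0';
const cacheMissRate = total > 0 ? ((cacheMisses.length / total) * 100).toFixed(1) : '0.0';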


console.log(
`📊 Worker[${id}]: Cache stats - ${cacheHits}/${tasks.length} hits (${cacheHitRate}%), ${cacheMisses.length} misses (${cacheMissRate}%)`
);

if (r2CacheMisses.length / tasks.length > 0.1) {
console.warn(
`⚠️ Worker[${id}]: More than 10% of files had a different hash on R2 with the generation process.`
151 changes: 122 additions & 29 deletions src/mdx.ts
@@ -65,6 +65,80 @@ if (process.env.CI) {
mkdirSync(CACHE_DIR, {recursive: true});
}

// Cache registry hash per worker to avoid recomputing for every file
let cachedRegistryHash: Promise<string> | null = null;
async function getRegistryHashWithRetry(
maxRetries = 3,
initialDelayMs = 1000
): Promise<string> {
let lastError: Error | null = null;
let delayMs = initialDelayMs;
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
const [apps, packages] = await Promise.all([
getAppRegistry(),
getPackageRegistry(),
]);
return md5(JSON.stringify({apps, packages}));
} catch (err) {
lastError = err as Error;
if (attempt < maxRetries) {
const currentDelay = delayMs;
// eslint-disable-next-line no-console
console.warn(
`Failed to fetch registry (attempt ${attempt + 1}/${maxRetries + 1}): ${lastError.message}. Retrying in ${currentDelay}ms...`
);
await new Promise(resolve => setTimeout(resolve, currentDelay));
delayMs *= 2; // Exponential backoff
}
}
}
throw new Error(
`Failed to fetch registry after ${maxRetries + 1} attempts: ${lastError?.message}`
);
}

function getRegistryHash(): Promise<string> {
if (!cachedRegistryHash) {
cachedRegistryHash = getRegistryHashWithRetry().catch(err => {
// Clear cache on error to allow retry on next call
cachedRegistryHash = null;
throw err;
});
}
return cachedRegistryHash;
}

// Track cache statistics per worker (silent tracking)
const cacheStats = {
registryHits: 0,
registryMisses: 0,
uniqueRegistryFiles: new Set<string>(),
};

// Log summary at end
function logCacheSummary() {
const total = cacheStats.registryHits + cacheStats.registryMisses;
if (total === 0) {
return;
}

const hitRate = ((cacheStats.registryHits / total) * 100).toFixed(1);
const uniqueFiles = cacheStats.uniqueRegistryFiles.size;

// eslint-disable-next-line no-console
console.log(
`📊 [MDX Cache] ${cacheStats.registryHits}/${total} registry files cached (${hitRate}% hit rate, ${uniqueFiles} unique files)`
);
}

// Log final summary when worker exits
if (typeof process !== 'undefined') {
process.on('beforeExit', () => {
logCacheSummary();
});
}

const md5 = (data: BinaryLike) => createHash('md5').update(data).digest('hex');

async function readCacheFile<T>(file: string): Promise<T> {
@@ -209,6 +283,7 @@ export async function getDevDocsFrontMatterUncached(): Promise<FrontMatter[]> {
)
)
).filter(isNotNil);

return frontMatters;
}

@@ -396,6 +471,7 @@ async function getAllFilesFrontMatter(): Promise<FrontMatter[]> {
);
}
}

return allFrontMatter;
}

@@ -531,45 +607,61 @@ export async function getFileBySlug(slug: string): Promise<SlugFile> {
const outdir = path.join(root, 'public', 'mdx-images');
await mkdir(outdir, {recursive: true});

// If the file contains content that depends on the Release Registry (such as an SDK's latest version), avoid using the cache for that file, i.e. always rebuild it.
// This is because the content from the registry might have changed since the last time the file was cached.
// If a new component that injects content from the registry is introduced, it should be added to the patterns below.
const skipCache =
// Check if file depends on Release Registry
const dependsOnRegistry =
source.includes('@inject') ||
Member: Not sure if this @inject thing was related to the registry

source.includes('<PlatformSDKPackageName') ||
source.includes('<LambdaLayerDetail');

if (process.env.CI) {
if (skipCache) {
// eslint-disable-next-line no-console
console.info(
`Not using cached version of ${sourcePath}, as its content depends on the Release Registry`
);
// Build cache key from source content
const sourceHash = md5(source);

// For files that depend on registry, include registry version in cache key
// This prevents serving stale content when registry is updated
if (dependsOnRegistry) {
// Get registry hash (cached per worker to avoid redundant fetches)
// If this fails, the build will fail - registry is required for these files
const registryHash = await getRegistryHash();
cacheKey = `${sourceHash}-${registryHash}`;
} else {
cacheKey = md5(source);
cacheFile = path.join(CACHE_DIR, `${cacheKey}.br`);
assetsCacheDir = path.join(CACHE_DIR, cacheKey);
// Regular files without registry dependencies
cacheKey = sourceHash;
}

try {
const [cached, _] = await Promise.all([
readCacheFile<SlugFile>(cacheFile),
cp(assetsCacheDir, outdir, {recursive: true}),
]);
return cached;
} catch (err) {
if (
err.code !== 'ENOENT' &&
err.code !== 'ABORT_ERR' &&
err.code !== 'Z_BUF_ERROR'
) {
// If cache is corrupted, ignore and proceed
// eslint-disable-next-line no-console
console.warn(`Failed to read MDX cache: ${cacheFile}`, err);
}
cacheFile = path.join(CACHE_DIR, `${cacheKey}.br`);
assetsCacheDir = path.join(CACHE_DIR, cacheKey);

try {
const [cached, _] = await Promise.all([
readCacheFile<SlugFile>(cacheFile),
cp(assetsCacheDir, outdir, {recursive: true}),
]);
// Track cache hit silently
if (dependsOnRegistry) {
cacheStats.registryHits++;
cacheStats.uniqueRegistryFiles.add(sourcePath);
}
return cached;
} catch (err) {
if (
err.code !== 'ENOENT' &&
err.code !== 'ABORT_ERR' &&
err.code !== 'Z_BUF_ERROR'
) {
// If cache is corrupted, ignore and proceed
// eslint-disable-next-line no-console
console.warn(`Failed to read MDX cache: ${cacheFile}`, err);
}
}
}

// Track cache miss silently
if (process.env.CI && dependsOnRegistry) {
cacheStats.registryMisses++;
cacheStats.uniqueRegistryFiles.add(sourcePath);
}
Bug: Cache Miss Tracking Fails in Non-CI Environments

The cache miss tracking for registry-dependent files occurs unconditionally, even when process.env.CI is false and the cache system is not being used. The condition should be if (process.env.CI && dependsOnRegistry) instead of just if (dependsOnRegistry) to avoid recording false cache misses when outside of CI environments. This causes misleading cache statistics when the caching system isn't active.



process.env.ESBUILD_BINARY_PATH = path.join(
root,
'node_modules',
@@ -700,7 +792,8 @@ export async function getFileBySlug(slug: string): Promise<SlugFile> {
},
};

if (assetsCacheDir && cacheFile && !skipCache) {
// Save to cache if we have a cache key (we now cache everything, including registry-dependent files)
if (assetsCacheDir && cacheFile && cacheKey) {
await cp(assetsCacheDir, outdir, {recursive: true});
writeCacheFile(cacheFile, JSON.stringify(resultObj)).catch(e => {
// eslint-disable-next-line no-console