-
-
Couldn't load subscription status.
- Fork 1.6k
feat(Vercel) Build cache improvements #15317
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Bundle ReportChanges will decrease total bundle size by 460.9kB (-1.96%) ⬇️. This is within the configured threshold ✅ Detailed changes
Affected Assets, Files, and Routes:view changes for bundle: sentry-docs-client-array-pushAssets Changed:
view changes for bundle: sentry-docs-server-cjsAssets Changed:
Files in
App Routes Affected:
|
- Use VERCEL_GIT_COMMIT_REF (branch name) in cache keys for cross-commit persistence - Include registry data hash in cache key to detect registry updates - Enable caching for 200+ platform-include files (previously skipped) - Add build timing instrumentation - Expected: 18 min → 2-3 min on first build, ~2 min on subsequent commits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great time savings. That said we should address the issues I raised either before merging or in a quick follow up.
Also, the extra comments are mostly stating the obvious (vibe code artifacts?) and better removed.
src/mdx.ts
Outdated
| let lastSummaryLog = Date.now(); | ||
| function logCacheSummary(force = false) { | ||
| const now = Date.now(); | ||
| // Log every 30 seconds or when forced |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic seems unnecessary? Why not just emit at the end?
| const skipCache = | ||
| // Check if file depends on Release Registry | ||
| const dependsOnRegistry = | ||
| source.includes('@inject') || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this @inject thing was related to the registry
src/mdx.ts
Outdated
| if (cachedRegistryHash) { | ||
| return cachedRegistryHash; | ||
| } | ||
| const [apps, packages] = await Promise.all([getAppRegistry(), getPackageRegistry()]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a race condition here: if you call this function 3 times back to back, it would make 3 separate calls.
What you need for proper caching is to change the type of cachedRegistryHash to Promise<string>, and do:
cachedRegistryHash = Promise.all(...). then(([apps, packages]) => md5(...));
return cachedRegistryHash;
src/mdx.ts
Outdated
| // Get registry hash (cached per worker to avoid redundant fetches) | ||
| const registryHash = await getRegistryHash(); | ||
| cacheKey = `${sourceHash}-${registryHash}`; | ||
| } catch (err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic can and probably should be improved: the only way this can throw an exception should be a network related issue. In that case, pages depending on the registry will also have a problem so the try-catch is redundant. It's also wasteful as if it raises an exception, that means it will raise an exception for every single page.
I'd rather add a retry mechanism into the cache key function and don't handle the exception if the retried fail, halting the build as we need the registry connection for the build.
| const leanHTML = rawHTML | ||
| // Remove all script tags (build IDs, chunk hashes, Vercel injections) | ||
| .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '') |
Check failure
Code scanning / CodeQL
Incomplete multi-character sanitization High
<script
| // Remove elements that change between builds but don't affect markdown output | ||
| const leanHTML = rawHTML | ||
| // Remove all script tags (build IDs, chunk hashes, Vercel injections) | ||
| .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '') |
Check failure
Code scanning / CodeQL
Bad HTML filtering regexp High
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 1 day ago
The best way to fix this problem is to use a proper HTML parser to remove unwanted tags (such as <script>, <link>, and <meta>), rather than relying on regular expressions. This provides more robust handling of HTML's intricacies, such as extra whitespace, unusual attribute formatting, and invalid but tolerated browser syntax. Since the script already imports rehype-parse (for parsing HTML to a syntax tree) and other tools from the unified/rehype ecosystem, the fix can use these existing libraries.
Specifically, instead of using .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '') (and similar regex for <link> and <meta>), we should parse the HTML into an AST, programmatically remove the unwanted nodes, and then serialize the AST back to HTML for further processing. This fix should be applied within the genMDFromHTML function, replacing the leanHTML construction (lines 233–242) with parser-based routines.
No new dependencies are needed since rehype-parse, unist-util-remove, and related packages are already imported. We'll need to use unified().use(rehypeParse, {fragment: true}) to parse the HTML, use remove(tree, test) from unist-util-remove to strip undesired nodes, and a rehype serializer (e.g., rehype-stringify) to convert the AST back to HTML. If not already available, we should add a rehype-stringify import.
-
Copy modified line R29 -
Copy modified lines R234-R237 -
Copy modified lines R239-R271
| @@ -26,6 +26,7 @@ | ||
| import remarkStringify from 'remark-stringify'; | ||
| import {unified} from 'unified'; | ||
| import {remove} from 'unist-util-remove'; | ||
| import rehypeStringify from 'rehype-stringify'; | ||
|
|
||
| const DOCS_ORIGIN = 'https://docs.sentry.io'; | ||
| const CACHE_VERSION = 3; | ||
| @@ -230,17 +231,44 @@ | ||
|
|
||
| // Normalize HTML to make cache keys deterministic across builds | ||
| // Remove elements that change between builds but don't affect markdown output | ||
| const leanHTML = rawHTML | ||
| // Remove all script tags (build IDs, chunk hashes, Vercel injections) | ||
| .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '') | ||
| // Remove link tags for stylesheets and preloads (chunk hashes change) | ||
| .replace(/<link[^>]*>/gi, '') | ||
| // Remove meta tags that might have build-specific content | ||
| .replace(/<meta name="next-size-adjust"[^>]*>/gi, '') | ||
| // Remove data attributes that Next.js/Vercel add (build IDs, etc.) | ||
| .replace(/\s+data-next-[a-z-]+="[^"]*"/gi, '') | ||
| .replace(/\s+data-nextjs-[a-z-]+="[^"]*"/gi, ''); | ||
| // Remove all <script>, <link>, and next-size-adjust <meta> tags, as well as data-* attributes, using an HTML parser. | ||
| const parsedHtmlTree = unified() | ||
| .use(rehypeParse, {fragment: true}) | ||
| .parse(rawHTML); | ||
|
|
||
| // Remove unwanted elements using unist-util-remove | ||
| // Remove <script> tags | ||
| remove(parsedHtmlTree, (node) => node.type === 'element' && node.tagName === 'script'); | ||
| // Remove <link> tags | ||
| remove(parsedHtmlTree, (node) => node.type === 'element' && node.tagName === 'link'); | ||
| // Remove <meta name="next-size-adjust" ...> | ||
| remove(parsedHtmlTree, (node) => | ||
| node.type === 'element' && | ||
| node.tagName === 'meta' && | ||
| node.properties && | ||
| node.properties.name === 'next-size-adjust' | ||
| ); | ||
| // Remove data-next-* and data-nextjs-* attributes from all elements | ||
| function cleanseDataAttrs(node) { | ||
| if (node && node.type === 'element' && node.properties) { | ||
| Object.keys(node.properties).forEach((key) => { | ||
| if (/^data-next(-|js-)/.test(key)) { | ||
| delete node.properties[key]; | ||
| } | ||
| }); | ||
| } | ||
| if (node.children) { | ||
| node.children.forEach(cleanseDataAttrs); | ||
| } | ||
| } | ||
| cleanseDataAttrs(parsedHtmlTree); | ||
|
|
||
| // Convert AST back to HTML | ||
| const leanHTML = unified() | ||
| .use(() => (tree) => tree) // identity plugin since tree already processed | ||
| .use(rehypeStringify) | ||
| .stringify(parsedHtmlTree); | ||
|
|
||
| if (shouldDebug) { | ||
| console.log( | ||
| `✂️ Lean HTML length: ${leanHTML.length} chars (removed ${rawHTML.length - leanHTML.length} chars)` |
DESCRIBE YOUR PR
IS YOUR CHANGE URGENT?
Help us prioritize incoming PRs by letting us know when the change needs to go live.
SLA
Thanks in advance for your help!
PRE-MERGE CHECKLIST
Make sure you've checked the following before merging your changes:
LEGAL BOILERPLATE
Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. and is gonna need some rights from me in order to utilize my contributions in this here PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.
EXTRA RESOURCES