Add sitemap.xml #2669

Open
waxlamp wants to merge 12 commits into master from sitemapxml

waxlamp commented Dec 12, 2025

This PR adds a sitemap.xml view to the DANDI backend that should help with Google presenting search results for individual Dandisets. Together with activating prerendering through Netlify (see #2663), this should solve most of our (current) SEO problems.

Details

  • The sitemap.xml view is generated by Django's sitemaps framework
  • A new robots.txt view presents a Sitemap record to visiting crawlers
  • Both of these views are generated server-side and proxied to the frontend (using Vite's server proxy config for dev, and Netlify redirects for the sandbox and production deployments)
  • While the sitemap is always generated, there's a new setting to control whether the sitemap entries are generated. This is because we will not want all DANDI deployments to be search engine indexed. The most salient example is the sandbox deployment: we only want Google to turn up references to production, so we'll leave this turned off for sandbox. This setting also controls whether the Sitemap: directive appears in the robots.txt view.
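
As a rough illustration of the last two points, here is a minimal, framework-free sketch of a robots.txt body that advertises the sitemap only when the feature is switched on. The function name `build_robots_txt`, the `site_url` parameter, and the crawl rules are assumptions made for this example; they are not the view code in this PR.

```python
def build_robots_txt(site_url: str, sitemap_enabled: bool) -> str:
    """Return robots.txt text, adding a Sitemap record only when enabled."""
    # Hypothetical crawl rules, used here only to make the output complete.
    parts = ["User-agent: *", "Allow: /"]
    if sitemap_enabled:
        # The Sitemap record takes an absolute URL to the sitemap file.
        parts.append(f"Sitemap: {site_url.rstrip('/')}/sitemap.xml")
    return "\n".join(parts) + "\n"
```

With the flag off, the output simply carries no `Sitemap:` line, which is the behavior wanted for a deployment such as sandbox.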

Deployment

When this PR is merged/released, it will not have any immediate effect since the feature defaults to inactive. To activate it on select deployments, we need to set the DJANGO_DANDI_ENABLE_SITEMAP_XML environment variable to True; to do so for our prod deployment, the best way would be to send a PR to the dandi-infrastructure repo.
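
To make the "defaults to inactive" behavior concrete, here is a small standard-library sketch of reading a boolean flag from the environment. The PR itself uses `env.bool(...)` for this; the helper `env_flag` and its set of accepted truthy strings are assumptions for illustration and may not match that library's exact parsing rules.

```python
import os

# Assumed set of strings treated as "true" (illustrative only).
TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name, default=False, environ=None):
    """Read a boolean flag from the environment, falling back to a default."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        # Variable not set: the feature keeps its default (inactive) state.
        return default
    return raw.strip().lower() in TRUTHY
```

Under these assumptions, `env_flag("DJANGO_DANDI_ENABLE_SITEMAP_XML")` is false on any deployment that does not set the variable.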

TODO

  • after merge/release, set up a dandi-infrastructure PR to activate for prod
  • after merge/release, submit site to Google for reindexing

Closes #752 (this was already closed, but the current PR addresses it)
Closes #2230

We don't want every DANDI deployment indexable by search engines. For
example, we will want to activate this for the prod deployment, but
leave it off for sandbox.

waxlamp left a comment

I think this is pretty straightforward, so I'm just looking for obvious pitfalls.

DANDI_DEV_EMAIL: str
DANDI_ADMIN_EMAIL: str

DANDI_ENABLE_SITEMAP_XML: bool = env.bool('DJANGO_DANDI_ENABLE_SITEMAP_XML', default=False)

Is there a better name for the env var than this?

Comment on lines +10 to +11
changefreq = 'weekly'
priority = 0.5

These properties are defined at sitemaps.org and they seem to be merely advisory, so I tried to pick reasonable values for them.

"Weekly" seems like a good average-ish value for how frequently Dandisets might change.

The priority value is supposed to grant importance to some entries over others; since we're not doing that, a nice middle-of-the-road figure like 0.5 seems appropriate.
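
For reference, a minimal sketch of what these two properties look like in the generated XML, written with the standard library rather than Django's sitemaps framework. The function `render_sitemap` and the example URL are hypothetical; the element names and namespace follow the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def render_sitemap(urls, changefreq="weekly", priority=0.5):
    """Render a sitemap <urlset> with one <url> entry per location."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = loc
        # Both values are advisory hints to crawlers, not guarantees.
        ET.SubElement(entry, "changefreq").text = changefreq
        ET.SubElement(entry, "priority").text = f"{priority:.1f}"
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical landing page URL, for illustration only.
example_xml = render_sitemap(["https://example.org/dandiset/000001"])
```

Because every entry gets the same `priority`, the value carries no relative ranking information, which matches the intent described above.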

DJANGO_DANDI_INSTANCE_NAME=DEV-DANDI
DJANGO_DANDI_INSTANCE_IDENTIFIER=RRID:ABC_123456
DJANGO_DANDI_DOI_API_PREFIX=10.80507
DJANGO_DANDI_ENABLE_SITEMAP_XML=True

I set this to True here so that the sitemaps are available for inspection, etc., in dev.

jjnesbitt left a comment

Looks good overall. My only lingering concern is that we're generating sitemaps for a different domain than the one the API is hosted at (api.dandiarchive.org vs dandiarchive.org). Is there any issue with this?

Disallow: /""")

if settings.DANDI_ENABLE_SITEMAP_XML:
    parts.append(f'Sitemap: {settings.DANDI_API_URL}/sitemap.xml')

Do we have any way to verify that crawlers will properly pick this up?

waxlamp commented Dec 17, 2025

We can (and should) manually submit the site to Google to reindex, but my previous experiment with SEO stuff (see #2663 (comment)) "just worked" after waiting a few days. So I think this will also be a bit of a wait-and-see.

I added some TODOs to the PR description to cover some operational stuff we need to do after merge.
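
Short of waiting on Google, one cheap local check is to confirm that a standards-following parser can at least read the Sitemap record out of the served robots.txt. The sketch below uses Python's `urllib.robotparser` (its `site_maps()` method needs Python 3.8 or newer); the helper name `advertised_sitemaps` and the sample robots.txt text are assumptions for illustration. This only shows the record is well-formed, not that any particular crawler will act on it.

```python
from urllib.robotparser import RobotFileParser


def advertised_sitemaps(robots_txt: str):
    """Return the sitemap URLs that a robots.txt body advertises."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap records are present.
    return parser.site_maps() or []


# Hypothetical robots.txt content, for illustration only.
sample = "User-agent: *\nAllow: /\nSitemap: https://example.org/sitemap.xml\n"
found = advertised_sitemaps(sample)
```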

Co-authored-by: Jacob Nesbitt <jjnesbitt2@gmail.com>
waxlamp marked this pull request as draft December 22, 2025 16:21

waxlamp commented Dec 22, 2025

> My only lingering concern is that we're generating sitemaps for a different domain than the one the API is hosted at (api.dandiarchive.org vs dandiarchive.org). Is there any issue with this?

Missed this question in my initial response, sorry.

Yes, I think this is a problem. According to the sitemaps protocol, a sitemap on a given host should only reference URLs that live on that host.

I have an idea, stand by.

This is to prevent crawlers from reading the sitemap directly from the
backend, since it references URLs not on the same host, which is not
allowed (see https://www.sitemaps.org/protocol.html#location).
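
The host mismatch described here can be checked mechanically. The following is a minimal sketch that compares only the host portion of each entry against the host serving the sitemap; the protocol's location rules also cover scheme and path, which this sketch ignores. The function name `off_host_urls` and the Dandiset URL path are hypothetical.

```python
from urllib.parse import urlsplit


def off_host_urls(sitemap_url, entry_urls):
    """Return the entries whose host differs from the sitemap's own host."""
    sitemap_host = urlsplit(sitemap_url).netloc.lower()
    return [
        url for url in entry_urls
        if urlsplit(url).netloc.lower() != sitemap_host
    ]


# Hypothetical landing page URL, for illustration only.
entries = ["https://dandiarchive.org/dandiset/000003"]
served_from_api = off_host_urls("https://api.dandiarchive.org/sitemap.xml", entries)
served_from_frontend = off_host_urls("https://dandiarchive.org/sitemap.xml", entries)
```

Served from the API host, every entry is flagged; served from the frontend host, none are, which is the motivation for proxying the file to the frontend.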

This is in preparation for generating the sitemap server-side, but
proxying it to the frontend so that the hosts align properly.
This copies the pattern established by the "server-info-build" plugin.
With this change, the backend always generates a sitemap file, but the
ENABLE setting governs whether any URL entries are included in it. This
simplifies deployment concerns: the backend always generates a sitemap,
and the frontend always proxies it to `/sitemap.xml`, but now a
deployment not wishing to advertise DLPs will simply not show any of
those URLs.
waxlamp changed the title from "Add sitemap.xml to backend" to "Add sitemap.xml" Dec 23, 2025
Comment on lines +1 to 3
VITE_APP_DANDI_BACKEND_ROOT=http://localhost:8000
VITE_APP_DANDI_API_ROOT=http://localhost:8000/api/
VITE_APP_OAUTH_API_ROOT=http://localhost:8000/oauth/

I think these can be refactored now to inject the DANDI and API variables into the config at runtime, but it seemed like it was out of scope for this PR.

waxlamp requested a review from jjnesbitt December 24, 2025 23:38

waxlamp commented Dec 24, 2025

@jjnesbitt, this is ready for another review. There are some choices in the PR now that may seem a bit odd, so I'm happy to discuss any time.

waxlamp marked this pull request as ready for review December 24, 2025 23:39
This commit restores the original robots.txt view for the backend, and
creates a new one specifically for the frontend. It's done this way to
make it easier to generate the frontend's sitemap.xml URL within
robots.txt (and it piggybacks off the similar proxying trick for
sitemap.xml itself).
waxlamp and others added 2 commits January 21, 2026 16:16
Co-authored-by: Jacob Nesbitt <jjnesbitt2@gmail.com>
This is to be totally explicit with, e.g., Googlebot that when we're not
serving a sitemap file, there isn't even an empty file to present. This
matters only for minor reasons, such as avoiding warnings from Google
Search Console about an empty sitemap, etc. Or, put another way, it
eliminates any possible issues that might crop up when presenting an
empty sitemap file to search engines (as opposed to not presenting any
file at all).

Development

Successfully merging this pull request may close these issues.

Even better indexing for search engines: sitemap.xml
Better indexing for search engines