Conversation
We don't want every DANDI deployment indexable by search engines. For example, we will want to activate this for the prod deployment, but leave it off for sandbox.
waxlamp left a comment:
I think this is pretty straightforward, so I'm just looking for obvious pitfalls.
```python
DANDI_DEV_EMAIL: str
DANDI_ADMIN_EMAIL: str

DANDI_ENABLE_SITEMAP_XML: bool = env.bool('DJANGO_DANDI_ENABLE_SITEMAP_XML', default=False)
```
Is there a better name for the env var than this?
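For context, here is a minimal stand-in for how a django-environ-style `env.bool()` reads this variable. This is a hypothetical sketch, not django-environ's actual implementation, and the set of accepted truthy strings is an assumption:

```python
import os

# Hypothetical sketch of django-environ-style boolean parsing; the exact
# set of truthy strings accepted here is an assumption, not library code.
_TRUE_VALUES = {'true', '1', 'yes', 'on'}

def env_bool(name: str, default: bool = False) -> bool:
    raw = os.environ.get(name)
    if raw is None:
        return default  # unset -> sitemap stays off by default
    return raw.strip().lower() in _TRUE_VALUES

# Start from a clean slate so the default is observable.
os.environ.pop('DJANGO_DANDI_ENABLE_SITEMAP_XML', None)
assert env_bool('DJANGO_DANDI_ENABLE_SITEMAP_XML') is False

os.environ['DJANGO_DANDI_ENABLE_SITEMAP_XML'] = 'True'
assert env_bool('DJANGO_DANDI_ENABLE_SITEMAP_XML') is True
```

Whatever the name ends up being, the important property is that an unset variable falls back to the safe default (no sitemap).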
```python
changefreq = 'weekly'
priority = 0.5
```
These properties are defined at sitemaps.org and they seem to be merely advisory, so I tried to pick reasonable values for them.
"Weekly" seems like a good average-ish value for how frequently Dandisets might change.
The priority value is supposed to grant importance to some entries over others; since we're not doing that, a nice middle-of-the-road figure like 0.5 seems appropriate.
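To make that concrete, here is a hedged sketch of the `<url>` record these values end up in, per the sitemaps.org protocol. The helper function and example URL are illustrative, not the Django sitemap framework code this PR actually uses:

```python
import xml.etree.ElementTree as ET

def url_entry(loc: str, changefreq: str = 'weekly', priority: float = 0.5) -> str:
    """Render one sitemaps.org <url> record with the advisory fields."""
    url = ET.Element('url')
    ET.SubElement(url, 'loc').text = loc
    ET.SubElement(url, 'changefreq').text = changefreq
    ET.SubElement(url, 'priority').text = str(priority)
    return ET.tostring(url, encoding='unicode')

entry = url_entry('https://dandiarchive.org/dandiset/000001')
# -> <url><loc>https://dandiarchive.org/dandiset/000001</loc><changefreq>weekly</changefreq><priority>0.5</priority></url>
```

Crawlers treat both `changefreq` and `priority` as hints only, which is why middle-of-the-road values are fine here.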
```
DJANGO_DANDI_INSTANCE_NAME=DEV-DANDI
DJANGO_DANDI_INSTANCE_IDENTIFIER=RRID:ABC_123456
DJANGO_DANDI_DOI_API_PREFIX=10.80507
DJANGO_DANDI_ENABLE_SITEMAP_XML=True
```
I set this to True here so that the sitemaps are available for inspection, etc., in dev.
jjnesbitt left a comment:
Looks good overall. My only lingering concern is that we're generating sitemaps for a different domain than the one the API is hosted at (api.dandiarchive.org vs dandiarchive.org). Is there any issue with this?
dandiapi/api/views/robots.py (Outdated)
```python
Disallow: /""")

if settings.DANDI_ENABLE_SITEMAP_XML:
    parts.append(f'Sitemap: {settings.DANDI_API_URL}/sitemap.xml')
```
Do we have any way to verify that crawlers will properly pick this up?
We can (and should) manually submit the site to Google to reindex, but my previous experiment with SEO stuff (see #2663 (comment)) "just worked" after waiting a few days. So I think this will also be a bit of a wait-and-see.
I added some TODOs to the PR description to cover some operational stuff we need to do after merge.
Co-authored-by: Jacob Nesbitt <jjnesbitt2@gmail.com>
Missed this question in my initial response, sorry. Yes, I think this is a problem. According to the sitemaps protocol, a sitemap on a given host should only reference URLs that live on that host. I have an idea, stand by.
This is to prevent crawlers from reading the sitemap directly from the backend, since it references URLs not on the same host, which is not allowed (see https://www.sitemaps.org/protocol.html#location). This is in preparation for generating the sitemap server-side, but proxying it to the frontend so that the hosts align properly.
This copies the pattern established by the "server-info-build" plugin.
With this change, the backend always generates a sitemap file, but the ENABLE setting governs whether any URL entries are included in it. This simplifies deployment concerns: the backend always generates a sitemap, and the frontend always proxies it to `/sitemap.xml`, but now a deployment not wishing to advertise DLPs will simply not show any of those URLs.
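A hedged sketch of that behavior (illustrative only; the real view uses Django's sitemap framework, and the function and variable names here are assumptions): the document is always produced, but the flag gates whether it contains any `<url>` entries.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = 'http://www.sitemaps.org/schemas/sitemap/0.9'

def build_sitemap(enabled: bool, dandiset_urls: list[str]) -> str:
    # Always emit a valid <urlset> document; the flag only gates entries.
    urlset = ET.Element('urlset', xmlns=SITEMAP_NS)
    if enabled:
        for loc in dandiset_urls:
            url = ET.SubElement(urlset, 'url')
            ET.SubElement(url, 'loc').text = loc
    return ET.tostring(urlset, encoding='unicode')

urls = ['https://dandiarchive.org/dandiset/000001']
empty = build_sitemap(False, urls)  # valid document, no entries
full = build_sitemap(True, urls)
```

The design choice is that the frontend proxy never needs to know about the flag; it blindly serves whatever the backend generates.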
```
VITE_APP_DANDI_BACKEND_ROOT=http://localhost:8000
VITE_APP_DANDI_API_ROOT=http://localhost:8000/api/
VITE_APP_OAUTH_API_ROOT=http://localhost:8000/oauth/
```
I think these can be refactored now to inject the DANDI and API variables into the config at runtime, but it seemed like it was out of scope for this PR.
@jjnesbitt, this is ready for another review. There are some choices in the PR now that may seem a bit odd, so I'm happy to discuss any time.
This commit restores the original robots.txt view for the backend, and creates a new one specifically for the frontend. It's done this way to make it easier to generate the frontend's sitemap.xml URL within robots.txt (and it piggybacks off the similar proxying trick for sitemap.xml itself).
Co-authored-by: Jacob Nesbitt <jjnesbitt2@gmail.com>
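The frontend-facing robots.txt described in the commit above can be sketched roughly as follows. This is a hypothetical helper, not the actual Django view, and the exact directives it emits are assumptions:

```python
def frontend_robots_txt(frontend_base: str, sitemap_enabled: bool) -> str:
    # Hypothetical sketch: the Sitemap: URL must live on the same host
    # that serves robots.txt (see the sitemaps.org location rules), so it
    # points at the frontend domain rather than api.dandiarchive.org.
    parts = ['User-agent: *', 'Disallow:']
    if sitemap_enabled:
        parts.append(f'Sitemap: {frontend_base}/sitemap.xml')
    return '\n'.join(parts) + '\n'

body = frontend_robots_txt('https://dandiarchive.org', True)
```

Keeping a separate backend robots.txt (with its original `Disallow: /`) means crawlers never try to index API hosts directly.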
This makes it totally explicit to, e.g., Googlebot that when we're not serving a sitemap, there is no file at all, not even an empty one. This matters only for minor reasons, such as avoiding warnings from Google Search Console about an empty sitemap. Put another way, it eliminates any issues that might crop up when presenting an empty sitemap file to search engines, as opposed to not presenting any file at all.
This PR adds a sitemap.xml view to the DANDI backend that should help with Google presenting search results for individual Dandisets. Together with activating prerendering through Netlify (see #2663), this should solve most of our (current) SEO problems.
Details
Details

- Presents a `Sitemap` record to visiting crawlers
- A `Sitemap:` directive appears in the robots.txt view

Deployment
When this PR is merged/released, it will not have any immediate effect since the feature defaults to inactive. To activate it on select deployments, we need to set the `DJANGO_DANDI_ENABLE_SITEMAP_XML` environment variable to `True`; to do so for our prod deployment, the best way would be to send a PR to the `dandi-infrastructure` repo.

TODO
Closes #752 (this was already closed, but the current PR addresses it)
Closes #2230