geotessera-registry: Zarr v3 zone stores with global RGB previews#211

Open
avsm wants to merge 29 commits into ucam-eo:main from avsm:zarr-convert

avsm commented Mar 7, 2026

Add zone-wide Zarr v3 store support for consolidating per-tile embeddings into efficient, cloud-native stores grouped by UTM zone and year.

New modules:

  • zarr_zone: Core logic for building sharded Zarr v3 stores, RGB previews, global EPSG:4326 preview pyramids, and spatial subsetting
  • tiles: Format-agnostic tile abstraction supporting npy, geotiff, zarr, and zone_zarr backends

New geotessera-registry CLI commands:

  • zarr-build: Build zone-wide stores from downloaded tile data with shard-first parallel writes using ProcessPoolExecutor
  • global-preview: Build multiscale EPSG:4326 RGB preview pyramid from per-zone UTM stores, with per-chunk reprojection
  • stac-index: Generate static STAC catalog from Zarr stores

Store layout uses 256x256 shards with 4x4 inner chunks (zstd-compressed), enabling single-pixel HTTP range lookups (~2KB). NaN in scales indicates no-data (water/no coverage).
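
To make the layout concrete, here is a minimal sketch of how a pixel coordinate maps onto a shard and an inner chunk. The function names are hypothetical and not part of `zarr_zone`. The band count and item size in `chunk_bytes` are assumptions (this description does not state them); they are chosen only to show how a figure near 2 KB could arise before compression.

```python
SHARD = 256  # shard edge in pixels (from the description above)
CHUNK = 4    # inner chunk edge in pixels (from the description above)
CHUNKS_PER_SHARD = (SHARD // CHUNK) ** 2  # 64 x 64 inner chunks per shard


def locate_pixel(row: int, col: int) -> dict:
    """Return the shard, inner chunk, and in-chunk offset containing (row, col)."""
    in_shard = (row % SHARD, col % SHARD)
    return {
        "shard": (row // SHARD, col // SHARD),
        "chunk_in_shard": (in_shard[0] // CHUNK, in_shard[1] // CHUNK),
        "offset_in_chunk": (in_shard[0] % CHUNK, in_shard[1] % CHUNK),
    }


def chunk_bytes(n_bands: int = 128, itemsize: int = 1) -> int:
    """Uncompressed size of one inner chunk.

    ASSUMPTION: 128 one-byte bands per pixel; neither value is stated in this PR.
    """
    return CHUNK * CHUNK * n_bands * itemsize


print(locate_pixel(1000, 517))
# {'shard': (3, 2), 'chunk_in_shard': (58, 1), 'offset_in_chunk': (0, 1)}
print(chunk_bytes())  # 2048
```

A client that knows the shard index only needs the byte range of one inner chunk, which is what keeps a single-pixel lookup small.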

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
avsm and others added 28 commits March 7, 2026 12:12
The zarr-build and global-preview tests take too long to run in CI.
Per-tile zarr format tests are retained.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update all Zarr store metadata to comply with the three GeoZarr
conventions (proj:, spatial:, multiscales) from zarr-conventions.

Zone stores now write:
- zarr_conventions registration array
- proj:code, proj:wkt2 (replacing crs_epsg, crs_wkt)
- spatial:dimensions, spatial:transform, spatial:shape, spatial:bbox,
  spatial:registration (replacing transform)

Global preview stores now write:
- spatial: convention in zarr_conventions registration
- spatial:dimensions, spatial:bbox (replacing non-standard spatial dict)
- Per-level spatial:shape and spatial:transform in multiscale layout
- Remove non-standard multiscales.crs key

All readers updated to use new namespaced keys. No backwards-compat
shims -- use `geotessera-registry zarr-migrate-attrs` to upgrade
existing stores in-place.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Track completed zones in global store attrs (_completed_zones) so
that a crash/OOM mid-build can be resumed without reprocessing
finished zones. The checkpoint is cleared once all zones complete.

Add --force flag to global-preview CLI to bypass checkpoints.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace metadata-based zone checkpointing with data scanning:

- Zone level: sample 8 interior pixels from level-0; if all have
  non-zero alpha, skip the zone's reproject + pyramid entirely
- Chunk level: each reproject worker reads the target chunk first;
  if alpha is fully non-zero, skip the expensive rasterio reproject

This makes resume automatic with no checkpoint metadata to manage.
A partially-completed zone resumes at the chunk level, and fully
completed zones are skipped in seconds.

--force bypasses both checks to reprocess everything.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace linear O(n) scan of 1.8M chunk files with binary search to
find the resume frontier in row-major order. Three fast paths:
- Fully done (last chunk exists): O(1) single stat call
- Nothing done (first chunk missing): O(1), no per-chunk stats
- Partial: O(log n) binary search + linear scan from frontier only

Also removes the in-worker alpha channel skip (redundant now that
the main process pre-filters) and fixes progress bar to show only
remaining work items.
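
The three paths above can be sketched roughly as follows. This is an illustration, not the code in this commit: `is_done` is a hypothetical predicate (for example a stat call on the chunk file), and the binary search assumes that completed chunks form an approximate prefix in row-major order.

```python
from typing import Callable, List


def resume_frontier(n_chunks: int, is_done: Callable[[int], bool]) -> int:
    """Return the index of the first chunk that appears not to be done."""
    if n_chunks == 0 or not is_done(0):
        return 0                  # nothing done: a single check
    if is_done(n_chunks - 1):
        return n_chunks           # fully done: a single check
    lo, hi = 0, n_chunks - 1      # invariant: lo is done, hi is not done
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_done(mid):
            lo = mid
        else:
            hi = mid
    return hi


def remaining_chunks(n_chunks: int, is_done: Callable[[int], bool]) -> List[int]:
    """List the chunks still to process, checking only from the frontier on."""
    frontier = resume_frontier(n_chunks, is_done)
    if frontier == 0:
        return list(range(n_chunks))  # no per-chunk checks needed
    # Parallel writers may have finished a few chunks past the frontier,
    # so the tail is scanned linearly.
    return [i for i in range(frontier, n_chunks) if not is_done(i)]
```

The linear scan after the frontier is what tolerates out-of-order completion by parallel workers.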

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add --skip-reproject to global-preview CLI to skip the reprojection
phase and only rebuild pyramids. Useful when reprojection completed
but the pyramiding step OOM'd.

Replace the pixel-sampling and chunk-file-scanning resume logic with
simple .zone_NN_done marker files — O(1) check on resume. The prior
approaches failed because Zarr v3 doesn't write files for fill-value
chunks, making "no file" ambiguous between "empty" and "unprocessed".
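
A minimal sketch of the marker-file idea, for illustration only. The helper names and the `store_dir` argument are hypothetical, and the two-digit zero padding of the zone number is an assumption based on the `.zone_NN_done` pattern.

```python
from pathlib import Path


def zone_marker(store_dir: Path, zone: int) -> Path:
    """Path of the completion marker for a zone (zero padding is an assumption)."""
    return store_dir / f".zone_{zone:02d}_done"


def mark_zone_done(store_dir: Path, zone: int) -> None:
    """Create the marker once the zone has been fully reprojected."""
    zone_marker(store_dir, zone).touch()


def zone_is_done(store_dir: Path, zone: int) -> bool:
    """Single existence check, independent of how many chunks the zone has."""
    return zone_marker(store_dir, zone).exists()
```

Unlike a missing chunk file, a missing marker has only one meaning: the zone did not finish.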

Extract _zone_output_bounds() helper so --skip-reproject can compute
the output region without opening the zone store.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The coarsening step read full-width row strips from the previous
pyramid level. For zone 29's ~400K-column bounding box, each strip
was ~1.5 GB as float32.  With 24 dask threads, concurrent allocations
hit ~36 GB, triggering the OOM killer (118 GB RSS observed).

Fix: tile the work in 2D (512x512 output tiles), so each task reads
a 1024x1024x4 source region (~4 MB).  24 concurrent tiles ≈ 96 MB.
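
The tiling and the memory arithmetic can be sketched as below. The function names are hypothetical. The 2x coarsening factor follows from the 512 to 1024 ratio in this message; the one-byte item size is an assumption that matches the quoted ~4 MB figure.

```python
from typing import Iterator, Tuple

TILE = 512  # output tile edge, from the commit message

Window = Tuple[slice, slice]


def coarsen_tiles(out_h: int, out_w: int, tile: int = TILE) -> Iterator[Tuple[Window, Window]]:
    """Yield (output window, source window) pairs covering the output level."""
    for r0 in range(0, out_h, tile):
        r1 = min(r0 + tile, out_h)
        for c0 in range(0, out_w, tile):
            c1 = min(c0 + tile, out_w)
            out_win = (slice(r0, r1), slice(c0, c1))
            src_win = (slice(2 * r0, 2 * r1), slice(2 * c0, 2 * c1))  # 2x finer level
            yield out_win, src_win


def source_bytes(tile: int = TILE, bands: int = 4, itemsize: int = 1) -> int:
    """Bytes read per task. ASSUMPTION: one byte per sample."""
    return (2 * tile) * (2 * tile) * bands * itemsize


print(source_bytes() / 2**20)       # 4.0 (MiB per task)
print(24 * source_bytes() / 2**20)  # 96.0 (MiB for 24 concurrent tasks)
```

Because each task's read size no longer depends on the zone's width, peak memory scales with the worker count rather than with the bounding box.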

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace dask.compute with ThreadPoolExecutor so each tile completion
updates a rich progress bar showing level, tiles done, and ETA.
Also drops the dask dependency from this code path.
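
A rough sketch of the pattern, using only the standard library. The names `run_tiles`, `process_tile`, and `on_progress` are hypothetical, and a plain callback stands in for the rich progress bar.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Hashable, Iterable


def run_tiles(
    tiles: Iterable[Hashable],
    process_tile: Callable[[Hashable], object],
    on_progress: Callable[[int, int], None],
    max_workers: int = 24,
) -> Dict[Hashable, object]:
    """Process tiles in a thread pool, reporting progress as each one finishes."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_tile, t): t for t in tiles}
        total = len(futures)
        for done, fut in enumerate(as_completed(futures), start=1):
            results[futures[fut]] = fut.result()  # re-raises any worker exception
            on_progress(done, total)              # called from the main thread
    return results
```

Since the callback runs after every completed future, a progress display can update per tile instead of only when the whole batch returns.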

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Design spec for consolidating per-zone standalone Zarr stores into
a single per-year store with nested groups. Includes migration script
design, STAC catalog changes, and tze viewer impact analysis.

Implementation plan with 14 tasks covering core layout changes,
CLI/STAC updates, migration script, and tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace flat per-zone stores (utm29_2024.zarr, utm30_2024.zarr,
global_rgb_2024.zarr) with a consolidated per-year layout
(2024.zarr/utm29/, 2024.zarr/utm30/, 2024.zarr/global_rgb/).
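
As an illustration of the renaming, the sketch below maps an old flat store name to its location in the consolidated layout. The function is hypothetical and is not the `_year_store_name()` or `_zone_group_name()` helper from this commit; the two-digit zone pattern is an assumption based on the examples above.

```python
import re
from pathlib import PurePosixPath

# Old flat names look like utm29_2024.zarr or global_rgb_2024.zarr.
_OLD_NAME = re.compile(r"^(utm\d{2}|global_rgb)_(\d{4})\.zarr$")


def consolidated_path(old_name: str) -> PurePosixPath:
    """Map an old per-zone store name to <year>.zarr/<group>."""
    match = _OLD_NAME.match(old_name)
    if match is None:
        raise ValueError(f"not a recognised flat store name: {old_name!r}")
    group, year = match.groups()
    return PurePosixPath(f"{year}.zarr") / group


print(consolidated_path("utm29_2024.zarr"))       # 2024.zarr/utm29
print(consolidated_path("global_rgb_2024.zarr"))  # 2024.zarr/global_rgb
```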

Key changes:
- Replace _store_name() with _year_store_name() + _zone_group_name()
- Add _ensure_year_store() for TOCTOU-safe year store creation
- create_zone_store() now creates zone groups within year store
- _ensure_global_store() creates global_rgb/ group within year store
- build_global_preview() discovers zone groups instead of scanning dirs
- Reprojection/coarsening use global_rgb/ prefixed array paths
- Readers (open_zone_store, read_region_from_zone, add_rgb_to_existing_store)
  accept year_store_path + zone_group instead of standalone store path
- Tile.from_zone_zarr() updated for nested layout
- rmtree safety: only ever targets zone group dir or global_rgb, never
  the year store root
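
The rmtree safety rule in the last bullet could be enforced with a guard along these lines. This is a sketch under stated assumptions, not the code in this commit: `remove_group` is a hypothetical helper, and the allowed names are taken from the layout described above.

```python
import re
import shutil
from pathlib import Path

# Only these child names may ever be deleted; the year store root never matches.
_ALLOWED_GROUP = re.compile(r"utm\d{2}|global_rgb")


def remove_group(year_store: Path, group: str) -> bool:
    """Delete one zone group or global_rgb; return True if something was removed."""
    if not _ALLOWED_GROUP.fullmatch(group):
        raise ValueError(f"refusing to remove {group!r} from {year_store}")
    target = year_store / group
    if not target.is_dir():
        return False
    shutil.rmtree(target)
    return True
```

Validating the group name with `fullmatch` rules out empty strings, `.`, `..`, and nested paths, so the target is always a direct child of the year store.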

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove dead _zone_store_path assignment in Tile.from_zone_zarr
- Make is_available check zone group directory exists (not just year store)
- Re-open zarr root after rmtree in _ensure_global_store to avoid stale handle
- Use single store handle in _init_reproj_worker instead of opening twice

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…store layout

Update registry_cli.py and scripts/patch_global_bounds.py to work with
the consolidated per-year Zarr store layout ({year}.zarr/utm{NN}/) instead
of standalone per-zone stores (utm{NN}_{year}.zarr).

- stac_index_command: scan {YYYY}.zarr stores, iterate zone groups within
- _zarr_store_to_stac_item: accept zone group handle + metadata instead of
  standalone store path; add tessera:dataset_version to properties
- zarr_build_command --rgb-only: discover year stores, iterate zone groups,
  call add_rgb_to_existing_store with new (year_store_path, zone_group) API
- Remove _store_matches helper (no longer needed)
- patch_global_bounds.py: look for global_rgb group within year stores
  instead of standalone global_rgb_*.zarr files; patch spatial:bbox on the
  global_rgb group's metadata within consolidated zarr.json
- Update CLI help strings to reference {year}.zarr layout

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove unused zarr_dir param from _zarr_store_to_stac_item
- Cache year store root in stac_index_command to avoid re-opening stores
- Fix redundant except (KeyError, Exception) -> except Exception
- Remove unnecessary hasattr guard on zarr group attrs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The .zone_*_done markers end up inside the moved zone directories
(e.g. 2024.zarr/utm30/.zone_30_done), not at the year store root.
Use rglob to find them recursively, and suppress zarr warnings about
unrecognized objects during consolidation.
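
A minimal sketch of the recursive lookup, with a hypothetical function name:

```python
from pathlib import Path
from typing import List


def find_zone_markers(year_store: Path) -> List[Path]:
    """Find .zone_*_done markers at any depth below the year store."""
    return sorted(year_store.rglob(".zone_*_done"))
```

With the example above, this would find `2024.zarr/utm30/.zone_30_done` even though the marker is not at the store root.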

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…outhern hemisphere coords

Remove one-off migration commands (zarr-migrate-attrs, zarr-consolidate),
the patch_global_bounds.py script, migrate_store_attrs(), dead code
(write_tile_to_store, generate_master_registry), and unused deps
(ndpyramid, xproj).

Fix zarr.open_group with mode="r+" to use use_consolidated=False,
preventing KeyError when creating new zone groups in stores that have
stale consolidated metadata from previous builds.

Fix _load_from_zone_zarr to canonicalize southern hemisphere northings
via northing_to_canonical() to match the store's coordinate system.

Fix --rgb-only to respect --year instead of scanning all year stores.

Update docs to reflect consolidated store layout and fix stale API
examples for read_region_from_zone and open_zone_store.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Read-only zarr.open_group calls also use stale consolidated metadata
when enumerating zone groups in a store that's being incrementally
built. This caused global-preview and stac-index to miss zones added
after the last metadata consolidation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename unprefixed zone/root attributes (utm_zone, year, pixel_size_m,
geotessera_version, n_tiles, has_rgb_preview, rgb_bands, rgb_stretch,
tessera_dataset_version) to tessera:-prefixed equivalents following the
zarr-convention-tessera spec (UUID e7f90d5f-019e-4a38-802f-9fa695e26c71).

New properties added to zone groups:
- tessera:quantisation — dequantisation contract (method, formula, dtypes)
- tessera:n_bands — embedding dimensionality
- tessera:model_version — embedding model version (default "1.0")

Adds TESSERA_CONVENTION to zarr_conventions on both root and zone groups.

Includes backwards-compat _get_tessera_attr() helper that falls back to
old unprefixed keys, and a migrate_store_attrs() function + zarr-migrate
CLI command to upgrade existing stores in-place.
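
The fallback behaviour can be pictured with a small sketch. The signature here is an assumption and may differ from the real `_get_tessera_attr()`; `legacy_key` is a hypothetical parameter for old names that do not simply drop the prefix.

```python
from typing import Any, Mapping, Optional

_MISSING = object()


def get_tessera_attr(
    attrs: Mapping[str, Any],
    name: str,
    default: Any = None,
    legacy_key: Optional[str] = None,
) -> Any:
    """Read tessera:<name>, falling back to the old unprefixed key."""
    value = attrs.get(f"tessera:{name}", _MISSING)
    if value is not _MISSING:
        return value
    return attrs.get(legacy_key or name, default)


print(get_tessera_attr({"tessera:utm_zone": 30, "utm_zone": 29}, "utm_zone"))  # 30
print(get_tessera_attr({"year": 2024}, "year"))                                # 2024
```

The prefixed key wins whenever both are present, so a partially migrated store still reads consistently.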

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Match the actual repository location for schema_url and spec_url.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The dequantisation formula is fixed by the tessera convention itself
and does not vary per store, so encoding it in every zone group's
metadata is redundant. The migration function now removes stale
tessera:quantisation attrs from legacy stores.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Xarray requires dimension_names in Zarr v3 array metadata to open
datasets. Added missing dimension_names to global_rgb band arrays
in new builds, and a migration step that patches zarr.json for all
arrays in existing stores (zone-level and global_rgb).
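
A sketch of what such a patch step might look like for a single metadata file. The function name is hypothetical, and the dimension names passed in (for example `["y", "x"]`) are an assumption; the commit does not list the names it writes.

```python
import json
from pathlib import Path
from typing import List


def patch_dimension_names(zarr_json: Path, names: List[str]) -> bool:
    """Add dimension_names to a Zarr v3 array's zarr.json if it is missing.

    Returns True if the file was modified.
    """
    meta = json.loads(zarr_json.read_text())
    if meta.get("node_type") != "array" or meta.get("dimension_names"):
        return False  # a group, or an array that already has names
    if len(names) != len(meta["shape"]):
        raise ValueError("need one dimension name per array axis")
    meta["dimension_names"] = names
    zarr_json.write_text(json.dumps(meta, indent=2))
    return True
```

The function is idempotent, so a migration can run it over every array in a store without tracking which ones were already patched.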

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove tessera:pixel_size_m (fixed at 10m for v1, in spatial:transform)
- Remove tessera:rgb_bands and tessera:rgb_stretch (RGB preview is opaque)
- Migration now cleans up all superseded attrs from legacy stores

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- SPATIAL_CONVENTION: fix description to "Spatial coordinate information"
  (was "Spatial coordinate transformations and mappings")
- PROJ_CONVENTION: fix URLs to zarr-experimental/geo-proj
  (was zarr-conventions/geo-proj)
- Migration now fixes stale convention descriptions and URLs in
  existing stores via _fix_zarr_conventions()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove .zone_*_done marker files left over from builds
- Suppress "Object at .* is not recognized" zarr warnings when
  iterating root group keys during migration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tessera always uses standard UTM zones so proj:code (e.g. EPSG:32630)
is sufficient — clients can derive full WKT2 via pyproj. Migration
now removes proj:wkt2 from existing stores.
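
To illustrate why the code alone is enough for standard UTM zones, the sketch below recovers the zone number and hemisphere from a `proj:code` value using the EPSG numbering for WGS 84 / UTM (326xx north, 327xx south). The function name is hypothetical and not part of this PR.

```python
from typing import Tuple


def utm_from_proj_code(code: str) -> Tuple[int, str]:
    """Return (zone, hemisphere) for a WGS 84 / UTM EPSG code such as EPSG:32630."""
    authority, _, number = code.partition(":")
    if authority != "EPSG" or not number.isdigit():
        raise ValueError(f"not an EPSG code: {code!r}")
    n = int(number)
    if 32601 <= n <= 32660:
        return n - 32600, "N"
    if 32701 <= n <= 32760:
        return n - 32700, "S"
    raise ValueError(f"not a WGS 84 / UTM zone code: {code!r}")


print(utm_from_proj_code("EPSG:32630"))  # (30, 'N')
```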

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The root year store group has no proj: or spatial: attributes — those
belong on the child zone groups. Declaring them at root causes
validation failures for missing required properties. Migration now
strips proj: and spatial: from the root zarr_conventions list.
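
The root clean-up can be sketched as a filter over the registration list. This is illustrative only: it assumes each `zarr_conventions` entry is a dict with a `name` field holding the convention prefix, and the function name is hypothetical.

```python
from typing import Any, Dict, Iterable


def strip_root_conventions(
    attrs: Dict[str, Any],
    drop: Iterable[str] = ("proj:", "spatial:"),
) -> int:
    """Remove the named conventions from a root group's zarr_conventions list.

    ASSUMPTION: entries are dicts with a "name" key. Returns the number removed.
    """
    conventions = attrs.get("zarr_conventions")
    if not conventions:
        return 0
    drop_set = set(drop)
    kept = [c for c in conventions if c.get("name") not in drop_set]
    attrs["zarr_conventions"] = kept
    return len(conventions) - len(kept)
```

Zone groups keep their own registrations, so only the root's declarations change.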

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>