geotessera-registry: Zarr v3 zone stores with global RGB previews#211
Open
avsm wants to merge 29 commits into ucam-eo:main from
Add zone-wide Zarr v3 store support for consolidating per-tile embeddings into efficient, cloud-native stores grouped by UTM zone and year.
New modules:
- zarr_zone: Core logic for building sharded Zarr v3 stores, RGB previews, global EPSG:4326 preview pyramids, and spatial subsetting
- tiles: Format-agnostic tile abstraction supporting npy, geotiff, zarr, and zone_zarr backends
New geotessera-registry CLI commands:
- zarr-build: Build zone-wide stores from downloaded tile data with shard-first parallel writes using ProcessPoolExecutor
- global-preview: Build multiscale EPSG:4326 RGB preview pyramid from per-zone UTM stores, with per-chunk reprojection
- stac-index: Generate static STAC catalog from Zarr stores
Store layout uses 256x256 shards with 4x4 inner chunks (zstd-compressed), enabling single-pixel HTTP range lookups (~2KB). NaN in scales indicates no-data (water/no coverage).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
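To make the shard and chunk layout above concrete, here is a minimal sketch of the index arithmetic for a single-pixel lookup. The function names are hypothetical, and the band count (128) and one-byte sample size are assumptions that are not stated in this PR; they are chosen only because they reproduce the ~2KB figure.

```python
SHARD = 256          # shard edge in pixels (from the commit message)
CHUNK = 4            # inner chunk edge in pixels (from the commit message)
N_BANDS = 128        # ASSUMPTION: embedding dimensionality, not stated in the PR
BYTES_PER_VALUE = 1  # ASSUMPTION: one byte per quantised sample


def locate_pixel(row: int, col: int):
    """Map a pixel to (shard index, chunk index inside the shard, offset inside the chunk)."""
    shard = (row // SHARD, col // SHARD)
    chunk_in_shard = ((row % SHARD) // CHUNK, (col % SHARD) // CHUNK)
    offset_in_chunk = (row % CHUNK, col % CHUNK)
    return shard, chunk_in_shard, offset_in_chunk


def chunk_bytes_uncompressed() -> int:
    """Upper bound on the bytes fetched for one pixel, before zstd compression."""
    return CHUNK * CHUNK * N_BANDS * BYTES_PER_VALUE


# Each shard holds (256 / 4)^2 inner chunks.
CHUNKS_PER_SHARD = (SHARD // CHUNK) ** 2
```

Under these assumptions a reader fetches one inner chunk out of the 4096 in a shard, which is what keeps a point query down to a single small range request.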
The zarr-build and global-preview tests take too long to run in CI. Per-tile zarr format tests are retained. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update all Zarr store metadata to comply with the three GeoZarr conventions (proj:, spatial:, multiscales) from zarr-conventions.
Zone stores now write:
- zarr_conventions registration array
- proj:code, proj:wkt2 (replacing crs_epsg, crs_wkt)
- spatial:dimensions, spatial:transform, spatial:shape, spatial:bbox, spatial:registration (replacing transform)
Global preview stores now write:
- spatial: convention in zarr_conventions registration
- spatial:dimensions, spatial:bbox (replacing non-standard spatial dict)
- Per-level spatial:shape and spatial:transform in multiscale layout
- Remove non-standard multiscales.crs key
All readers updated to use new namespaced keys. No backwards-compat shims; use `geotessera-registry zarr-migrate-attrs` to upgrade existing stores in-place.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
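The key renames listed above could be sketched roughly as follows. This is not the PR's migration code: the function name is hypothetical, and it assumes that crs_epsg held an integer EPSG number and that the transform value can be carried over unchanged.

```python
def migrate_zone_attrs(attrs: dict) -> dict:
    """Return a copy of legacy zone attrs with namespaced keys (hypothetical helper)."""
    out = dict(attrs)
    if "crs_epsg" in out:
        # ASSUMPTION: the legacy value is an integer such as 32630.
        out["proj:code"] = f"EPSG:{int(out.pop('crs_epsg'))}"
    if "crs_wkt" in out:
        out["proj:wkt2"] = out.pop("crs_wkt")
    if "transform" in out:
        # ASSUMPTION: the coefficient layout is the same under the new key.
        out["spatial:transform"] = out.pop("transform")
    return out
```

Keys not covered by the renames pass through untouched, so the function can be applied to a store that has already been migrated without changing it.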
Track completed zones in global store attrs (_completed_zones) so that a crash/OOM mid-build can be resumed without reprocessing finished zones. The checkpoint is cleared once all zones complete. Add --force flag to global-preview CLI to bypass checkpoints. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace metadata-based zone checkpointing with data scanning:
- Zone level: sample 8 interior pixels from level-0; if all have non-zero alpha, skip the zone's reproject + pyramid entirely
- Chunk level: each reproject worker reads the target chunk first; if alpha is fully non-zero, skip the expensive rasterio reproject
This makes resume automatic with no checkpoint metadata to manage. A partially-completed zone resumes at the chunk level, and fully completed zones are skipped in seconds. --force bypasses both checks to reprocess everything.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace linear O(n) scan of 1.8M chunk files with binary search to find the resume frontier in row-major order. Three fast paths:
- Fully done (last chunk exists): O(1) single stat call
- Nothing done (first chunk missing): O(1), no per-chunk stats
- Partial: O(log n) binary search + linear scan from frontier only
Also removes the in-worker alpha channel skip (redundant now that the main process pre-filters) and fixes progress bar to show only remaining work items.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
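The three fast paths can be illustrated with a short sketch. The function name and the `exists` callback are hypothetical stand-ins for a per-chunk stat call, and the binary search assumes chunks were written roughly in row-major order, so that existence is (nearly) monotonic up to the frontier.

```python
from typing import Callable, List


def remaining_chunks(n: int, exists: Callable[[int], bool]) -> List[int]:
    """Return the row-major chunk indices that still need processing."""
    if n == 0:
        return []
    if exists(n - 1):
        return []                  # fully done: one stat call
    if not exists(0):
        return list(range(n))      # nothing done: no per-chunk stats
    # Invariant: exists(lo) is True and exists(hi) is False.
    lo, hi = 0, n - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if exists(mid):
            lo = mid
        else:
            hi = mid
    # Workers may finish slightly out of order, so stat every chunk from
    # the frontier onwards rather than assuming all of them are missing.
    return [i for i in range(hi, n) if not exists(i)]
```

A later commit in this PR replaces this scheme with marker files, because a missing chunk file does not reliably mean "unprocessed" in Zarr v3.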
Add --skip-reproject to global-preview CLI to skip the reprojection phase and only rebuild pyramids. Useful when reprojection completed but the pyramiding step OOM'd. Replace the pixel-sampling and chunk-file-scanning resume logic with simple .zone_NN_done marker files — O(1) check on resume. The prior approaches failed because Zarr v3 doesn't write files for fill-value chunks, making "no file" ambiguous between "empty" and "unprocessed". Extract _zone_output_bounds() helper so --skip-reproject can compute the output region without opening the zone store. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
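A marker-file check of this kind might look like the sketch below. The helper names are hypothetical, and both the zero-padded zone number and the directory the marker is written to are assumptions; the PR only gives the `.zone_NN_done` pattern. The `force` argument assumes that `--force` bypasses the marker check in the same way it bypassed the earlier checks.

```python
from pathlib import Path


def zone_marker(store_dir: Path, zone: int) -> Path:
    """Path of the completion marker for one UTM zone (hypothetical layout)."""
    return store_dir / f".zone_{zone:02d}_done"


def is_zone_done(store_dir: Path, zone: int, force: bool = False) -> bool:
    """O(1) resume check: a single existence test, skipped when force is set."""
    return (not force) and zone_marker(store_dir, zone).exists()


def mark_zone_done(store_dir: Path, zone: int) -> None:
    """Record completion only after the zone's work has fully finished."""
    zone_marker(store_dir, zone).touch()
```

Unlike the chunk-file scan, an explicit marker cannot be confused with a chunk that is legitimately absent because it contains only the fill value.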
The coarsening step read full-width row strips from the previous pyramid level. For zone 29's ~400K-column bounding box, each strip was ~1.5 GB as float32. With 24 dask threads, concurrent allocations hit ~36 GB, triggering the OOM killer (118 GB RSS observed). Fix: tile the work in 2D (512x512 output tiles), so each task reads a 1024x1024x4 source region (~4 MB). 24 concurrent tiles ≈ 96 MB. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
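The 2D tiling and the memory figures above can be checked with a small sketch. The function names are hypothetical; the 2x relationship between output and source windows is inferred from the 512 to 1024 sizes in the commit message.

```python
def tile_windows(out_rows: int, out_cols: int, tile: int = 512):
    """Yield (output window, source window) pairs as (r0, r1, c0, c1) tuples."""
    for r0 in range(0, out_rows, tile):
        r1 = min(r0 + tile, out_rows)
        for c0 in range(0, out_cols, tile):
            c1 = min(c0 + tile, out_cols)
            # Each output pixel summarises a 2x2 block of the previous level.
            yield (r0, r1, c0, c1), (2 * r0, 2 * r1, 2 * c0, 2 * c1)


def region_bytes(rows: int, cols: int, bands: int = 4, itemsize: int = 1) -> int:
    """Bytes held in memory for one region of the given shape."""
    return rows * cols * bands * itemsize


# itemsize=1 reproduces the ~4 MB per-tile figure quoted above;
# a float32 working copy of the same region would be four times larger.
PER_TILE = region_bytes(1024, 1024)   # 4 MiB
PEAK_24_TILES = 24 * PER_TILE         # 96 MiB
```

The point of the change is that peak memory now scales with the tile size and thread count, not with the width of the zone's bounding box.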
Replace dask.compute with ThreadPoolExecutor so each tile completion updates a rich progress bar showing level, tiles done, and ETA. Also drops the dask dependency from this code path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
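The pattern described here, where each finished tile advances a progress display, can be sketched with the standard library alone. The function and callback names are hypothetical, and a plain callback stands in for the rich progress bar used in the PR.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_tiles(tiles, process_tile, max_workers=24, on_progress=None):
    """Process tiles in a thread pool, reporting progress as each one completes."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_tile, t): t for t in tiles}
        done = 0
        for fut in as_completed(futures):
            # result() re-raises any exception from the worker thread.
            results[futures[fut]] = fut.result()
            done += 1
            if on_progress is not None:
                on_progress(done, len(futures))
    return results
```

Because the callback runs in the submitting thread as futures complete, the progress display needs no locking of its own.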
Design spec for consolidating per-zone standalone Zarr stores into a single per-year store with nested groups. Includes migration script design, STAC catalog changes, and tze viewer impact analysis. Implementation plan with 14 tasks covering core layout changes, CLI/STAC updates, migration script, and tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace flat per-zone stores (utm29_2024.zarr, utm30_2024.zarr, global_rgb_2024.zarr) with a consolidated per-year layout (2024.zarr/utm29/, 2024.zarr/utm30/, 2024.zarr/global_rgb/).
Key changes:
- Replace _store_name() with _year_store_name() + _zone_group_name()
- Add _ensure_year_store() for TOCTOU-safe year store creation
- create_zone_store() now creates zone groups within year store
- _ensure_global_store() creates global_rgb/ group within year store
- build_global_preview() discovers zone groups instead of scanning dirs
- Reprojection/coarsening use global_rgb/ prefixed array paths
- Readers (open_zone_store, read_region_from_zone, add_rgb_to_existing_store) accept year_store_path + zone_group instead of standalone store path
- Tile.from_zone_zarr() updated for nested layout
- rmtree safety: only ever targets zone group dir or global_rgb, never the year store root
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
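A minimal sketch of the naming scheme and the rmtree guard described above follows. These are illustrative stand-ins, not the PR's `_year_store_name()` and `_zone_group_name()`; the two-digit zero padding of the zone number is an assumption based on the `utm{NN}` pattern.

```python
import re
from pathlib import Path

# Only these direct children of a year store may ever be deleted.
_REMOVABLE = re.compile(r"utm\d{2}|global_rgb")


def year_store_name(year: int) -> str:
    return f"{year}.zarr"


def zone_group_name(zone: int) -> str:
    # ASSUMPTION: zones are zero-padded to two digits.
    return f"utm{zone:02d}"


def removable_group_dir(year_store: Path, group: str) -> Path:
    """Return the directory that may be passed to rmtree, or raise ValueError."""
    if not _REMOVABLE.fullmatch(group):
        raise ValueError(f"refusing to remove {group!r} from {year_store}")
    return year_store / group
```

Validating the group name against a whitelist means an empty string, `.` or a `..` path can never resolve to the year store root or to a sibling store.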
- Remove dead _zone_store_path assignment in Tile.from_zone_zarr
- Make is_available check zone group directory exists (not just year store)
- Re-open zarr root after rmtree in _ensure_global_store to avoid stale handle
- Use single store handle in _init_reproj_worker instead of opening twice
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…store layout
Update registry_cli.py and scripts/patch_global_bounds.py to work with
the consolidated per-year Zarr store layout ({year}.zarr/utm{NN}/) instead
of standalone per-zone stores (utm{NN}_{year}.zarr).
- stac_index_command: scan {YYYY}.zarr stores, iterate zone groups within
- _zarr_store_to_stac_item: accept zone group handle + metadata instead of
standalone store path; add tessera:dataset_version to properties
- zarr_build_command --rgb-only: discover year stores, iterate zone groups,
call add_rgb_to_existing_store with new (year_store_path, zone_group) API
- Remove _store_matches helper (no longer needed)
- patch_global_bounds.py: look for global_rgb group within year stores
instead of standalone global_rgb_*.zarr files; patch spatial:bbox on the
global_rgb group's metadata within consolidated zarr.json
- Update CLI help strings to reference {year}.zarr layout
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove unused zarr_dir param from _zarr_store_to_stac_item
- Cache year store root in stac_index_command to avoid re-opening stores
- Fix redundant except (KeyError, Exception) -> except Exception
- Remove unnecessary hasattr guard on zarr group attrs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The .zone_*_done markers end up inside the moved zone directories (e.g. 2024.zarr/utm30/.zone_30_done), not at the year store root. Use rglob to find them recursively, and suppress zarr warnings about unrecognized objects during consolidation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…outhern hemisphere coords
Remove one-off migration commands (zarr-migrate-attrs, zarr-consolidate), the patch_global_bounds.py script, migrate_store_attrs(), dead code (write_tile_to_store, generate_master_registry), and unused deps (ndpyramid, xproj).
Fix zarr.open_group with mode="r+" to use use_consolidated=False, preventing KeyError when creating new zone groups in stores that have stale consolidated metadata from previous builds.
Fix _load_from_zone_zarr to canonicalize southern hemisphere northings via northing_to_canonical() to match the store's coordinate system.
Fix --rgb-only to respect --year instead of scanning all year stores.
Update docs to reflect consolidated store layout and fix stale API examples for read_region_from_zone and open_zone_store.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Read-only zarr.open_group calls also use stale consolidated metadata when enumerating zone groups in a store that's being incrementally built. This caused global-preview and stac-index to miss zones added after the last metadata consolidation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename unprefixed zone/root attributes (utm_zone, year, pixel_size_m, geotessera_version, n_tiles, has_rgb_preview, rgb_bands, rgb_stretch, tessera_dataset_version) to tessera:-prefixed equivalents following the zarr-convention-tessera spec (UUID e7f90d5f-019e-4a38-802f-9fa695e26c71).
New properties added to zone groups:
- tessera:quantisation: dequantisation contract (method, formula, dtypes)
- tessera:n_bands: embedding dimensionality
- tessera:model_version: embedding model version (default "1.0")
Adds TESSERA_CONVENTION to zarr_conventions on both root and zone groups. Includes backwards-compat _get_tessera_attr() helper that falls back to old unprefixed keys, and a migrate_store_attrs() function + zarr-migrate CLI command to upgrade existing stores in-place.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
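The fallback behaviour described for `_get_tessera_attr()` might be sketched like this. It is an illustration, not the PR's implementation: the signature is a guess, and the special-case mapping from `tessera_dataset_version` to `tessera:dataset_version` is an assumption drawn from the attribute names that appear elsewhere in this PR.

```python
# ASSUMPTION: legacy keys whose old spelling is not simply the unprefixed name.
_LEGACY_KEYS = {"dataset_version": "tessera_dataset_version"}


def get_tessera_attr(attrs, name, default=None):
    """Read tessera:<name>, falling back to the pre-convention key if present."""
    prefixed = f"tessera:{name}"
    if prefixed in attrs:
        return attrs[prefixed]
    legacy = _LEGACY_KEYS.get(name, name)
    if legacy in attrs:
        return attrs[legacy]
    return default
```

Checking the prefixed key first means a half-migrated store, which may carry both spellings, resolves to the new value.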
Match the actual repository location for schema_url and spec_url. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The dequantisation formula is fixed by the tessera convention itself and does not vary per store, so encoding it in every zone group's metadata is redundant. The migration function now removes stale tessera:quantisation attrs from legacy stores. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Xarray requires dimension_names in Zarr v3 array metadata to open datasets. Added missing dimension_names to global_rgb band arrays in new builds, and a migration step that patches zarr.json for all arrays in existing stores (zone-level and global_rgb). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
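A migration step of this kind could, in outline, edit each array's `zarr.json` directly. The sketch below uses only the standard library; the function name is hypothetical and the dimension names passed in are up to the caller, since the PR does not list the names it writes.

```python
import json
from pathlib import Path


def patch_dimension_names(zarr_json: Path, names) -> bool:
    """Add dimension_names to a Zarr v3 array's metadata if it is missing.

    Returns True when the file was modified.
    """
    meta = json.loads(zarr_json.read_text())
    if meta.get("node_type") != "array":
        return False                      # groups carry no dimension_names
    if meta.get("dimension_names"):
        return False                      # already present, leave untouched
    if len(names) != len(meta["shape"]):
        raise ValueError("one name is required per array dimension")
    meta["dimension_names"] = list(names)
    zarr_json.write_text(json.dumps(meta, indent=2))
    return True
```

If the store also carries consolidated metadata, that copy would presumably need refreshing after such a patch; this sketch does not handle that.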
- Remove tessera:pixel_size_m (fixed at 10m for v1, in spatial:transform)
- Remove tessera:rgb_bands and tessera:rgb_stretch (RGB preview is opaque)
- Migration now cleans up all superseded attrs from legacy stores
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- SPATIAL_CONVENTION: fix description to "Spatial coordinate information" (was "Spatial coordinate transformations and mappings")
- PROJ_CONVENTION: fix URLs to zarr-experimental/geo-proj (was zarr-conventions/geo-proj)
- Migration now fixes stale convention descriptions and URLs in existing stores via _fix_zarr_conventions()
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove .zone_*_done marker files left over from builds
- Suppress "Object at .* is not recognized" zarr warnings when iterating root group keys during migration
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tessera always uses standard UTM zones so proj:code (e.g. EPSG:32630) is sufficient — clients can derive full WKT2 via pyproj. Migration now removes proj:wkt2 from existing stores. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
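As an illustration of why the short code is enough, the sketch below recovers the UTM zone from a `proj:code` using the standard WGS 84 / UTM EPSG numbering (326NN for northern zones, 327NN for southern), and shows one way a client might obtain WKT2 through pyproj. Both function names are hypothetical, and whether Tessera stores ever use the 327NN codes is not stated in this PR.

```python
def utm_zone_from_proj_code(code: str):
    """Return (zone, hemisphere) for a WGS 84 / UTM code such as 'EPSG:32630'."""
    authority, _, number = code.partition(":")
    if authority != "EPSG" or not number.isdigit():
        raise ValueError(f"not an EPSG code: {code!r}")
    n = int(number)
    if 32601 <= n <= 32660:
        return n - 32600, "N"
    if 32701 <= n <= 32760:
        return n - 32700, "S"
    raise ValueError(f"not a WGS 84 / UTM code: {code!r}")


def wkt2_from_proj_code(code: str) -> str:
    """Derive WKT2 on demand; requires the third-party pyproj package."""
    from pyproj import CRS  # imported lazily so the parser above has no dependencies

    return CRS.from_user_input(code).to_wkt()
```

For example, `utm_zone_from_proj_code("EPSG:32630")` gives zone 30 in the northern hemisphere.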
The root year store group has no proj: or spatial: attributes — those belong on the child zone groups. Declaring them at root causes validation failures for missing required properties. Migration now strips proj: and spatial: from the root zarr_conventions list. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>