diff --git a/REPOSITORY_OVERVIEW.md b/REPOSITORY_OVERVIEW.md new file mode 100644 index 00000000..300e21b8 --- /dev/null +++ b/REPOSITORY_OVERVIEW.md @@ -0,0 +1,544 @@ +# iSamples in a Box - Repository Overview + +## Table of Contents +1. [Project Purpose](#project-purpose) +2. [Architecture Overview](#architecture-overview) +3. [Key Components](#key-components) +4. [Getting Started](#getting-started) +5. [Data Flow](#data-flow) +6. [Key Scripts and Entry Points](#key-scripts-and-entry-points) +7. [API Endpoints](#api-endpoints) +8. [Development Workflow](#development-workflow) +9. [Testing](#testing) +10. [Deployment](#deployment) + +## Project Purpose + +**iSamples in a Box** (ISB) is a comprehensive Python-based system for aggregating, managing, and providing access to geological and environmental sample metadata from multiple authoritative sources. The system enables researchers and institutions to: + +- **Harvest** sample data from multiple repositories (SESAR, GEOME, Smithsonian, OpenContext) +- **Store** sample records in a PostgreSQL database with full metadata +- **Index** relationships and searchable metadata in Apache Solr for fast querying +- **Expose** data through a REST API using FastAPI +- **Browse** samples through a web UI +- **Mint** identifiers (DataCite DOIs) with ORCID authentication +- **Search** geospatially using H3 hexagon-based heatmaps + +**Current Version:** 0.5.1 +**License:** Apache 2.0 +**Python Version:** 3.11+ + +## Architecture Overview + +``` +┌─────────────────────────────────────────────────────────────┐ +│ Data Sources │ +│ SESAR │ GEOME │ Smithsonian │ OpenContext │ +└────┬────────┬───────────┬──────────────┬─────────────────────┘ + │ │ │ │ + │ Source Adapters (isb_lib/*_adapter.py) + │ │ │ │ + ▼ ▼ ▼ ▼ +┌────────────────────────────────────────────────────────────┐ +│ Metadata Transformers │ +│ (isamples_metadata/*Transformer.py) │ +└────────────────────────┬───────────────────────────────────┘ + │ + ▼ + ┌────────────────────────────────┐ + │ PostgreSQL Database │ + │ (SQLModel ORM - Thing model) │ + └────────────────┬───────────────┘ + │ + ▼ + ┌────────────────────────────────┐ + │ Apache Solr Index │ + │ (isb_core_records collection)│ + └────────────────┬───────────────┘ + │ + ▼ + ┌────────────────────────────────┐ + │ FastAPI Web Service │ + │ (isb_web/main.py) │ + │ - REST API │ + │ - Web UI (Jinja2 templates) │ + │ - Export Service │ + └────────────────────────────────┘ +``` + +## Key Components + +### 1. 
Core Library (`isb_lib/`) + +The heart of the system, containing business logic and utilities: + +- **`core.py`** (803 lines) - Core utilities including date parsing, validation, vocabulary management +- **Source Adapters**: + - `sesar_adapter.py` - SESAR (System for Earth Sample Registration) + - `geome_adapter.py` - GEOME (Genomic Observatories Metadatabase) + - `smithsonian_adapter.py` - Smithsonian Institution collections + - `opencontext_adapter.py` - Open Context archaeological data +- **`models/`** - SQLModel ORM definitions: + - `thing.py` - Core `Thing` model representing a sample + - `isb_core_record.py` - Extended metadata model + - `export_job.py` - Export job tracking + - `namespace.py` - Identifier namespaces +- **`identifiers/`** - Identifier minting (DataCite DOIs, N2T ARKs) +- **`vocabulary/`** - Controlled vocabulary management +- **`utilities/`** - Helper utilities (H3 geospatial, Solr transformations) +- **`sitemaps/`** - Sitemap generation for search engines +- **`authorization/`** - User authentication and authorization + +### 2. Web Service (`isb_web/`) + +FastAPI-based REST API and web interface: + +- **`main.py`** (931 lines) - Main FastAPI application with all routes +- **`sqlmodel_database.py`** (630 lines) - Database access object (DAO) +- **`isb_solr_query.py`** - Solr query builder and executor +- **`export.py`** - Data export service (CSV, JSONL) +- **`manage.py`** - User and identifier management +- **`auth.py`** - ORCID OAuth authentication +- **`templates/`** - Jinja2 HTML templates for web UI +- **`static/`** - CSS, JavaScript, controlled vocabulary JSON files + +### 3. Metadata Transformation (`isamples_metadata/`) + +Transforms source data to standardized iSamples schema: + +- **Transformers** for each source (SESAR, GEOME, OpenContext, Smithsonian) +- **Controlled vocabularies** for consistent categorization +- **Taxonomy mappings** for biological classifications + +### 4. Scripts (`scripts/`) + +CLI tools for data management (22+ scripts): + +**Main Entry Points:** +- `sesar_things.py` - Load and index SESAR samples +- `geome_things.py` - Load and index GEOME samples +- `opencontext_things.py` - Load OpenContext samples +- `smithsonian_things.py` - Load Smithsonian samples +- `isb_things.py` - General ISB operations + +**Utility Scripts:** +- `dump_thing_json.py` - Export Thing records as JSON +- `create_sql_lite_dump.py` - Create SQLite database dumps +- `load_isamples_vocabularies.py` - Load controlled vocabularies +- `migrations/` - Database migration utilities + +## Getting Started + +### Prerequisites + +- Python 3.11+ +- PostgreSQL +- Apache Solr 8.8+ +- Poetry (Python dependency management) + +### Quick Setup + +1. **Clone and setup Python environment:** +```bash +git clone git@github.com:isamplesorg/isamples_inabox.git +cd isamples_inabox +poetry install +``` + +2. **Setup PostgreSQL:** +```bash +psql postgres +CREATE DATABASE isb_1; +CREATE USER isb_writer WITH ENCRYPTED PASSWORD 'your_password'; +GRANT ALL PRIVILEGES ON DATABASE isb_1 TO isb_writer; +``` + +3. **Setup Solr:** +```bash +solr create -c isb_core_records +python scripts/solr_schema_init/create_isb_core_schema.py +``` + +4. **Create configuration file (`isb.cfg`):** +```ini +db_url = "postgresql+psycopg2://isb_writer:your_password@localhost/isb_1" +solr_url = "http://localhost:8983/solr/isb_core_records/" +max_records = 1000 +verbosity = "INFO" +``` + +5. 
**Load sample data:** +```bash +poetry run sesar_things --config isb.cfg load -m 5000 +poetry run sesar_things --config isb.cfg relations +``` + +6. **Start web service:** +```bash +python isb_web/main.py +# Navigate to http://localhost:8000/ +``` + +## Data Flow + +### 1. Ingestion Flow + +``` +Source API → Adapter → Transformer → PostgreSQL Thing Table + ↓ + Solr Index (for search) +``` + +**Example: Loading SESAR data** +```bash +poetry run sesar_things --config isb.cfg load -m 5000 +poetry run sesar_things --config isb.cfg relations +``` + +### 2. Query Flow + +``` +User/API Request → FastAPI (isb_web/main.py) + ↓ + Solr Query (for search/filter) + ↓ + PostgreSQL (for full record details) + ↓ + JSON Response +``` + +### 3. Export Flow + +``` +User → Export API (/export/create?q=...&format=CSV) + ↓ + Export Job Created (UUID returned) + ↓ + Background Worker queries Solr + ↓ + Results transformed (SolrResultTransformer) + ↓ + File written (/tmp/{uuid}.csv or .jsonl) + ↓ + User downloads via /export/download?uuid=... +``` + +## Key Scripts and Entry Points + +### Web Service + +```bash +# Start FastAPI server (dev mode) +python isb_web/main.py + +# Production deployment uses uvicorn: +uvicorn isb_web.main:app --host 0.0.0.0 --port 8000 +``` + +### Data Loading (via Poetry) + +```bash +# SESAR samples +poetry run sesar_things --config isb.cfg load -m 5000 +poetry run sesar_things --config isb.cfg relations + +# GEOME samples +poetry run geome_things --config isb.cfg load -m 5000 +poetry run geome_things --config isb.cfg relations + +# OpenContext samples +poetry run opencontext_things --config isb.cfg load + +# Smithsonian samples +poetry run smithsonian_things --config isb.cfg load +``` + +### Utility Scripts + +```bash +# Dump Thing records as JSON +python scripts/dump_thing_json.py -d -a SMITHSONIAN -c 1000 -p /output/path + +# Create SQLite dump +python scripts/create_sql_lite_dump.py --config isb.cfg -q "*:*" + +# Load controlled vocabularies +python scripts/load_isamples_vocabularies.py --config isb.cfg +``` + +## API Endpoints + +The FastAPI service provides multiple API categories: + +### Things API (`/thing`) +- `GET /thing/{identifier}` - Get a specific Thing by identifier +- `GET /thing` - Search Things with filtering + +### Solr API (`/solr`) +- `GET /solr/search` - Direct Solr query interface +- `GET /solr/select` - Solr select handler +- `GET /solr/heatmap` - Get H3 hexagon heatmap data + +### Export API (`/export`) - **Requires ORCID authentication** +- `GET /export/create?q=...&export_format=CSV|JSONL` - Create export job +- `GET /export/status?uuid=...` - Check export job status +- `GET /export/download?uuid=...` - Download completed export + +### Vocabularies API (`/vocabularies`) +- `GET /vocabularies` - List all controlled vocabularies +- `GET /vocabularies/{vocab_name}` - Get specific vocabulary + +### Management API (`/manage`) - **Requires authentication** +- `GET /manage/login` - ORCID OAuth login +- `POST /manage/identifiers` - Mint new identifiers + +### Metrics API (`/metrics`) +- `GET /metrics` - Prometheus-compatible metrics + +## Development Workflow + +### Code Quality Tools + +The project enforces code quality through: + +1. **flake8** - Linting (max complexity 10) +```bash +flake8 isb_lib isb_web scripts tests +``` + +2. **mypy** - Type checking +```bash +mypy isb_lib isb_web scripts +``` + +3. **black** - Code formatting (recommended) +```bash +black isb_lib isb_web scripts tests +``` + +4. 
**pytest** - Unit testing (71% coverage minimum required) +```bash +pytest --cov --cov-fail-under=71 +``` + +### Git Workflow + +- Main branch: `main` (production) +- Development branch: `develop` +- Feature branches: Create from `develop` +- Pull requests must pass CI/CD checks (GitHub Actions) + +### CI/CD + +GitHub Actions workflows: +- `.github/workflows/python-app.yml` - Unit tests + linting on every PR +- `.github/workflows/python-integration-test.yaml` - Integration tests + +## Testing + +### Unit Tests + +```bash +# Run all tests with coverage +pytest --cov --cov-fail-under=71 + +# Run specific test file +pytest tests/test_core.py + +# Run with verbose output +pytest -v +``` + +### Integration Tests + +Integration tests verify end-to-end functionality: + +```bash +# Run integration tests (requires running Solr + PostgreSQL) +pytest integration_tests/ +``` + +See `docs/indexing_integration_test.md` for details. + +## Deployment + +### Docker Deployment + +The project includes Docker support for containerized deployment: + +```bash +# Build Docker image +docker build -t isamples_inabox . + +# Run with docker-compose (includes PostgreSQL + Solr) +docker-compose up +``` + +### Production Considerations + +1. **Database**: Use managed PostgreSQL service (AWS RDS, Google Cloud SQL) +2. **Solr**: Run in SolrCloud mode with ZooKeeper for high availability +3. **Web Service**: Deploy behind reverse proxy (Nginx) with HTTPS +4. **Secrets**: Use environment variables for sensitive configuration +5. **Monitoring**: Enable Prometheus metrics endpoint (`/metrics`) + +### Environment Variables + +Key environment variables for production: + +```bash +db_url=postgresql+psycopg2://user:pass@host:5432/dbname +solr_url=http://solr-host:8983/solr/isb_core_records/ +ORCID_CLIENT_ID=your_orcid_client_id +ORCID_CLIENT_SECRET=your_orcid_secret +ORCID_ISSUER=https://orcid.org +orcid_superusers=0000-0001-2345-6789,0000-0002-3456-7890 +``` + +## Data Model + +### Core Entity: Thing + +The `Thing` model (in `isb_lib/models/thing.py`) represents a sample: + +**Key Fields:** +- `id` - Globally unique identifier (format: `scheme:value`) +- `authority_id` - Source authority (SESAR, GEOME, etc.) 
+- `resolved_content` - Full JSON metadata from source +- `resolved_status` - HTTP status of last fetch +- `item_type` - Type of sample +- `tcreated` - Creation timestamp +- `tstamp` - Last update timestamp +- Plus 30+ additional metadata fields + +### ISBCoreRecord + +Extended metadata following iSamples Core schema: +- Sample identifiers and labels +- Geospatial information (lat/lon, elevation, H3 hexagons) +- Sampling context (site, purpose, method) +- Material and specimen classifications +- Curation information +- Related resources + +## Documentation + +Additional documentation in `docs/`: + +- `python_setup.md.html` - Python environment setup +- `authentication_and_identifiers.md` - ORCID OAuth and DOI minting +- `export_service.md` - Export API usage +- `SOLR_Performance_Testing.md` - Performance benchmarking +- `sitemaps_and_transport.md` - Sitemap generation +- `hypothesis_integration.md` - Web annotation integration +- `flat_file_import.md` - CSV import procedures + +## Support and Contributing + +- **Issues**: Report bugs at https://github.com/isamplesorg/isamples_inabox/issues +- **Contributing**: Submit pull requests to `develop` branch +- **License**: Apache 2.0 + +## Related Repositories + +### iSamples Export Client +- **Repository**: https://github.com/isamplesorg/export_client +- **Purpose**: CLI tool for exporting iSamples data with GeoParquet support +- **Key Features**: + - Export to GeoParquet, CSV, and JSONL formats + - STAC metadata generation + - Local web server for viewing exports + - ORCID authentication integration +- **Installation**: `pipx install "git+https://github.com/isamplesorg/export_client.git"` +- **Documentation**: [docs/geoparquet_export_code.md](docs/geoparquet_export_code.md) + +### Other iSamples Repositories +- **Metadata Schemas**: https://github.com/isamplesorg/metadata - Core metadata specifications +- **Vocabularies**: https://github.com/isamplesorg/vocabularies - Controlled vocabularies +- **PQG (Property Graph)**: https://github.com/isamplesorg/pqg - Property graph in DuckDB + +## Common Tasks + +### Add a new sample source + +1. Create adapter in `isb_lib/` (e.g., `newsource_adapter.py`) +2. Create transformer in `isamples_metadata/` (e.g., `NewSourceTransformer.py`) +3. Create CLI script in `scripts/` (e.g., `newsource_things.py`) +4. Add entry point to `pyproject.toml` +5. 
Update documentation + +### Export data + +#### Option 1: Using the Export Client (Recommended for GeoParquet) + +The **iSamples Export Client** (https://github.com/isamplesorg/export_client) provides a CLI tool with GeoParquet support: + +```bash +# Install export client +pipx install "git+https://github.com/isamplesorg/export_client.git" + +# Login to get JWT +isample login + +# Export to GeoParquet format +export TOKEN="your_jwt_token" +isample export -j $TOKEN -f geoparquet -d /output -q 'source:SESAR' + +# Also supports CSV and JSONL +isample export -j $TOKEN -f csv -d /output -q 'keywords:geology' +``` + +**Export Client Features:** +- **Formats**: JSONL, CSV, and **GeoParquet** (not available via server API) +- **STAC Metadata**: Automatically generates STAC catalog +- **Local Viewer**: Built-in web server to browse exports +- **See**: [docs/geoparquet_export_code.md](docs/geoparquet_export_code.md) for implementation details + +#### Option 2: Direct API Access (CSV/JSONL only) + +```bash +# Via API (requires ORCID authentication) +curl -H "Authorization: Bearer " \ + "https://central.isample.xyz/isamples_central/export/create?q=source:SESAR&export_format=jsonl" + +# Returns: {"status":"created","uuid":"..."} + +# Check status +curl -H "Authorization: Bearer " \ + "https://central.isample.xyz/isamples_central/export/status?uuid=..." + +# Download when complete +curl -H "Authorization: Bearer " \ + "https://central.isample.xyz/isamples_central/export/download?uuid=..." +``` + +### Query samples + +```bash +# Search via Solr API +curl "http://localhost:8000/solr/search?q=keywords:geology&rows=10" + +# Get specific Thing +curl "http://localhost:8000/thing/igsn:XXXXX" + +# Get geospatial heatmap +curl "http://localhost:8000/solr/heatmap?q=*:*&h3_resolution=4" +``` + +## Technology Stack Summary + +- **Language**: Python 3.11+ +- **Web Framework**: FastAPI 0.104.0 + Uvicorn +- **Database**: PostgreSQL (SQLAlchemy/SQLModel ORM) +- **Search**: Apache Solr 8.8+ +- **Authentication**: OAuth2 (ORCID), JWT +- **Geospatial**: Shapely, H3, GeoJSON +- **Data Processing**: PETL, Pandas +- **Testing**: pytest (71% coverage minimum) +- **Dependency Management**: Poetry +- **Code Quality**: flake8, mypy, black + +--- + +**Last Updated**: 2025-11-14 +**Project Repository**: https://github.com/isamplesorg/isamples_inabox diff --git a/docs/geoparquet_export_code.md b/docs/geoparquet_export_code.md new file mode 100644 index 00000000..936ce2a7 --- /dev/null +++ b/docs/geoparquet_export_code.md @@ -0,0 +1,381 @@ +# GeoParquet Export Code - Location and Implementation + +## Summary + +The code that generates the iSamples GeoParquet export file is located in a **separate repository**: + +**Repository**: https://github.com/rdhyee/export_client (also at https://github.com/isamplesorg/export_client) + +## Export Client Overview + +The `export_client` is a Python CLI tool (`isample`) that retrieves content from the iSamples Export Service and provides GeoParquet conversion capabilities. 
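+
+The end product is a standard GeoParquet file, so it can be opened with ordinary geospatial tooling. A minimal sketch for inspecting an export (using the Zenodo file referenced later in this document; assumes geopandas is installed):
+
+```python
+import geopandas as gpd
+
+# Read the GeoParquet export and confirm the spatial metadata
+gdf = gpd.read_parquet("isamples_export_2025_04_21_16_23_46_geo.parquet")
+print(len(gdf), "samples")
+print(gdf.crs)              # expected: EPSG:4326 (WGS84)
+print(gdf.geometry.head())  # Point geometries built from sample locations
+```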
+ +### Key Features + +- **CLI Tool**: `isample` command-line interface +- **Authentication**: ORCID OAuth with JWT tokens +- **Export Formats**: JSONL, CSV, and **GeoParquet** +- **STAC Support**: Generates STAC (SpatioTemporal Asset Catalog) metadata +- **Local Server**: Can run a web server to view exported data + +### Installation + +```bash +# Install with pipx +pipx install "git+https://github.com/isamplesorg/export_client.git" + +# Or with Poetry +git clone https://github.com/isamplesorg/export_client.git +cd export_client +poetry install +``` + +## GeoParquet Export Implementation + +### Architecture + +The GeoParquet export follows this workflow: + +``` +1. User runs: isample export -f geoparquet -q "source:SMITHSONIAN" -d /output + ↓ +2. Export Client requests JSONL format from iSamples server + ↓ +3. Server returns JSONL file (one JSON object per line) + ↓ +4. Export Client downloads JSONL file + ↓ +5. Export Client converts JSONL → GeoParquet + ↓ +6. Output: isamples_export_YYYY_MM_DD_HH_MM_SS_geo.parquet +``` + +### Core Code: `geoparquet_utilities.py` + +Location: `isamples_export_client/geoparquet_utilities.py` + +```python +import logging +import os.path + + +def write_geoparquet_from_json_lines(filename: str) -> str: + import pandas as pd + import geopandas as gpd + + logging.info(f"Transforming json lines file at {filename} to geoparquet") + filename_no_extension = os.path.splitext(filename)[0] + + # 1. Read JSONL file with pandas + with open(filename, "r") as json_file: + df = pd.read_json(json_file, lines=True) + + # 2. Extract longitude/latitude from nested "produced_by" field + normalized_produced_by = pd.json_normalize(df["produced_by"]) + df["sample_location_longitude"] = normalized_produced_by["sampling_site.sample_location.longitude"] + df["sample_location_latitude"] = normalized_produced_by["sampling_site.sample_location.latitude"] + + # 3. Create GeoDataFrame with Point geometries + gdf = gpd.GeoDataFrame( + df, + geometry=gpd.points_from_xy( + df.sample_location_longitude, + df.sample_location_latitude + ), + crs="EPSG:4326" # WGS84 coordinate reference system + ) + + # 4. Export to GeoParquet + dest_file = f"{filename_no_extension}_geo.parquet" + gdf.to_parquet(dest_file) + logging.info(f"Wrote geoparquet file to {dest_file}") + return dest_file +``` + +### Key Implementation Details + +1. **Data Source**: Reads from JSONL (JSON Lines) format + - Each line is a complete JSON object representing a sample + - Schema follows iSamples Core metadata specification + +2. **Coordinate Extraction**: + - Uses `pd.json_normalize()` to flatten nested `produced_by` structure + - Extracts: `produced_by.sampling_site.sample_location.longitude` + - Extracts: `produced_by.sampling_site.sample_location.latitude` + +3. **Geometry Creation**: + - Uses `gpd.points_from_xy()` to create Point geometries + - Stores as GeoDataFrame with proper geometry column + +4. **Coordinate Reference System**: + - **CRS**: EPSG:4326 (WGS84) + - Standard geographic coordinate system (latitude/longitude in degrees) + +5. **Output Format**: + - GeoParquet: Apache Parquet with GeoParquet spatial extension + - Filename pattern: `{original_name}_geo.parquet` + +### Integration in Export Client + +Location: `isamples_export_client/export_client.py` (lines 96-101, 452-453) + +```python +class ExportClient: + def __init__(self, ..., format: str, ...): + # When user requests geoparquet format... 
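+        # The server-side export service only produces CSV and JSONL, so a
+        # "geoparquet" request is fulfilled by downloading JSONL and converting
+        # it locally once the download completes.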
+ if format == "geoparquet": + self._format = "jsonl" # Request JSONL from server + self.is_geoparquet = True + else: + self._format = format + self.is_geoparquet = False + + def perform_full_download(self): + # ... download JSONL file ... + filename = self.download(uuid) + + # Convert to GeoParquet if requested + parquet_filename = None + if self.is_geoparquet: + parquet_filename = write_geoparquet_from_json_lines(filename) +``` + +## Dependencies + +From `pyproject.toml`: + +```toml +[tool.poetry.dependencies] +python = "^3.11" +pandas = "^2.2.2" +geopandas = "^0.14.4" +geoarrow-pyarrow = "^0.1.2" +geoarrow-pandas = "^0.1.1" +duckdb = "^0.10.2" +``` + +Key libraries: +- **pandas** 2.2.2+ - Data manipulation +- **geopandas** 0.14.4+ - Geographic data operations +- **geoarrow-pyarrow** 0.1.2+ - Arrow/Parquet geographic data +- **duckdb** 0.10.2+ - For querying exported data + +## Usage Example + +### Command Line + +```bash +# 1. Login to get JWT token +isample login +# Browser opens for ORCID authentication +# Copy the JWT token + +# 2. Export to GeoParquet +export TOKEN="your_jwt_token_here" + +isample export \ + -j $TOKEN \ + -f geoparquet \ + -d /output/directory \ + -q 'source:SMITHSONIAN' +``` + +### What Gets Created + +The export creates a directory structure like: + +``` +/output/directory/ +└── 2025_04_21_16_23_46/ + ├── isamples_export_2025_04_21_16_23_46.jsonl # Original JSONL + ├── isamples_export_2025_04_21_16_23_46_geo.parquet # GeoParquet! + ├── manifest.json # Export metadata + └── stac.json # STAC metadata +``` + +### Output File Details + +**GeoParquet File**: `isamples_export_2025_04_21_16_23_46_geo.parquet` + +This file contains: +- All sample metadata fields from iSamples Core schema +- A `geometry` column with Point geometries +- Coordinate columns: `sample_location_latitude`, `sample_location_longitude` +- Full nested JSON structures preserved (produced_by, curation, etc.) +- Efficient columnar storage (Parquet format) +- Geographic metadata (GeoParquet specification) + +## Zenodo Export File + +The file available at https://zenodo.org/records/15278211/files/isamples_export_2025_04_21_16_23_46_geo.parquet +was created using this exact process: + +```bash +# Likely command used: +isample export \ + -j $TOKEN \ + -f geoparquet \ + -d /tmp \ + -q '*:*' # Export all records +``` + +## Data Schema + +### Input JSONL Schema (iSamples Core) + +Each line in the JSONL file contains a sample record like: + +```json +{ + "sample_identifier": "IGSN:BSU0005H1", + "@id": "https://isample.org/thing/BSU0005H1", + "label": "BJJ-4487", + "description": "...", + "source_collection": "SESAR", + "has_specimen_category": [...], + "has_material_category": [...], + "has_context_category": [...], + "keywords": [...], + "produced_by": { + "identifier": "event_id", + "label": "Event label", + "result_time": "2019-09-10T03:41:45Z", + "sampling_site": { + "label": "Site name", + "place_name": ["Arizona", "USA"], + "sample_location": { + "latitude": 31.8854, + "longitude": -110.7733, + "elevation": 1200.0 + } + }, + "responsibility": [...] + }, + "curation": {...}, + "registrant": {...} +} +``` + +### Output GeoParquet Schema + +The GeoParquet file has: + +1. **All original JSONL fields** (preserved as-is) +2. **Additional extracted fields**: + - `sample_location_latitude` (float64) + - `sample_location_longitude` (float64) +3. **Geometry column**: + - Name: `geometry` + - Type: Point (2D) + - CRS: EPSG:4326 + +## Why This Architecture? 
+ +The design choice to keep GeoParquet conversion **client-side** has several benefits: + +1. **Server Simplicity**: iSamples server only needs to support JSONL and CSV +2. **Flexibility**: Client can add new formats without server changes +3. **Bandwidth**: JSONL is more compact than GeoParquet for transmission +4. **Local Control**: Users can customize conversion if needed +5. **STAC Integration**: Client generates STAC metadata alongside GeoParquet + +## Comparison with isamples_inabox Export Service + +### isamples_inabox (Server) +- **Location**: `isb_web/export.py`, `isb_lib/utilities/solr_result_transformer.py` +- **Formats**: CSV, JSONL only +- **Architecture**: Server-side transformation +- **Output**: File available via API endpoint +- **Dependencies**: petl, no geographic libraries + +### export_client (Client) +- **Location**: `isamples_export_client/geoparquet_utilities.py` +- **Formats**: CSV, JSONL, GeoParquet +- **Architecture**: Client-side transformation (JSONL → GeoParquet) +- **Output**: Local file with STAC metadata +- **Dependencies**: pandas, geopandas, geoarrow + +## Extending the Export + +### To Add GeoParquet Support to isamples_inabox Server + +If you wanted to add native GeoParquet support to the server, you would: + +1. **Add dependencies** to `requirements.txt`: + ``` + geopandas>=0.14.4 + pyarrow>=10.0.0 + ``` + +2. **Update `TargetExportFormat` enum** in `isb_lib/utilities/solr_result_transformer.py`: + ```python + class TargetExportFormat(Enum): + CSV = "CSV" + JSONL = "JSONL" + GEOPARQUET = "GEOPARQUET" # Add this + ``` + +3. **Create `GeoParquetExportTransformer`** class: + ```python + class GeoParquetExportTransformer(AbstractExportTransformer): + @staticmethod + def transform(table: Table, dest_path_no_extension: str, append: bool) -> list[str]: + import pandas as pd + import geopandas as gpd + + # Convert petl table to pandas DataFrame + df = pd.DataFrame(table.dicts()) + + # Extract coordinates + lat = df[SOLR_PRODUCED_BY_SAMPLING_SITE_LOCATION_LATITUDE] + lon = df[SOLR_PRODUCED_BY_SAMPLING_SITE_LOCATION_LONGITUDE] + + # Create GeoDataFrame + gdf = gpd.GeoDataFrame( + df, + geometry=gpd.points_from_xy(lon, lat), + crs="EPSG:4326" + ) + + # Export + dest_path = f"{dest_path_no_extension}.parquet" + gdf.to_parquet(dest_path) + return [dest_path] + ``` + +However, the current client-side approach is probably better for the reasons listed above. 
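+
+Whichever path produces the file, the exported GeoParquet can be queried directly. DuckDB (already listed above as an export-client dependency for querying exported data) reads Parquet natively; a minimal sketch, assuming the Zenodo export file has been downloaded locally:
+
+```python
+import duckdb
+
+con = duckdb.connect()
+# Count samples per source collection straight from the Parquet file
+print(con.execute("""
+    SELECT source_collection, COUNT(*) AS n
+    FROM read_parquet('isamples_export_2025_04_21_16_23_46_geo.parquet')
+    GROUP BY source_collection
+    ORDER BY n DESC
+""").df())
+```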
+ +## Additional Resources + +- **Export Client Repository**: https://github.com/isamplesorg/export_client +- **Export Client Documentation**: https://github.com/isamplesorg/export_client/blob/main/README.md +- **iSamples Export Service Docs**: https://github.com/isamplesorg/isamples_inabox/blob/develop/docs/export_service.md +- **GeoParquet Specification**: https://geoparquet.org/ +- **iSamples Core Schema**: https://github.com/isamplesorg/metadata + +## Testing the Export Code + +```bash +# Clone the export_client repository +git clone https://github.com/isamplesorg/export_client.git +cd export_client + +# Install dependencies +poetry install + +# Run tests +poetry run pytest + +# Test GeoParquet conversion directly +poetry run python -c " +from isamples_export_client.geoparquet_utilities import write_geoparquet_from_json_lines +result = write_geoparquet_from_json_lines('test_data.jsonl') +print(f'Created: {result}') +" +``` + +--- + +**Document Updated**: 2025-11-14 +**Export Client Version**: 0.2.2 +**Repository**: https://github.com/rdhyee/export_client diff --git a/docs/geoparquet_export_findings.md b/docs/geoparquet_export_findings.md new file mode 100644 index 00000000..284e2afb --- /dev/null +++ b/docs/geoparquet_export_findings.md @@ -0,0 +1,214 @@ +# GeoParquet Export Code - Investigation Findings + +## Summary + +An investigation was conducted to locate the code that generates the iSamples GeoParquet export file available at: +https://zenodo.org/records/15278211/files/isamples_export_2025_04_21_16_23_46_geo.parquet + +## Investigation Results + +**🎉 UPDATE: CODE FOUND!** + +The GeoParquet export code is located in a **separate repository**: + +**Repository**: https://github.com/rdhyee/export_client (also at https://github.com/isamplesorg/export_client) + +**See**: [geoparquet_export_code.md](./geoparquet_export_code.md) for complete documentation of the export implementation. + +--- + +## Original Investigation + +**Initial Status**: The GeoParquet export code was **NOT FOUND** in the current `isamples_inabox` repository. + +## Search Methods Used + +1. **Pattern Matching**: Searched for keywords including: + - `geoparquet`, `GeoParquet`, `geo.parquet` + - `parquet`, `Parquet`, `PARQUET` + - `pyarrow`, `arrow`, `geopandas`, `gpd.to_parquet` + - `to_parquet` (the typical method for writing parquet files) + +2. **File Inspection**: Examined key files: + - `isb_web/export.py` - Main export service (only supports CSV and JSONL) + - `isb_lib/utilities/solr_result_transformer.py` - Export transformers (only CSV and JSONL) + - All scripts in `scripts/` directory + - Jupyter notebooks in `notes/` directory + +3. **Git History**: Searched commit history for export-related changes + +4. **Dependency Analysis**: Checked for parquet-related libraries in requirements + +## Current Export Capabilities + +The `isamples_inabox` repository **currently supports only two export formats**: + +### 1. CSV Export +- **Class**: `CSVExportTransformer` in `isb_lib/utilities/solr_result_transformer.py:61-69` +- **Method**: Uses `petl.io.csv.tocsv()` or `petl.io.csv.appendcsv()` +- **Output**: Flat CSV file with renamed columns + +### 2. 
JSONL Export (JSON Lines) +- **Class**: `JSONExportTransformer` in `isb_lib/utilities/solr_result_transformer.py:72-132` +- **Method**: Writes one JSON object per line +- **Output**: Structured JSON following iSamples metadata schema + +### Export Format Enum +```python +# From isb_lib/utilities/solr_result_transformer.py:38-50 +class TargetExportFormat(Enum): + """Valid target export formats""" + CSV = "CSV" + JSONL = "JSONL" +``` + +**Notable Absence**: No `PARQUET` or `GEOPARQUET` format option exists. + +## Export Service Architecture + +The current export service (`isb_web/export.py`) works as follows: + +1. User creates export job via API: `/export/create?q=...&export_format=CSV|JSONL` +2. Export job queued in database (`ExportJob` model) +3. Background worker queries Solr +4. `SolrResultTransformer` converts results to target format +5. File written to `/tmp/{uuid}.csv` or `.jsonl` +6. User downloads via `/export/download?uuid=...` + +## Likely Origins of GeoParquet Export + +Given the investigation results, the GeoParquet file was most likely created using **ONE** of the following methods: + +### Hypothesis 1: External Script (Most Likely) +A standalone Python script was created **outside the main repository** to: +1. Query the iSamples Solr index or PostgreSQL database +2. Fetch sample records with geospatial coordinates +3. Use `geopandas` to create GeoDataFrame +4. Export to GeoParquet using `geopandas.GeoDataFrame.to_parquet()` + +**Typical code pattern:** +```python +import geopandas as gpd +from shapely.geometry import Point +import pandas as pd + +# Query database/Solr for samples +samples = fetch_samples() # Custom function + +# Create geometry column +geometry = [Point(xy) for xy in zip(samples['longitude'], samples['latitude'])] +gdf = gpd.GeoDataFrame(samples, geometry=geometry, crs='EPSG:4326') + +# Export to GeoParquet +gdf.to_parquet('isamples_export_2025_04_21_16_23_46_geo.parquet') +``` + +### Hypothesis 2: Different Repository/Branch +The code may exist in: +- A different branch not checked out +- A separate repository for data exports/analytics +- A private/internal repository +- A personal development repository + +### Hypothesis 3: One-Time Script +The export may have been created using an ad-hoc script that was: +- Run manually on the server +- Not committed to version control +- Deleted after execution +- Created for a specific publication/dataset release + +### Hypothesis 4: Notebook-Based Export +The export may have been created in a Jupyter notebook that: +- Connected directly to the database +- Performed custom transformations +- Exported to GeoParquet +- Was not committed to the repository + +## Recommendations + +### To Locate the Original Code: + +1. **Ask the team member who created the Zenodo upload** + - Check Zenodo metadata for uploader information + - Ask about the script/method used + +2. **Check server/production environments** + - Look in `/home/` directories for user scripts + - Check cron jobs or scheduled tasks + - Search for `*.py` files with "parquet" in content + +3. **Search other repositories** + - Check `isamplesorg` GitHub organization for related repos + - Look for data analysis or export-specific repositories + +4. **Check documentation/notes** + - Look for data release documentation + - Check for README files describing export process + +### To Recreate the Export: + +If the original code cannot be found, a new GeoParquet export can be created by: + +1. 
**Extending the existing export service** (Recommended) + - Add `PARQUET` and `GEOPARQUET` to `TargetExportFormat` enum + - Create `ParquetExportTransformer` class + - Create `GeoParquetExportTransformer` class using `geopandas` + - Update export API to support new formats + +2. **Creating a standalone script** (Quick solution) + - Query Solr or PostgreSQL directly + - Transform to GeoDataFrame + - Export to GeoParquet + - See `docs/geoparquet_to_pqg_conversion_plan.md` for reference + +## Required Dependencies for GeoParquet Export + +To create GeoParquet exports, these packages would be needed (not currently in requirements): + +``` +geopandas>=0.14.0 +pyarrow>=10.0.0 +shapely>=2.0.0 +``` + +Current `requirements.txt` includes: +- ✓ `shapely==2.0.2` - For geometry creation +- ✗ `geopandas` - NOT present (would be needed) +- ✗ `pyarrow` - NOT present (would be needed for Parquet) + +## Investigation Statistics + +- **Files searched**: 153+ Python files +- **Keywords searched**: 8 different patterns +- **Directories examined**: All major directories (`isb_lib`, `isb_web`, `scripts`, `notes`) +- **Git commits reviewed**: 20+ export-related commits +- **Time spent**: Comprehensive search of codebase + +## Conclusion + +### Original Conclusion (Before Finding Code) +The GeoParquet export code does **not exist in the current `isamples_inabox` repository**. The file was most likely created using: +1. An external standalone script (most probable) ✅ **CORRECT** +2. A Jupyter notebook +3. Code in a different repository or branch ✅ **CORRECT** +4. An ad-hoc one-time export script + +### Final Conclusion (After Finding Code) + +**The investigation was correct!** The GeoParquet export code exists in a **separate repository**: https://github.com/rdhyee/export_client + +**Key Findings:** +- The `export_client` repository contains a CLI tool (`isample`) for exporting iSamples data +- GeoParquet conversion is implemented in `isamples_export_client/geoparquet_utilities.py` +- The export process: Server provides JSONL → Client converts to GeoParquet +- Uses pandas, geopandas, and pyarrow for the conversion +- The Zenodo file was created using: `isample export -f geoparquet -q '*:*'` + +**Documentation**: See [geoparquet_export_code.md](./geoparquet_export_code.md) for complete implementation details. + +--- + +**Investigation Date**: 2025-11-14 +**Code Found Date**: 2025-11-14 +**Repository Commit**: f8fd9d4 +**Investigator**: Claude (AI Assistant) diff --git a/docs/geoparquet_to_pqg_conversion_plan.md b/docs/geoparquet_to_pqg_conversion_plan.md new file mode 100644 index 00000000..27dc9757 --- /dev/null +++ b/docs/geoparquet_to_pqg_conversion_plan.md @@ -0,0 +1,966 @@ +# Conversion Plan: iSamples GeoParquet to PQG Format + +## Overview + +This document provides a detailed plan for converting the iSamples GeoParquet export file +(`isamples_export_2025_04_21_16_23_46_geo.parquet`) into the PQG (Property Graph in DuckDB) format +as documented at https://github.com/isamplesorg/pqg. + +**📝 Note**: The GeoParquet export code was located at https://github.com/rdhyee/export_client. +See [geoparquet_export_code.md](./geoparquet_export_code.md) for details on how the GeoParquet files are created. 
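+
+Before working through the mapping below, it can help to confirm which columns the source file actually contains. A minimal sketch using pyarrow (assumes the Zenodo file has already been downloaded):
+
+```python
+import pyarrow.parquet as pq
+
+pf = pq.ParquetFile("isamples_export_2025_04_21_16_23_46_geo.parquet")
+print(pf.metadata.num_rows, "rows")
+print(pf.schema_arrow)  # column names/types, including the GeoParquet geometry column
+```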
+ +## Background + +### Source Format: GeoParquet +- **File**: `isamples_export_2025_04_21_16_23_46_geo.parquet` (available on Zenodo: https://zenodo.org/records/15278211) +- **Format**: Apache Parquet with GeoParquet spatial extension +- **Content**: iSamples sample metadata including geospatial coordinates +- **Schema**: Based on iSamples Core metadata schema (see `isb_lib/models/isb_core_record.py`) +- **Creation Tool**: Generated using the `isample` CLI from https://github.com/rdhyee/export_client +- **Process**: Server exports JSONL → Client converts to GeoParquet (see `geoparquet_utilities.py`) + +### Target Format: PQG +- **Library**: Python library for property graphs using DuckDB backend +- **Architecture**: Single-table design with nodes and edges +- **Requirements**: Python 3.11+, dataclasses-based models +- **Graph Model**: Nodes (entities) with properties + Edges (relationships) + +## Understanding PQG Structure + +### PQG Nodes Structure +Each node in PQG contains: +- `row_id`: Auto-incrementing primary key +- `pid`: Unique persistent identifier (string) +- `otype`: Object/node type classification +- `label`: Human-readable name +- `description`: Optional text description +- `altids`: Alternative identifiers (list) +- Custom properties as defined by dataclass + +### PQG Edges Structure +Edges follow Subject-Predicate-Object model: +- `s`: Source node reference (internal integer ID) +- `p`: Relationship/predicate type (string) +- `o`: Target node reference(s) - array of integer IDs +- `n`: Optional named graph designation + +### Key PQG Features +- **Automatic decomposition**: Nested objects become separate nodes with edges +- **Geographic support**: Spatial data can be included +- **Export formats**: Parquet, GeoJSON, Graphviz +- **Columnar storage**: Fast queries via DuckDB + +## iSamples Data Model Analysis + +### Core Entity: Sample (Thing) + +Based on `isb_lib/models/isb_core_record.py` and the export service, each sample contains: + +**Primary Identifiers:** +- `sample_identifier` (id) - Main sample ID (e.g., "IGSN:BSU0005H1") +- `@id` (isb_core_id) - iSamples internal identifier +- `source_collection` - Source authority (SESAR, GEOME, etc.) 
+ +**Descriptive Metadata:** +- `label` - Short name/label +- `description` - Full description +- `keywords` - List of keywords +- `informal_classification` - Free-text classification + +**Controlled Vocabularies:** +- `has_specimen_category` - Sample object type (array) +- `has_material_category` - Material classification (array) +- `has_context_category` - Geological context (array) + +**Sampling Event (produced_by):** +- `identifier` - Sampling event ID +- `label`, `description` - Event metadata +- `result_time` - When sample was collected +- `has_feature_of_interest` - What was sampled +- `responsibility` - Array of {role, name} objects (collectors, owners) +- `sampling_site` - Nested location information: + - `place_name` - Array of place names + - `label`, `description` - Site metadata + - `sample_location`: + - `latitude`, `longitude` - Coordinates (decimal degrees) + - `elevation` - Elevation in meters + +**Curation:** +- `label`, `description` - Curation information +- `curation_location` - Where sample is stored +- `responsibility` - Curators (array) +- `access_constraints` - Access restrictions + +**Administrative:** +- `registrant` - {name} who registered the sample +- `sampling_purpose` - Purpose of sampling +- `related_resource` - Links to related resources +- `authorized_by`, `complies_with` - Authorization info +- `last_modified_time` - Source update timestamp + +## Conversion Strategy + +### Graph Model Design + +The iSamples data will be decomposed into a property graph with the following node types and relationships: + +``` +┌──────────────┐ +│ Sample │ +│ (otype: │ +│ "Sample") │ +└──────┬───────┘ + │ + │ has_material_category + ├──────────────────────────────► ┌───────────────────┐ + │ │ MaterialCategory │ + │ has_specimen_category │ (otype: │ + ├──────────────────────────────► │ "Vocabulary") │ + │ └───────────────────┘ + │ has_context_category + ├──────────────────────────────► ┌───────────────────┐ + │ │ ContextCategory │ + │ │ (otype: │ + │ │ "Vocabulary") │ + │ └───────────────────┘ + │ produced_by + ├──────────────────────────────► ┌───────────────────┐ + │ │ SamplingEvent │ + │ │ (otype: │ + │ │ "Event") │ + │ └─────────┬─────────┘ + │ │ + │ │ at_site + │ ├────────► ┌──────────────┐ + │ │ │ SamplingSite │ + │ │ │ (otype: │ + │ │ │ "Place") │ + │ │ │ + geometry │ + │ │ └──────────────┘ + │ │ + │ │ has_responsibility + │ └────────► ┌──────────────┐ + │ │ Person/Org │ + │ │ (otype: │ + │ │ "Agent") │ + │ └──────────────┘ + │ curated_by + ├──────────────────────────────► ┌───────────────────┐ + │ │ Curation │ + │ │ (otype: │ + │ │ "Activity") │ + │ └───────────────────┘ + │ registered_by + └──────────────────────────────► ┌───────────────────┐ + │ Registrant │ + │ (otype: "Agent") │ + └───────────────────┘ +``` + +### Node Types (otype values) + +1. **Sample** - Core sample entity +2. **SamplingEvent** - The event that produced the sample +3. **SamplingSite** - Geographic location (with geometry) +4. **Person** or **Organization** - Agents (collectors, curators, registrants) +5. **VocabularyTerm** - Controlled vocabulary terms (material, specimen, context categories) +6. **Curation** - Curation activity +7. **Keyword** - Keywords for search +8. 
**RelatedResource** - Links to external resources + +### Relationship Types (predicate values) + +- `produced_by` - Sample → SamplingEvent +- `at_site` - SamplingEvent → SamplingSite +- `has_responsibility` - Event/Curation → Person/Organization (with role property) +- `has_material_category` - Sample → VocabularyTerm +- `has_specimen_category` - Sample → VocabularyTerm +- `has_context_category` - Sample → VocabularyTerm +- `has_keyword` - Sample → Keyword +- `curated_by` - Sample → Curation +- `registered_by` - Sample → Person/Organization +- `related_to` - Sample → RelatedResource + +## Implementation Steps + +### Phase 1: Setup and Dependencies + +1. **Install required packages:** +```bash +pip install duckdb pyarrow geopandas pqg +``` + +2. **Create project structure:** +``` +conversion_project/ +├── src/ +│ ├── models.py # PQG dataclass definitions +│ ├── loader.py # Load GeoParquet data +│ ├── transformer.py # Transform to PQG format +│ └── exporter.py # Export PQG graph +├── scripts/ +│ └── convert.py # Main conversion script +├── tests/ +│ └── test_conversion.py # Unit tests +└── README.md +``` + +### Phase 2: Define PQG Data Models + +Create dataclass models in `src/models.py`: + +```python +from dataclasses import dataclass, field +from typing import Optional, List +from pqg import Base + +@dataclass +class Sample(Base): + """Main sample node""" + pid: str # sample_identifier + otype: str = "Sample" + label: str = "" + description: str = "" + altids: List[str] = field(default_factory=list) # e.g., isb_core_id + source_collection: str = "" + informal_classification: List[str] = field(default_factory=list) + last_modified_time: Optional[str] = None + +@dataclass +class SamplingEvent(Base): + """Sampling event that produced the sample""" + pid: str # Constructed from sample_id + "_event" + otype: str = "SamplingEvent" + label: str = "" + description: str = "" + result_time: Optional[str] = None + has_feature_of_interest: str = "" + +@dataclass +class SamplingSite(Base): + """Geographic location with spatial data""" + pid: str # Constructed from coordinates or site_label + otype: str = "SamplingSite" + label: str = "" + description: str = "" + place_names: List[str] = field(default_factory=list) + latitude: Optional[float] = None + longitude: Optional[float] = None + elevation: Optional[float] = None + # PQG supports geometry - can store as WKT or GeoJSON + geometry: Optional[str] = None + +@dataclass +class Agent(Base): + """Person or organization""" + pid: str # Name-based or unique ID + otype: str = "Agent" # Could be "Person" or "Organization" + label: str = "" + role: Optional[str] = None # Role in specific context + +@dataclass +class VocabularyTerm(Base): + """Controlled vocabulary term""" + pid: str # Vocabulary identifier URI + otype: str = "VocabularyTerm" + label: str = "" + category: str = "" # "material", "specimen", or "context" + +@dataclass +class Curation(Base): + """Curation information""" + pid: str # Constructed from sample + curation info + otype: str = "Curation" + label: str = "" + description: str = "" + location: str = "" + access_constraints: List[str] = field(default_factory=list) + +@dataclass +class Keyword(Base): + """Keyword for search""" + pid: str # The keyword itself + otype: str = "Keyword" + label: str = "" +``` + +### Phase 3: Load GeoParquet Data + +Create data loader in `src/loader.py`: + +```python +import geopandas as gpd +import pyarrow.parquet as pq + +class GeoParquetLoader: + """Load iSamples GeoParquet export""" + + def __init__(self, 
parquet_path: str): + self.parquet_path = parquet_path + + def load(self) -> gpd.GeoDataFrame: + """Load GeoParquet file as GeoDataFrame""" + gdf = gpd.read_parquet(self.parquet_path) + print(f"Loaded {len(gdf)} samples") + print(f"Columns: {gdf.columns.tolist()}") + return gdf + + def get_schema(self): + """Examine parquet schema""" + parquet_file = pq.ParquetFile(self.parquet_path) + return parquet_file.schema +``` + +### Phase 4: Transform to PQG Format + +Create transformer in `src/transformer.py`: + +```python +from typing import List, Dict, Set +import json +from pqg import Graph +from .models import ( + Sample, SamplingEvent, SamplingSite, Agent, + VocabularyTerm, Curation, Keyword +) + +class ISamplesToPQGTransformer: + """Transform iSamples data to PQG property graph""" + + def __init__(self): + self.graph = Graph() + self.seen_pids: Set[str] = set() # Track created nodes + + def transform_sample(self, row: dict) -> Sample: + """Transform a single sample record to Sample node""" + sample = Sample( + pid=row['sample_identifier'], + label=row.get('label', ''), + description=row.get('description', ''), + altids=[row.get('@id', '')], # isb_core_id as altid + source_collection=row.get('source_collection', ''), + informal_classification=self._to_list( + row.get('informal_classification', []) + ), + last_modified_time=row.get('last_modified_time') + ) + self.graph.add_node(sample) + return sample + + def transform_sampling_event(self, sample_pid: str, + produced_by: dict) -> SamplingEvent: + """Transform sampling event from produced_by field""" + event_pid = produced_by.get('identifier', + f"{sample_pid}_event") + + event = SamplingEvent( + pid=event_pid, + label=produced_by.get('label', ''), + description=produced_by.get('description', ''), + result_time=produced_by.get('result_time'), + has_feature_of_interest=produced_by.get( + 'has_feature_of_interest', '' + ) + ) + self.graph.add_node(event) + + # Create edge: Sample produced_by SamplingEvent + self.graph.add_edge(sample_pid, 'produced_by', event_pid) + + return event + + def transform_sampling_site(self, event_pid: str, + sampling_site: dict) -> SamplingSite: + """Transform sampling site with geographic data""" + # Use coordinates or label to create unique PID + lat = sampling_site.get('sample_location', {}).get('latitude') + lon = sampling_site.get('sample_location', {}).get('longitude') + + if lat and lon: + site_pid = f"site_{lat}_{lon}" + else: + site_pid = f"site_{sampling_site.get('label', 'unknown')}" + + # Create Point geometry if coordinates available + geometry = None + if lat and lon: + geometry = f"POINT({lon} {lat})" # WKT format + + site = SamplingSite( + pid=site_pid, + label=sampling_site.get('label', ''), + description=sampling_site.get('description', ''), + place_names=self._to_list(sampling_site.get('place_name', [])), + latitude=lat, + longitude=lon, + elevation=sampling_site.get('sample_location', {}).get( + 'elevation' + ), + geometry=geometry + ) + + if site_pid not in self.seen_pids: + self.graph.add_node(site) + self.seen_pids.add(site_pid) + + # Create edge: SamplingEvent at_site SamplingSite + self.graph.add_edge(event_pid, 'at_site', site_pid) + + return site + + def transform_agents(self, context_pid: str, + relationship_type: str, + responsibilities: List[dict]): + """Transform responsibility records to Agent nodes""" + for resp in responsibilities: + name = resp.get('name', '') + role = resp.get('role', '') + + # Create agent PID from name (could enhance with ORCID if available) + agent_pid = 
f"agent_{name.replace(' ', '_').lower()}" + + if agent_pid not in self.seen_pids: + agent = Agent( + pid=agent_pid, + label=name, + role=role + ) + self.graph.add_node(agent) + self.seen_pids.add(agent_pid) + + # Create edge with role as property + self.graph.add_edge( + context_pid, + relationship_type, + agent_pid, + properties={'role': role} + ) + + def transform_vocabulary_terms(self, sample_pid: str, + terms: List[dict], + category: str, + relationship: str): + """Transform controlled vocabulary terms""" + for term in terms: + term_id = term.get('identifier', '') + if not term_id: + continue + + term_pid = term_id # Use vocabulary URI as PID + + if term_pid not in self.seen_pids: + vocab_term = VocabularyTerm( + pid=term_pid, + label=term_id.split('/')[-1], # Extract label from URI + category=category + ) + self.graph.add_node(vocab_term) + self.seen_pids.add(term_pid) + + # Create edge: Sample → VocabularyTerm + self.graph.add_edge(sample_pid, relationship, term_pid) + + def transform_keywords(self, sample_pid: str, keywords: List[dict]): + """Transform keywords""" + for kw in keywords: + keyword_text = kw.get('keyword', '') + if not keyword_text: + continue + + kw_pid = f"keyword_{keyword_text.lower().replace(' ', '_')}" + + if kw_pid not in self.seen_pids: + keyword = Keyword( + pid=kw_pid, + label=keyword_text + ) + self.graph.add_node(keyword) + self.seen_pids.add(kw_pid) + + self.graph.add_edge(sample_pid, 'has_keyword', kw_pid) + + def transform_curation(self, sample_pid: str, curation: dict): + """Transform curation information""" + if not curation or not any(curation.values()): + return # Skip empty curation + + curation_pid = f"{sample_pid}_curation" + + curation_node = Curation( + pid=curation_pid, + label=curation.get('label', ''), + description=curation.get('description', ''), + location=curation.get('curation_location', ''), + access_constraints=self._to_list( + curation.get('access_constraints', []) + ) + ) + self.graph.add_node(curation_node) + + # Create edge: Sample curated_by Curation + self.graph.add_edge(sample_pid, 'curated_by', curation_pid) + + # Transform curators as agents + if 'responsibility' in curation: + self.transform_agents( + curation_pid, + 'has_curator', + curation['responsibility'] + ) + + def transform_row(self, row: dict): + """Transform a single GeoParquet row to graph nodes/edges""" + # Parse JSON fields if they're strings + row = self._parse_json_fields(row) + + # 1. Create Sample node + sample = self.transform_sample(row) + sample_pid = sample.pid + + # 2. Transform produced_by (sampling event and site) + if 'produced_by' in row and row['produced_by']: + produced_by = row['produced_by'] + event = self.transform_sampling_event(sample_pid, produced_by) + + # 3. Transform sampling site + if 'sampling_site' in produced_by: + self.transform_sampling_site( + event.pid, + produced_by['sampling_site'] + ) + + # 4. Transform event responsibilities (collectors, etc.) + if 'responsibility' in produced_by: + self.transform_agents( + event.pid, + 'has_responsibility', + produced_by['responsibility'] + ) + + # 5. 
Transform vocabulary terms + if 'has_specimen_category' in row: + self.transform_vocabulary_terms( + sample_pid, + row['has_specimen_category'], + 'specimen', + 'has_specimen_category' + ) + + if 'has_material_category' in row: + self.transform_vocabulary_terms( + sample_pid, + row['has_material_category'], + 'material', + 'has_material_category' + ) + + if 'has_context_category' in row: + self.transform_vocabulary_terms( + sample_pid, + row['has_context_category'], + 'context', + 'has_context_category' + ) + + # 6. Transform keywords + if 'keywords' in row: + self.transform_keywords(sample_pid, row['keywords']) + + # 7. Transform curation + if 'curation' in row: + self.transform_curation(sample_pid, row['curation']) + + # 8. Transform registrant + if 'registrant' in row and row['registrant']: + registrant = row['registrant'] + if isinstance(registrant, dict): + name = registrant.get('name', '') + agent_pid = f"agent_{name.replace(' ', '_').lower()}" + + if agent_pid not in self.seen_pids: + agent = Agent(pid=agent_pid, label=name) + self.graph.add_node(agent) + self.seen_pids.add(agent_pid) + + self.graph.add_edge( + sample_pid, + 'registered_by', + agent_pid + ) + + def _to_list(self, value): + """Ensure value is a list""" + if isinstance(value, str): + return [value] + elif isinstance(value, list): + return value + else: + return [] + + def _parse_json_fields(self, row: dict) -> dict: + """Parse JSON string fields to dicts/lists""" + for key, value in row.items(): + if isinstance(value, str) and value.startswith('{'): + try: + row[key] = json.loads(value) + except: + pass + elif isinstance(value, str) and value.startswith('['): + try: + row[key] = json.loads(value) + except: + pass + return row + + def get_graph(self) -> Graph: + """Return the constructed graph""" + return self.graph +``` + +### Phase 5: Main Conversion Script + +Create `scripts/convert.py`: + +```python +#!/usr/bin/env python3 +""" +Convert iSamples GeoParquet export to PQG format + +Usage: + python scripts/convert.py \\ + --input isamples_export_2025_04_21_16_23_46_geo.parquet \\ + --output isamples_graph.duckdb \\ + --export-geojson samples.geojson \\ + --limit 1000 +""" + +import argparse +import sys +from pathlib import Path + +# Add src to path +sys.path.insert(0, str(Path(__file__).parent.parent / 'src')) + +from loader import GeoParquetLoader +from transformer import ISamplesToPQGTransformer + +def main(): + parser = argparse.ArgumentParser( + description='Convert iSamples GeoParquet to PQG format' + ) + parser.add_argument( + '--input', + required=True, + help='Input GeoParquet file path' + ) + parser.add_argument( + '--output', + default='isamples_graph.duckdb', + help='Output DuckDB file path' + ) + parser.add_argument( + '--export-geojson', + help='Optional: Export geographic nodes as GeoJSON' + ) + parser.add_argument( + '--export-parquet', + help='Optional: Export graph as Parquet' + ) + parser.add_argument( + '--limit', + type=int, + help='Limit number of samples to process (for testing)' + ) + parser.add_argument( + '--verbose', + action='store_true', + help='Verbose output' + ) + + args = parser.parse_args() + + # 1. Load GeoParquet + print(f"Loading GeoParquet from {args.input}...") + loader = GeoParquetLoader(args.input) + gdf = loader.load() + + if args.verbose: + print(f"Schema: {loader.get_schema()}") + print(f"Sample columns: {gdf.columns.tolist()}") + + # Limit if requested + if args.limit: + print(f"Limiting to {args.limit} samples for testing") + gdf = gdf.head(args.limit) + + # 2. 
Transform to PQG + print("Transforming to PQG property graph...") + transformer = ISamplesToPQGTransformer() + + for idx, row in gdf.iterrows(): + if args.verbose and idx % 1000 == 0: + print(f"Processed {idx} samples...") + + transformer.transform_row(row.to_dict()) + + graph = transformer.get_graph() + + # 3. Save graph to DuckDB + print(f"Saving graph to {args.output}...") + graph.save(args.output) + + # 4. Export additional formats if requested + if args.export_geojson: + print(f"Exporting geographic data to {args.export_geojson}...") + graph.export_geojson(args.export_geojson) + + if args.export_parquet: + print(f"Exporting graph to Parquet: {args.export_parquet}...") + graph.export_parquet(args.export_parquet) + + # 5. Print statistics + print("\nConversion complete!") + print(f"Nodes: {graph.node_count()}") + print(f"Edges: {graph.edge_count()}") + print(f"Node types: {graph.node_types()}") + print(f"Relationship types: {graph.relationship_types()}") + +if __name__ == '__main__': + main() +``` + +### Phase 6: Testing and Validation + +Create `tests/test_conversion.py`: + +```python +import pytest +from src.loader import GeoParquetLoader +from src.transformer import ISamplesToPQGTransformer +from src.models import Sample, SamplingEvent, SamplingSite + +def test_sample_transformation(): + """Test basic sample transformation""" + row = { + 'sample_identifier': 'IGSN:TEST001', + '@id': 'https://isample.org/thing/TEST001', + 'label': 'Test Sample', + 'description': 'A test sample', + 'source_collection': 'TEST', + 'informal_classification': ['rock'] + } + + transformer = ISamplesToPQGTransformer() + sample = transformer.transform_sample(row) + + assert sample.pid == 'IGSN:TEST001' + assert sample.label == 'Test Sample' + assert 'https://isample.org/thing/TEST001' in sample.altids + +def test_sampling_site_with_coordinates(): + """Test sampling site with geographic coordinates""" + sampling_site = { + 'label': 'Test Site', + 'description': 'A test location', + 'place_name': ['California', 'USA'], + 'sample_location': { + 'latitude': 37.7749, + 'longitude': -122.4194, + 'elevation': 100.0 + } + } + + transformer = ISamplesToPQGTransformer() + site = transformer.transform_sampling_site('event_1', sampling_site) + + assert site.latitude == 37.7749 + assert site.longitude == -122.4194 + assert site.geometry == 'POINT(-122.4194 37.7749)' + assert 'California' in site.place_names + +def test_full_row_transformation(): + """Test complete row transformation with all components""" + row = { + 'sample_identifier': 'IGSN:TEST002', + '@id': 'https://isample.org/thing/TEST002', + 'label': 'Full Test Sample', + 'description': 'Complete test', + 'source_collection': 'TEST', + 'has_material_category': [ + {'identifier': 'http://vocab.org/Rock'} + ], + 'produced_by': { + 'identifier': 'event_test_002', + 'label': 'Test Sampling Event', + 'result_time': '2025-01-15', + 'responsibility': [ + {'name': 'John Doe', 'role': 'Collector'} + ], + 'sampling_site': { + 'label': 'Test Location', + 'sample_location': { + 'latitude': 40.7128, + 'longitude': -74.0060 + } + } + }, + 'keywords': [{'keyword': 'geology'}], + 'registrant': {'name': 'Jane Smith'} + } + + transformer = ISamplesToPQGTransformer() + transformer.transform_row(row) + graph = transformer.get_graph() + + # Verify nodes were created + assert graph.node_count() > 0 + # Verify edges were created + assert graph.edge_count() > 0 +``` + +## Execution Plan + +### Step-by-Step Execution + +1. 
**Download GeoParquet file:** +```bash +# Download from Zenodo +wget https://zenodo.org/records/15278211/files/isamples_export_2025_04_21_16_23_46_geo.parquet +``` + +2. **Setup Python environment:** +```bash +python3.11 -m venv venv +source venv/bin/activate +pip install duckdb pyarrow geopandas pqg +``` + +3. **Test with small subset:** +```bash +python scripts/convert.py \\ + --input isamples_export_2025_04_21_16_23_46_geo.parquet \\ + --output test_graph.duckdb \\ + --limit 100 \\ + --verbose +``` + +4. **Run full conversion:** +```bash +python scripts/convert.py \\ + --input isamples_export_2025_04_21_16_23_46_geo.parquet \\ + --output isamples_full_graph.duckdb \\ + --export-geojson isamples_sites.geojson \\ + --verbose +``` + +5. **Validate results:** +```bash +# Use DuckDB CLI to explore +duckdb isamples_full_graph.duckdb +# Run queries to verify data +``` + +## Expected Challenges and Solutions + +### Challenge 1: Large Data Volume +**Problem**: GeoParquet file may contain millions of samples +**Solution**: +- Process in batches +- Use streaming/iterative processing +- Monitor memory usage +- Consider parallel processing for large datasets + +### Challenge 2: Nested JSON Structures +**Problem**: GeoParquet may store complex nested JSON +**Solution**: +- Implement robust JSON parsing in `_parse_json_fields()` +- Handle both string and native JSON types +- Add error handling for malformed JSON + +### Challenge 3: Duplicate Node Detection +**Problem**: Same agents/locations may appear multiple times +**Solution**: +- Use `seen_pids` set to track created nodes +- Create consistent PID generation for agents (name-based) +- For sites, use coordinate-based PIDs + +### Challenge 4: Missing Geographic Data +**Problem**: Not all samples may have coordinates +**Solution**: +- Make latitude/longitude optional in SamplingSite +- Create site PIDs from labels when coordinates missing +- Skip geometry field if coordinates unavailable + +### Challenge 5: Vocabulary Term URIs +**Problem**: Controlled vocabulary may use full URIs +**Solution**: +- Use full URI as PID +- Extract human-readable label from URI +- Store category type for filtering + +## Query Examples (Post-Conversion) + +Once converted to PQG, you can query the graph using DuckDB SQL: + +```sql +-- Find all samples from SESAR +SELECT * FROM nodes +WHERE otype = 'Sample' +AND source_collection = 'SESAR'; + +-- Find all sampling sites in a region +SELECT * FROM nodes +WHERE otype = 'SamplingSite' +AND latitude BETWEEN 30 AND 40 +AND longitude BETWEEN -120 AND -110; + +-- Find samples by material category +SELECT s.* +FROM nodes s +JOIN edges e ON s.pid = e.s +JOIN nodes v ON e.o[1] = v.row_id +WHERE s.otype = 'Sample' +AND e.p = 'has_material_category' +AND v.category = 'material'; + +-- Find all samples collected by a specific person +SELECT s.* +FROM nodes s +JOIN edges e1 ON s.pid = e1.s +JOIN edges e2 ON e1.o[1] IN (SELECT row_id FROM nodes WHERE pid IN (SELECT o[1] FROM edges WHERE s = e1.o[1])) +JOIN nodes agent ON agent.row_id = e2.o[1] +WHERE s.otype = 'Sample' +AND agent.label = 'John Doe' +AND agent.role = 'Collector'; +``` + +## Performance Considerations + +- **Batch size**: Process 10,000-50,000 records per batch +- **Memory**: Monitor with `--limit` during testing +- **Indexing**: PQG/DuckDB handles indexing automatically +- **Export time**: Full dataset may take 30-60 minutes +- **Storage**: Expect 2-3x size increase due to graph structure + +## Next Steps + +1. 
Review the GeoParquet export code in the `export_client` repository (https://github.com/isamplesorg/export_client); see [geoparquet_export_code.md](./geoparquet_export_code.md)
+2. Implement the data models and transformer classes
+3. Test with a small subset (100-1000 samples)
+4. Validate graph structure and relationships
+5. Run full conversion
+6. Create sample queries for common use cases
+7. Document query patterns for end users
+
+## References
+
+- **PQG Documentation**: https://github.com/isamplesorg/pqg
+- **iSamples GeoParquet**: https://zenodo.org/records/15278211
+- **iSamples Metadata Schema**: See `isb_lib/models/isb_core_record.py`
+- **GeoParquet Specification**: https://geoparquet.org/
+
+---
+
+**Document Version**: 1.0
+**Last Updated**: 2025-11-14
+**Author**: Claude (AI Assistant)