Support maps for CoL Extended taxonomy #105

@fmendezh

Background

The Spark module currently generates HBase tables for point and tile maps using taxonomic identifiers from the GBIF Backbone taxonomy. To support the Catalogue of Life (CoL) Extended release, we need to extend the existing infrastructure to process and store maps based on CoL taxonomic identifiers.

Current Architecture

The current implementation:

  • Reads GBIF occurrence data from Avro files
  • Extracts GBIF Backbone taxonomic keys: kingdomKey, phylumKey, classKey, orderKey, familyKey, genusKey, speciesKey, taxonKey
  • Generates point maps (for low-occurrence views) and tile pyramids (for high-occurrence views)
  • Stores data in HBase tables with multiple EPSG projections (4326, 3857, 3575, 3031)
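The split between point maps and tile pyramids can be pictured with a minimal sketch. It assumes the decision is a simple record-count cut-off per map key; the class name, the method name, and the exact comparison (strictly below the threshold means points) are illustrative assumptions, not taken from the existing code.

```java
/** Hypothetical sketch: pick a storage strategy for one map key from its record count. */
final class MapStrategySketch {

    enum Strategy { POINTS, TILE_PYRAMID }

    /**
     * Assumed rule: keys with fewer records than the threshold are stored as
     * point maps, all others as tile pyramids. The real cut-off semantics may differ.
     */
    static Strategy choose(long occurrenceCount, long threshold) {
        if (threshold <= 0) {
            throw new IllegalArgumentException("threshold must be positive");
        }
        return occurrenceCount < threshold ? Strategy.POINTS : Strategy.TILE_PYRAMID;
    }
}
```

Under this assumption, the same rule would apply unchanged to CoL keys; only the identifiers feeding the count would differ.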

Proposed Solution

Extend the existing Spark jobs to support CoL identifiers and create parallel processing workflows for CoL Extended release maps.

Suggested Tasks

1. Spark Job Modifications

Files requiring changes:

  • MapBuilder.java: Update the readAvroSource() method to select the CoL identifier fields

    • Add CoL taxonomic key columns (e.g., colKingdomKey, colPhylumKey, etc.) to the .select() statement
    • Parameterize taxonomy source to support both GBIF and CoL
  • MapKeysUDF.java: Extend the UDF to handle CoL taxonomic hierarchies

    • Add support for CoL identifier parameters
    • Ensure map key generation works with CoL taxonomy structure
    • Consider adding a taxonomy source parameter to distinguish between GBIF and CoL keys
  • PointMapBuilder.java and TileMapBuilder.java: Update SQL queries to use CoL fields

    • Modify the mapKeys() UDF calls to include CoL identifiers
    • Ensure proper grouping and aggregation with CoL taxonomy
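One possible shape for the taxonomy source parameter is sketched below in plain Java, without Spark, so the idea can be read in isolation. Everything here is hypothetical: the TaxonomySource enum, the column() and mapKeys() helpers, and the "<SOURCE>:<id>" key format are illustrations only. The CoL column names beyond colKingdomKey and colPhylumKey are assumed to follow the same pattern, and identifiers are treated as strings on the assumption that CoL identifiers are not necessarily numeric.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a taxonomy source switch for column selection and map keys. */
final class TaxonomyKeysSketch {

    /** Rank names derived from the GBIF Backbone key fields listed in this issue. */
    static final List<String> RANKS = List.of(
        "kingdom", "phylum", "class", "order", "family", "genus", "species", "taxon");

    enum TaxonomySource {
        GBIF(""),    // kingdomKey, phylumKey, ...
        COL("col");  // colKingdomKey, colPhylumKey, ... (remaining names are assumed)

        private final String prefix;

        TaxonomySource(String prefix) {
            this.prefix = prefix;
        }

        /** Column name for one rank, e.g. "kingdom" gives "kingdomKey" or "colKingdomKey". */
        String column(String rank) {
            if (prefix.isEmpty()) {
                return rank + "Key";
            }
            return prefix + Character.toUpperCase(rank.charAt(0)) + rank.substring(1) + "Key";
        }

        /** All key columns for this source, in rank order, ready to pass to a select. */
        List<String> columns() {
            List<String> names = new ArrayList<>();
            for (String rank : RANKS) {
                names.add(column(rank));
            }
            return names;
        }
    }

    /**
     * Builds distinct map keys in an assumed "<SOURCE>:<id>" format.
     * Null or empty identifiers (ranks missing for a record) are skipped.
     */
    static Set<String> mapKeys(TaxonomySource source, List<String> ids) {
        Set<String> keys = new LinkedHashSet<>();
        for (String id : ids) {
            if (id != null && !id.isEmpty()) {
                keys.add(source.name() + ":" + id);
            }
        }
        return keys;
    }
}
```

Prefixing each key with its source keeps GBIF and CoL keys from colliding if they ever share a table. With fully separate CoL tables, as proposed in this issue, the prefix would be optional.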

2. Configuration Management

  • Create separate configuration files for CoL Extended (e.g., col-extended-dev.yml, col-extended-prod.yml)
    • Define CoL-specific HBase table names (e.g., col_maps_points, col_maps_tiles)
    • Configure separate target directories for CoL map outputs
    • Set appropriate threshold values for tile pyramid generation
    • Define separate Hive database/table names for CoL processing

3. Airflow Workflows

  • Create a new Airflow DAG for CoL point map generation
  • Create a new Airflow DAG for CoL tile map generation
  • Add monitoring and alerting for CoL maps

4. HBase Table Management

  • Create separate HBase tables for CoL Extended maps
    • col_maps_points_<timestamp> for point maps
    • col_maps_tiles_<timestamp> for tile pyramids
  • Update the table creation logic in PrepareBackfill.java to support a taxonomy parameter
  • Ensure cleanup and snapshot management work for CoL tables
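The naming scheme above could be produced by a small helper along these lines. This is a sketch under stated assumptions: the class and method names are hypothetical, the timestamp pattern (UTC, yyyyMMdd_HHmm) is a guess rather than the format used by the existing jobs, and only the col_maps prefix and the points/tiles suffixes come from this issue.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical sketch: derive timestamped HBase table names for a map type. */
final class TableNamesSketch {

    enum MapType {
        POINTS("points"),
        TILES("tiles");

        final String suffix;

        MapType(String suffix) {
            this.suffix = suffix;
        }
    }

    /** Assumed timestamp layout; the real jobs may use a different pattern. */
    private static final DateTimeFormatter STAMP =
        DateTimeFormatter.ofPattern("yyyyMMdd_HHmm").withZone(ZoneOffset.UTC);

    /** Builds "<prefix>_<type>_<timestamp>", e.g. with the prefix "col_maps". */
    static String tableName(String prefix, MapType type, Instant createdAt) {
        if (prefix == null || prefix.isEmpty()) {
            throw new IllegalArgumentException("prefix must not be empty");
        }
        return prefix + "_" + type.suffix + "_" + STAMP.format(createdAt);
    }
}
```

Passing the prefix in as a parameter, rather than hard-coding it, is one way a taxonomy setting could flow through to table creation without duplicating the logic per taxonomy.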

5. Testing & Validation

  • Unit tests for CoL-specific UDF functionality
  • Integration tests with sample CoL occurrence data
  • Validate map generation for various CoL taxonomic ranks

6. Documentation

  • Update README.md with CoL Extended support details
  • Document configuration parameters for CoL workflows
  • Add examples for running CoL map generation
  • Update deployment and operational documentation
