Background
The Spark module currently generates HBase tables for point and tile maps using taxonomic identifiers from the GBIF Backbone taxonomy. To support the Catalogue of Life (CoL) Extended release, we need to extend the existing infrastructure to process and store maps based on CoL taxonomic identifiers.
Current Architecture
The current implementation:
- Reads GBIF occurrence data from Avro files
- Extracts GBIF Backbone taxonomic keys: kingdomKey, phylumKey, classKey, orderKey, familyKey, genusKey, speciesKey, taxonKey
- Generates point maps (for low-occurrence views) and tile pyramids (for high-occurrence views)
- Stores data in HBase tables with multiple EPSG projections (4326, 3857, 3575, 3031)
Proposed Solution
Extend the existing Spark jobs to support CoL identifiers and create parallel processing workflows for CoL Extended release maps.
Suggested Tasks
1. Spark Job Modifications
Files requiring changes:
- MapBuilder.java: Update readAvroSource() method to select CoL identifier fields
  - Add CoL taxonomic key columns (e.g., colKingdomKey, colPhylumKey, etc.) to the .select() statement
  - Parameterize taxonomy source to support both GBIF and CoL
- MapKeysUDF.java: Extend the UDF to handle CoL taxonomic hierarchies
  - Add support for CoL identifier parameters
  - Ensure map key generation works with CoL taxonomy structure
  - Consider adding a taxonomy source parameter to distinguish between GBIF and CoL keys
- PointMapBuilder.java and TileMapBuilder.java: Update SQL queries to use CoL fields
  - Modify the mapKeys() UDF calls to include CoL identifiers
  - Ensure proper grouping and aggregation with CoL taxonomy
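One possible way to parameterize the taxonomy source is sketched below. The TaxonomySource enum is hypothetical and does not exist in the current codebase. The GBIF column names are the ones listed under Current Architecture. Only colKingdomKey and colPhylumKey are named in this issue, so the remaining CoL column names (a "col" prefix plus the capitalized GBIF name) are an assumption.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical enum (not in the current codebase): picks which taxonomy's key columns to read. */
enum TaxonomySource {
  GBIF(""),
  COL("col"); // assumed prefix, inferred from colKingdomKey / colPhylumKey

  // GBIF Backbone key columns currently extracted by the Spark module.
  private static final List<String> BASE_KEYS = List.of(
      "kingdomKey", "phylumKey", "classKey", "orderKey",
      "familyKey", "genusKey", "speciesKey", "taxonKey");

  private final String prefix;

  TaxonomySource(String prefix) {
    this.prefix = prefix;
  }

  /** Column names that a .select() call could use for this taxonomy. */
  List<String> keyColumns() {
    return BASE_KEYS.stream().map(this::columnName).collect(Collectors.toList());
  }

  private String columnName(String baseKey) {
    if (prefix.isEmpty()) {
      return baseKey;
    }
    // kingdomKey -> colKingdomKey
    return prefix + Character.toUpperCase(baseKey.charAt(0)) + baseKey.substring(1);
  }
}
```

With something like this, readAvroSource() and the mapKeys() UDF calls could take a TaxonomySource argument instead of hard-coding either set of column names.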
2. Configuration Management
- Create separate configuration files for CoL Extended (e.g., col-extended-dev.yml, col-extended-prod.yml)
  - Define CoL-specific HBase table names (e.g., col_maps_points, col_maps_tiles)
  - Configure separate target directories for CoL map outputs
  - Set appropriate threshold values for tile pyramid generation
  - Define separate Hive database/table names for CoL processing
3. Airflow Workflows
- Create new Airflow DAG for CoL point maps generation
- Create new Airflow DAG for CoL tile maps generation
- Add monitoring and alerting for CoL maps
4. HBase Table Management
- Create separate HBase tables for CoL Extended maps
  - col_maps_points_<timestamp> for point maps
  - col_maps_tiles_<timestamp> for tile pyramids
- Update table creation scripts in PrepareBackfill.java to support taxonomy parameter
- Ensure cleanup and snapshot management works for CoL tables
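The table naming scheme above could be derived from a taxonomy parameter roughly as follows. This is a minimal sketch: MapTableNames is a hypothetical helper, not part of PrepareBackfill.java, and the timestamp layout is an assumption because the issue only specifies a <timestamp> suffix.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper: builds timestamped HBase table names per taxonomy and map type. */
final class MapTableNames {
  // Assumed timestamp layout; the issue only specifies a <timestamp> suffix.
  private static final DateTimeFormatter STAMP = DateTimeFormatter.ofPattern("yyyyMMdd_HHmm");

  private MapTableNames() {}

  /**
   * @param taxonomyPrefix e.g. "col" for CoL Extended
   * @param mode "points" or "tiles"
   * @param createdAt time used for the table name suffix
   */
  static String tableName(String taxonomyPrefix, String mode, LocalDateTime createdAt) {
    if (!mode.equals("points") && !mode.equals("tiles")) {
      throw new IllegalArgumentException("Unknown map mode: " + mode);
    }
    return taxonomyPrefix + "_maps_" + mode + "_" + createdAt.format(STAMP);
  }
}
```

Keeping the prefix as a parameter means the same code path can name tables for either taxonomy, which also gives cleanup and snapshot jobs a predictable pattern to match.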
5. Testing & Validation
- Unit tests for CoL-specific UDF functionality
- Integration tests with sample CoL occurrence data
- Validate map generation for various CoL taxonomic ranks
6. Documentation
- Update README.md with CoL Extended support details
- Document configuration parameters for CoL workflows
- Add examples for running CoL map generation
- Update deployment and operational documentation