Return matching raw values for multi-valued luc:query fields #3823

Closed
recalcitrantsupplant wants to merge 64 commits into apache:main from aiworkerjohns:fix/lucene-matchraw-multivalue

@recalcitrantsupplant

Summary

  • fix luc:query raw-value binding for multi-valued SHACL fields
  • choose the stored value that actually matches the Lucene query instead of always returning the first stored value
  • add a regression test covering multi-valued identifier fields
  • add a demo query and sample data to reproduce the behavior manually
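
A minimal sketch of the behavior change listed above, not the code from this PR. Given all stored values of a multi-valued field, it returns the value that matches the query instead of the first one. The `RawValuePicker` class, its methods, the prefix-based `matches` predicate and the sample identifiers are all hypothetical stand-ins; the real matching is whatever the Lucene query does.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical illustration; none of these names come from the PR.
final class RawValuePicker {

    // Stand-in for "does this stored value match the query": a
    // case-insensitive prefix test. The real check is an assumption here.
    static boolean matches(String storedValue, String queryTerm) {
        return storedValue.toLowerCase(Locale.ROOT)
                .startsWith(queryTerm.toLowerCase(Locale.ROOT));
    }

    // Behavior before the fix, as described: always the first stored value.
    static Optional<String> pickFirst(List<String> storedValues) {
        return storedValues.stream().findFirst();
    }

    // Behavior after the fix: the first stored value that matches the query.
    static Optional<String> pickMatching(List<String> storedValues, String queryTerm) {
        return storedValues.stream()
                .filter(v -> matches(v, queryTerm))
                .findFirst();
    }

    public static void main(String[] args) {
        // Made-up identifiers for one entity with a multi-valued field.
        List<String> identifiers = List.of("BH-0042", "GSWA-7781");
        System.out.println(pickFirst(identifiers).orElse("(none)"));
        System.out.println(pickMatching(identifiers, "gswa").orElse("(none)"));
    }
}
```

With the sample data, a query for `gswa` binds the second identifier, whereas the first-value strategy would bind an identifier that did not match the query at all.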

Testing

  • added TestShaclLucQueryRawValueOnMultiValuedField
  • refreshed the demo dataset and verified demo/test/queries/09-matchraw-multivalue.rq

aiworkerjohns and others added 30 commits January 15, 2026 13:11
…d comprehensive tests

- Add FacetValue.java: Immutable class representing facet value/count pairs
- Add FacetedTextResults.java: Container for search results with faceting data
- Add TestFacetedResults.java: Comprehensive test suite (6 test methods)
- Add faceting_methods.txt: Implementation for queryWithFacets$ method

This adds faceting capability to Lucene text indexing in Apache Jena.
Code is production-ready but requires build fixes to run tests.
…o jena-text module

- Deleted BUILD_FIX_GUIDE.md, PROJECT_STATUS.md, and PROJECT_TESTING.md as they are no longer needed.
- Added dependency for lucene-facet in pom.xml to enable native faceting capabilities.
- Enhanced TextIndexConfig and TextIndexLucene classes to support faceting, including methods for retrieving facet counts.
- Updated TextQuery to register a new property function for facet counts.

This commit streamlines the documentation and integrates faceting functionality into the jena-text module.
…xt module

- Updated FEAT_FACETS_OUTPUT.md to reflect the new test results and added details about filtered facets.
- Enhanced FEAT_FACETS_SPEC.md to include specifications for filtered facets in the text:facetCounts function.
- Revised FEAT_FACETS_TESTING.md to add tests for filtered facets, including examples for multi-word and single-word queries.
- Modified TextFacetCountsPF.java and TextIndexLucene.java to support detection of search queries in facet counts.
- Updated test cases in TestTextFacetCountsPF.java to validate the new filtered facets functionality.

This commit improves the documentation and testing framework for the newly implemented filtered facets feature, ensuring clarity and comprehensive coverage.
- Replaced the `text:queryWithFacets` and `text:facetCounts` functions with `text:query` and `text:facet`, streamlining the API for clarity and usability.
- Updated documentation in FEAT_FACETS_SPEC.md and FEAT_FACETS_TESTING.md to reflect the new API structure and usage examples.
- Improved the implementation in TextIndexLucene and related classes to support the new faceting methods.
- Removed obsolete classes and methods related to the previous faceting implementation.

This commit enhances the faceting capabilities in the jena-text module, ensuring a more intuitive API and comprehensive documentation.
- Introduced new methods for faceting in the jena-text module, replacing outdated functions for better clarity.
- Updated FEAT_FACETS_SPEC.md and FEAT_FACETS_TESTING.md to include new usage examples and specifications.
- Improved implementation in TextIndexLucene to align with the updated API.
- Removed deprecated classes and methods to streamline the codebase.

This commit refines the faceting capabilities, ensuring a more intuitive API and comprehensive documentation for users.
…mentation

- Introduced a new section in PHASE2_DESIGN.md detailing how to construct JSON filter arguments dynamically with Jena's Composite Datatype (CDT) extension.
- Provided a SPARQL example demonstrating the creation of a JSON filter from `VALUES` clauses, enhancing user understanding of programmatic filter construction.
- Clarified that the CDT `FOLD` function is a Jena extension, emphasizing its utility for users seeking to build filters without hardcoding values.

This update improves the documentation by offering practical usage patterns for dynamic JSON filter creation in SPARQL queries.
- Deleted outdated documentation files related to previous implementations, including `2026-01-23-david-review.md`, `2026-01-27-david-recommendation.md`, `2026-02-09-next-steps.md`, and others.
- Consolidated and updated the faceting API, replacing `text:queryWithFacets` and `text:facetCounts` with `text:query` and `text:facet` for improved clarity and usability.
- Enhanced documentation to reflect the new API structure and added examples for the updated faceting functionality.
- Ensured that the implementation aligns with the latest design decisions and user requirements.

This commit streamlines the documentation and finalizes the faceting API changes, enhancing usability and clarity for users.
- Revised user guide to clarify the differences between Classic and SHACL modes, including updated examples for `text:query` and `luc:query`.
- Enhanced SPARQL API reference to include detailed syntax and examples for `luc:query` and `luc:facet`, reflecting the new functionality.
- Improved configuration documentation to outline properties for both indexing modes, emphasizing the use of `text:shapes` and `text:entityMap`.
- Added sections on deploying with Fuseki and testing, ensuring comprehensive guidance for users.

This commit enhances the clarity and usability of the documentation, aligning it with recent API changes and user needs.
- Updated the Lucene Classic Query Parser documentation link to the latest version.
- Corrected the number of pre-existing tests from 327 to 303 in the testing documentation, ensuring accuracy in test coverage reporting.

These changes enhance the clarity and accuracy of the documentation, reflecting recent updates and maintaining consistency.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- New docs/08-use-cases.md: building-block format showing each feature
  with mermaid diagrams and practical application examples
- Reorder proposed features: inverse/sequence paths, spatial, then
  deferrable group (DrillSideways, hierarchical, range, grouping, suggest)
- Add inverse/sequence paths to feature status table and timeline
- Group deferrable extensions in future work with API impact notes
- Fix diagram readability: transparent subgraph backgrounds, theme-aware text

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Explicit instruction to only create issues/PRs on the fork
(aiworkerjohns/jena), never on upstream (apache/jena).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keeps README high-level with component overview and roadmap.
Detailed sequence and flowchart diagrams now live in 04-architecture.md
where they complement the existing technical content.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
update example
When luc:query and luc:facet appear in the same BGP, SPARQL
cross-joins the results (N×M rows). Updated docs to recommend
separate queries or UNION, and added a design decision documenting
the analysis of handle-based alternatives.
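
The row arithmetic behind this recommendation can be sketched in plain Java. This is not Jena's join code: `crossJoin` and `union` are hypothetical helpers that mimic how two patterns with no shared variables multiply rows in one BGP, while UNION (or two separate queries) only adds them. The sample rows are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of result cardinality; not Jena's join code.
final class JoinCardinality {

    // Two independent patterns in one BGP: every search hit is paired
    // with every facet row, giving N x M rows.
    static List<String> crossJoin(List<String> hits, List<String> facets) {
        List<String> rows = new ArrayList<>();
        for (String hit : hits) {
            for (String facet : facets) {
                rows.add(hit + " | " + facet);
            }
        }
        return rows;
    }

    // UNION or separate queries: rows are concatenated, giving N + M rows.
    static List<String> union(List<String> hits, List<String> facets) {
        List<String> rows = new ArrayList<>(hits);
        rows.addAll(facets);
        return rows;
    }

    public static void main(String[] args) {
        List<String> hits = List.of("doc1", "doc2", "doc3");
        List<String> facets = List.of("Copper=2", "Gold=1");
        System.out.println(crossJoin(hits, facets).size()); // 6
        System.out.println(union(hits, facets).size());     // 5
    }
}
```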

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Closes #15. Includes Fuseki SHACL-mode config with three shapes
(MiningReport, Borehole, Site), 21 hand-crafted entities, 5 demo
queries covering luc:query and luc:facet patterns, a Python synthetic
data generator, and a go-task Taskfile for build/serve/load/query
workflows. All queries verified end-to-end against a running Fuseki
instance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Quote desc values containing colons to fix YAML parse error
- Add Apache License headers to Taskfile.yml and all .rq query files
  (required by RAT plugin)
- Add task stop for clean server shutdown (SIGTERM)
- Add git commit style rule to CLAUDE.md
Add Dockerfile, docker-compose.yml, and Taskfile tasks for building
and pushing the demo image to GitHub CR or Azure CR. Includes
DockerReadme.md documenting the full workflow.
Extend SHACL entity-per-document indexing to support complex SHACL
property paths (sequence, inverse, alternative) in addition to simple
predicate paths. This enables indexing values that are reachable via
multi-hop traversals or reverse relationships in the RDF graph.

Implementation:
- ShaclIndexAssembler: parse full SHACL path syntax into jena-arq Path
  objects, extract leaf predicates for change listener compatibility
- ShaclTextDocProducer: use PathEval for complex paths, keep fast direct
  triple match for simple predicates
- ShaclIndexMapping: add Path field to FieldDef with backward-compatible
  constructor

Demo: add authors with sequence path (ex:authoredBy/ex:name) and inverse
path (^ex:authored) fields, new queries, and README with expected results.
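
To illustrate what sequence and inverse paths select, here is a small sketch over an in-memory triple list. It does not use Jena's `Path` or `PathEval` classes; `PathSketch`, its `Triple` record and the helper methods are hypothetical, and the three triples are made-up data modelled on the `ex:authoredBy/ex:name` and `^ex:authored` paths named above. It assumes Java 16 or later for records.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration; it does not use Jena's Path or PathEval.
final class PathSketch {

    record Triple(String s, String p, String o) {}

    // Forward step: objects of (node, pred, ?o).
    static List<String> forward(List<Triple> graph, String node, String pred) {
        return graph.stream()
                .filter(t -> t.s().equals(node) && t.p().equals(pred))
                .map(Triple::o)
                .collect(Collectors.toList());
    }

    // Inverse step (^pred): subjects of (?s, pred, node).
    static List<String> inverse(List<Triple> graph, String node, String pred) {
        return graph.stream()
                .filter(t -> t.o().equals(node) && t.p().equals(pred))
                .map(Triple::s)
                .collect(Collectors.toList());
    }

    // Sequence path p1/p2: follow p1, then p2 from each intermediate node.
    static List<String> sequence(List<Triple> graph, String node, String p1, String p2) {
        return forward(graph, node, p1).stream()
                .flatMap(mid -> forward(graph, mid, p2).stream())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Triple> graph = List.of(
                new Triple("ex:report1", "ex:authoredBy", "ex:alice"),
                new Triple("ex:alice", "ex:name", "Alice"),
                new Triple("ex:alice", "ex:authored", "ex:report1"));
        System.out.println(sequence(graph, "ex:report1", "ex:authoredBy", "ex:name"));
        System.out.println(inverse(graph, "ex:report1", "ex:authored"));
    }
}
```

In both cases the indexed value is not attached directly to the entity, which is why a simple predicate match is not enough for these fields.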

Tests: 8 new tests (6 in TestShaclPathSupport, 2 in TestShaclAssembler).

Closes #7

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Quote desc values containing colons to fix YAML parse error
- Add Apache License headers to Taskfile.yml and all .rq query files
  (required by RAT plugin)
- Add task stop for clean server shutdown (SIGTERM)
- Update README with stop step

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merge DockerReadme.md content into README.md covering image build,
GHCR/ACR push, and docker compose workflows. Add missing Apache
license header to .dockerignore.
- Add stats.html page with live dataset statistics from Fuseki
- Add project-overview/ interactive presentation (slide deck)
- Move mining test data into demo/test/ subfolder
- Add drillhole config (GSWA geochemistry shapes)
- Update Taskfile with drillhole serve/load and test-* tasks
- Add Stats nav tab to all pages
- Update .gitignore for new folder structure
aiworkerjohns and others added 26 commits March 3, 2026 18:49
New fuseki-loader image for bulk data operations: loads N-Quads/Turtle/NT
files into TDB2 via tdb2.tdbloader, then builds the SHACL Lucene index
via shacltextindexer. Configured via environment variables and volume
mounts, sharing volumes with the fuseki-ai server image.
Loader entrypoint now supports MODE=all|load|index to run steps
independently. Added Apache license headers to loader files. Documented
GHCR auth account requirement in CLAUDE.md.
Add SHACL bulk reindexer and offline loader Docker image
Replace the flat Map<String, List<String>> filter format in SHACL mode
with OGC CQL2-JSON as the single filter contract. This enables arbitrary
boolean trees (and/or/not), comparisons, set membership (in), range
queries (between), pattern matching (like), and spatial stubs.

Key changes:
- CQL AST model (sealed interface with records) and JSON parser
- CQL-to-Lucene compiler with pushdown/residual split
- TextIndexRegistry for named multi-index support
- CompositeTextDocProducer for multi-index change routing
- SortSpec/SortSpecParser for sort pushdown to Lucene
- luc:query and luc:facet now require indexId as first argument
- SearchExecution updated for CQL filter + sort + indexId in cache key
- TextDatasetAssembler supports text:indexes (RDF list) config
- Demo queries and app updated for new CQL2-JSON syntax
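
As a rough picture of a "sealed interface with records" AST for boolean filter trees, consider the sketch below. It is an assumption-laden illustration, not the PR's model: the type names (`Expr`, `Eq`, `And`, `Or`, `Not`), the `render` method and the property names are hypothetical, and the output is a readable string rather than a real Lucene `Query`. It assumes Java 17 or later.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical AST shape; the PR's actual type and field names are not shown.
final class CqlSketch {

    sealed interface Expr permits Eq, And, Or, Not {}
    record Eq(String property, String value) implements Expr {}
    record And(List<Expr> args) implements Expr {}
    record Or(List<Expr> args) implements Expr {}
    record Not(Expr arg) implements Expr {}

    // Walks the tree and renders a query-like string. A real compiler
    // would build Lucene query objects instead; that part is not sketched.
    static String render(Expr e) {
        if (e instanceof Eq eq) return eq.property() + ":\"" + eq.value() + "\"";
        if (e instanceof And a) return join(a.args(), " AND ");
        if (e instanceof Or o) return join(o.args(), " OR ");
        if (e instanceof Not n) return "NOT " + render(n.arg());
        throw new IllegalArgumentException("unknown node: " + e);
    }

    // Renders each child and wraps the group in parentheses.
    private static String join(List<Expr> args, String op) {
        return args.stream()
                .map(CqlSketch::render)
                .collect(Collectors.joining(op, "(", ")"));
    }

    public static void main(String[] args) {
        Expr filter = new And(List.of(
                new Eq("commodity", "Copper"),
                new Not(new Eq("state", "WA"))));
        System.out.println(render(filter));
    }
}
```

Because the interface is sealed, the set of node types is closed, so a compiler over the tree can be checked for completeness when a new operator is added.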

Closes #22, closes #24. Progress on #21, #8 (spatial stub only).

430 tests pass (0 failures).
Add text:TextIndexShacl RDF type with dedicated ShaclTextIndexAssembler,
removing SHACL branching from TextIndexLuceneAssembler. Each assembler
now handles exactly one mode — classic entityMap or SHACL shapes.
Move all SHACL entity-per-document logic (faceting, CQL filters, sort
pushdown, document building) into ShaclTextIndexLucene subclass.
TextIndexLucene now contains only classic triple-per-document code with
zero isShaclMode() branching.
Reorganise demo task runners so each dataset has consistent command
names (serve, load, app, clean) in its own Taskfile:
- demo/Taskfile.yml → mining dataset (port 3031) + shared infra
- demo/drillhole/Taskfile.yml → drillhole dataset (port 3030)
… entities

Backend: Add LatLonField type to SHACL index with LatLonShape triangle
indexing. CqlToLuceneCompiler handles s_intersects for both bbox and
GeoJSON Polygon geometries. WKT parsing supports EPSG:4326 and CRS84
axis orders.

Frontend: Add interactive bbox and polygon drawing tools on the map panel.
Polygon draw uses click-to-place vertices with Done button. Both spatial
filters integrate with CQL2-JSON query pipeline.

Demo: Scale generate.py to produce 500 entities across 14 weighted
Australian mining regions with realistic coordinates, commodity
distributions, and mixed point/polygon geometries.
CQL2-JSON filters, multi-index registry, and sort pushdown
Data-driven from tests.json — 34 test cases across 11 groups covering
full-text search, faceted filtering, spatial bbox, and combined queries.
Generates a Markdown report with embedded screenshots.
… binding (#29, #30)

Replace index ID + predicate URI arguments with Lucene field name literals.
First argument is now a field spec: "default", a field name, or a JSON array.
Single-field queries bind ?field to the searched field name as xsd:string.
Add spatial filtering with bbox and polygon support
* Return typed values and field IRIs from luc:query and luc:facet (#31, #34)

- ?field binding returns a URI (auto-generated urn:jena:lucene:index#field/{name}
  for blank node fields, or the config resource IRI for named fields)
- ?literal binding populated from Lucene stored values with type determined by
  field type: KEYWORD with URI values → IRI node, TEXT → string literal,
  numeric → typed literal. Non-URI KEYWORD values fall back to string literals.
- luc:facet ?value follows the same type mapping
- Demo data migrated from string literals to IRIs for KEYWORD fields
  (commodity, state, operator, status)
- Demo app updated to handle URI-typed facet values and field bindings

* Fix invalid Turtle syntax in demo data and add parsing tests

Use dedicated prefixes (commodity:, state:, operator:, status:) instead
of slashes in prefixed local names (ex:commodity/Copper) which is invalid
Turtle syntax. Add TestDemoDataParsing to validate demo data files parse
correctly as a regression guard.

* Change field IRI prefix to urn:jena:lucene:field#{name}

The previous prefix urn:jena:lucene:index#field/{name} caused
shortName() extraction to return "field/{name}" instead of just
"{name}" because # is parsed before /. The new scheme allows
standard URI local-name extraction to work correctly.
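
The difference between the two prefixes can be reproduced with a small local-name function. This is a sketch under an assumed rule, chosen to match the description above (the part after `#` wins, and `/` is only consulted when there is no `#`); the `FieldIriSketch` class is hypothetical, its `shortName` method is a stand-in that only borrows the name from the commit message and is not the project's implementation, and `commodity` is just an example field name.

```java
// Hypothetical illustration of the local-name extraction described above.
final class FieldIriSketch {

    // Assumed rule: take everything after the last '#'; fall back to the
    // last '/' only when the IRI contains no '#'.
    static String shortName(String iri) {
        int hash = iri.lastIndexOf('#');
        if (hash >= 0) {
            return iri.substring(hash + 1);
        }
        int slash = iri.lastIndexOf('/');
        return slash >= 0 ? iri.substring(slash + 1) : iri;
    }

    public static void main(String[] args) {
        // Old scheme: the '/' sits after the '#', so it leaks into the name.
        System.out.println(shortName("urn:jena:lucene:index#field/commodity")); // field/commodity
        // New scheme: the name is the whole fragment.
        System.out.println(shortName("urn:jena:lucene:field#commodity"));       // commodity
    }
}
```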

* Add minResults assertions to demo test cases

Update tests.json with expected minimum result counts for each test
case and IRI-encoded facet values. Playwright spec now asserts
resultCount >= minResults when specified.

* Fix race condition in search and improve error handling

Add AbortController to cancel in-flight SPARQL requests when a new
search starts, preventing stale responses from causing errors during
rapid facet toggling. Fix error catch to only show "Cannot connect"
for actual network errors, not TypeErrors from JS code bugs.

* Use CQL filter param for URL state instead of individual facet params

Replace per-field URL params (commodity=X&state=Y&bbox=...) with a
single ?filter= param containing the CQL2-JSON string. Simplifies
URL handling — buildCqlFilter serializes state to URL, parseCqlFilter
deserializes back. Also fix WKT map marker parsing (was passing
property value object instead of raw string) and remove debug logging.

* Fix Leaflet fitBounds animation error on hidden map

Disable animation on fitBounds to prevent TypeError when the map
container is not fully rendered during search updates.

* Add SPARQL editor, CQL viewer, named field IRIs, and UI improvements

- SPARQL editor popup: click log entries to open, editable endpoint,
  table/JSON result views, drag/resize, click-outside-to-close
- CQL viewer popup: collapsible JSON tree with expand/collapse all,
  object/raw toggle
- Named field IRIs: config uses named resources instead of blank nodes,
  field IRI column added to config page, raw TTL view toggle
- entityType field changed from sequence path (rdf:type rdfs:label)
  to direct rdf:type path for IRI-based faceting
- Spatial overlays set to interactive:false so map markers stay clickable
- Log renamed from "SPARQL Log" to "Log", CQL filter entries added
- Active pill color changed to steel blue
- Test dropdown URLs properly percent-encoded
- Screenshot baselines updated for all UI changes

* Hide non-facetable literal values from result card pills

Skip properties that are not facet-mapped and contain only literal
values (e.g., depth, year) from rendering as tag pills on result cards.

* Fix stats page facet breakdown with named field IRIs

Extract resolveFieldName as shared utility function and use it in
statsApp to resolve field URIs back to field names.

* Show short names for facet values on stats page

* Add short/full name toggle on stats page facet breakdown

* Rename stats toggle buttons to Short Name / Full Name

* Use field IRIs instead of field names in luc:query and luc:facet APIs

Resolve field IRIs to Lucene field names in ShaclTextIndexLucene so
that queries and facet requests can use IRIs as field identifiers.
ShaclIndexMapping.findField() now accepts both field names and IRIs,
matching by local name for IRI lookups.

* Update docs for field IRIs, CQL2-JSON filters, spatial, and paths

- Move inverse/sequence paths and spatial filtering from Proposed to Done
- Update all filter examples from old JSON format to CQL2-JSON syntax
- Add named field resource config examples showing field IRI preservation
- Add sequence and inverse path examples to configuration reference
- Add LatLonField to field types table with link to spatial docs
- Add forward chaining replacement section to use cases
- Add field IRI and spatial entries to feature status table and roadmap
- Update Classic vs SHACL comparison table

* Remove classic mode references from documentation

Focus docs on SHACL mode only. Classic mode (text:entityMap / text:query)
is upstream Jena — a brief note points readers to Apache Jena docs.
Removes duplicate config examples, comparison tables, and dual-mode
framing that added noise without value.

* Enforce field IRIs as sole identifier in SPARQL APIs and docs

findField() now matches exact field IRIs only; findFieldByName()
added for internal Lucene field name lookups. All external-facing
APIs (luc:query field specs, luc:facet facet arrays, CQL2-JSON
property values, sort specs) require field IRIs. Demo config
migrated to absolute PREFIX field: <urn:jena:lucene:field#>.
All documentation updated to use field IRIs consistently.
* refactor(demo): unify on port 3030 and simplify Docker workflow

- Update Taskfile and config to use port 3030 for all mining tasks
- Switch Docker build to multi-stage in-container Maven builds
- Add docker-start-ghcr and docker-serve-ghcr for remote images
- Improve Taskfile dependency management with task dependencies
- Update documentation and add .dockerignore for cleaner builds

* Add GitHub Actions workflow for Docker image builds

Builds and pushes fuseki-ai and fuseki-loader images to GHCR on
push to main and on manual dispatch. Builds multi-arch (amd64 +
arm64) with GHA build cache. Uses GITHUB_TOKEN for auth.

* Make loader image extend fuseki-ai instead of duplicating build

Loader Dockerfile now uses fuseki-ai as base image and just adds
entrypoint.sh. Eliminates duplicate Maven build — loader-build
depends on image-build and layers on top. Updated GH Actions
workflow to pass base image args. Updated entrypoint.sh jar path
to match fuseki image layout.

---------

Co-authored-by: aiworkerjohns <aiworker.johns@gmail.com>
* Support wildcard ["*"] in luc:facet to request all facetable fields

When ["*"] is passed as the facet field list, resolveFacetFieldNames()
expands it to all configured facet fields, avoiding the need for
clients to enumerate every field IRI explicitly.
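
The expansion can be sketched as follows. Only the method name `resolveFacetFieldNames` comes from the commit message; the signature, the `FacetWildcardSketch` class, the rule that the wildcard must be the only list element, and the sample field IRIs are assumptions made for illustration.

```java
import java.util.List;

// Hypothetical illustration; the signature is assumed, not taken from the PR.
final class FacetWildcardSketch {

    // If the request is exactly ["*"], return every configured facet field;
    // otherwise return the requested fields unchanged.
    static List<String> resolveFacetFieldNames(List<String> requested,
                                               List<String> configuredFacetFields) {
        if (requested.size() == 1 && "*".equals(requested.get(0))) {
            return List.copyOf(configuredFacetFields);
        }
        return requested;
    }

    public static void main(String[] args) {
        List<String> configured = List.of(
                "urn:jena:lucene:field#commodity",
                "urn:jena:lucene:field#state");
        System.out.println(resolveFacetFieldNames(List.of("*"), configured));
        System.out.println(resolveFacetFieldNames(
                List.of("urn:jena:lucene:field#state"), configured));
    }
}
```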

* Add integration tests for demo mining scenarios

22 tests covering multi-entity-type indexing, facet wildcard,
CQL filters, sequence paths, and combined query+facet patterns
modelled on the demo mining dataset.

---------

Co-authored-by: aiworkerjohns <aiworker.johns@gmail.com>
The loader Docker build needs demo/loader/entrypoint.sh but it was
excluded by the blanket demo/ ignore rule.
- Make FUSEKI_PORT and APP_PORT configurable (defaults: 3030, 8000)
- Add serve_app.py reverse proxy for CORS-free frontend development
- Add task refresh: one-command clean/restart/reload/serve workflow
- Update task app to use proxy instead of simple HTTP server
* Add per-field query analyzer support (idx:queryAnalyzer)

Allow fields to specify separate analyzers for indexing and querying.
This enables patterns like edge n-gram indexing with keyword querying
for prefix/typeahead search on identifier fields.
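
Why the two analyzers differ can be shown without Lucene. The sketch below is a plain-Java approximation, not the `text:EdgeNGramAnalyzer` implementation: `EdgeNGramSketch`, its methods, the gram bounds of 1 to 10 and the sample identifier are hypothetical. Index time emits every leading prefix of the lowercased value; query time keeps the typed text as a single lowercased token.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; it does not call Lucene.
final class EdgeNGramSketch {

    // Index-time analysis: lowercase, then emit each leading prefix whose
    // length is between minGram and maxGram (capped at the term length).
    static Set<String> indexTokens(String value, int minGram, int maxGram) {
        String term = value.toLowerCase(Locale.ROOT);
        Set<String> tokens = new LinkedHashSet<>();
        int upper = Math.min(maxGram, term.length());
        for (int len = minGram; len <= upper; len++) {
            tokens.add(term.substring(0, len));
        }
        return tokens;
    }

    // Query-time analysis: keyword style, the whole input as one token.
    static String queryToken(String typed) {
        return typed.toLowerCase(Locale.ROOT);
    }

    // A stored value matches when the single query token is one of its prefixes.
    static boolean matches(String storedValue, String typed) {
        return indexTokens(storedValue, 1, 10).contains(queryToken(typed));
    }

    public static void main(String[] args) {
        System.out.println(indexTokens("BH-0042", 1, 10));
        System.out.println(matches("BH-0042", "bh-0"));  // true: a leading prefix
        System.out.println(matches("BH-0042", "0042"));  // false: not a prefix
    }
}
```

If the query side also produced n-grams, a typed `bx` would emit the token `b` and match this identifier, which is presumably the kind of false positive a separate query analyzer avoids.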

* Add EdgeNGramAnalyzer and demo identifier field with prefix search

- Add text:EdgeNGramAnalyzer assembler type for prefix/typeahead indexing
- Add ex:identifier field to demo config with idx:queryAnalyzer example
- Add identifier data to all mining.ttl entities (sites, boreholes, reports)
- Add assembler test verifying idx:queryAnalyzer wiring

* Improve identifier prefix search demo

* Fix review issues and add coverage tests

- Fix LowerCaseFilter import (Lucene 10 compatibility)
- Add log.warn when global text:queryAnalyzer shadows per-field overrides
- Revert default ports to 3030/8000 (keep parameterization)
- Add 4 tests: normal field alongside prefix field, field isolation,
  FieldDef without queryAnalyzer fallback, EntityDefinition wiring

---------

Co-authored-by: aiworkerjohns <aiworker.johns@gmail.com>
@recalcitrantsupplant
Author

apologies, wrong branch again

@afs
Member

afs commented Mar 31, 2026

Would you mind unlinking your repo so that it isn't a fork? You can still have Jena as an upstream repo and pull changes.

@recalcitrantsupplant
Author

the repo has been unlinked now
