Return matching raw values for multi-valued luc:query fields by recalcitrantsupplant · Pull Request #3823 · apache/jena

recalcitrantsupplant · 2026-03-31T01:16:43Z

Summary

- Fix luc:query raw-value binding for multi-valued SHACL fields
- Choose the stored value that actually matches the Lucene query instead of always returning the first stored value
- Add a regression test covering multi-valued identifier fields
- Add a demo query and sample data to reproduce the behavior manually
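
The selection rule in the summary can be sketched in isolation. This is only an illustration, not the code changed by this PR: `RawValueSelector`, `pickMatching`, and the predicate standing in for "matches the Lucene query" are hypothetical names, and the fallback to the first stored value when nothing matches is an assumption.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.function.Predicate;

/** Hypothetical helper; not a class from jena-text or from this PR. */
final class RawValueSelector {

    /**
     * Returns the first stored value accepted by matchesQuery.
     * Assumed fallback: the first stored value when nothing matches,
     * or an empty Optional when the field has no stored values.
     */
    static Optional<String> pickMatching(List<String> storedValues,
                                         Predicate<String> matchesQuery) {
        for (String value : storedValues) {
            if (matchesQuery.test(value)) {
                return Optional.of(value);
            }
        }
        return storedValues.stream().findFirst();
    }

    public static void main(String[] args) {
        // A multi-valued identifier field on a single indexed entity.
        List<String> identifiers = List.of("BH-001", "ALT-77");
        // Stand-in for the real Lucene match: a case-insensitive prefix test.
        Predicate<String> query =
                v -> v.toLowerCase(Locale.ROOT).startsWith("alt");
        System.out.println(pickMatching(identifiers, query).orElse("<none>"));
    }
}
```

With the old behaviour described in the summary, the same query would have bound `BH-001` because it is stored first; the sketch binds `ALT-77` instead.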

Testing

- Added TestShaclLucQueryRawValueOnMultiValuedField
- Refreshed the demo dataset and verified demo/test/queries/09-matchraw-multivalue.rq

…d comprehensive tests
- Add FacetValue.java: Immutable class representing facet value/count pairs
- Add FacetedTextResults.java: Container for search results with faceting data
- Add TestFacetedResults.java: Comprehensive test suite (6 test methods)
- Add faceting_methods.txt: Implementation for queryWithFacets$ method
This adds faceting capability to Lucene text indexing in Apache Jena. Code is production-ready but requires build fixes to run tests.

…o jena-text module
- Deleted BUILD_FIX_GUIDE.md, PROJECT_STATUS.md, and PROJECT_TESTING.md as they are no longer needed.
- Added dependency for lucene-facet in pom.xml to enable native faceting capabilities.
- Enhanced TextIndexConfig and TextIndexLucene classes to support faceting, including methods for retrieving facet counts.
- Updated TextQuery to register a new property function for facet counts.
This commit streamlines the documentation and integrates faceting functionality into the jena-text module.

…xt module
- Updated FEAT_FACETS_OUTPUT.md to reflect the new test results and added details about filtered facets.
- Enhanced FEAT_FACETS_SPEC.md to include specifications for filtered facets in the text:facetCounts function.
- Revised FEAT_FACETS_TESTING.md to add tests for filtered facets, including examples for multi-word and single-word queries.
- Modified TextFacetCountsPF.java and TextIndexLucene.java to support detection of search queries in facet counts.
- Updated test cases in TestTextFacetCountsPF.java to validate the new filtered facets functionality.
This commit improves the documentation and testing framework for the newly implemented filtered facets feature, ensuring clarity and comprehensive coverage.

- Replaced the `text:queryWithFacets` and `text:facetCounts` functions with `text:query` and `text:facet`, streamlining the API for clarity and usability.
- Updated documentation in FEAT_FACETS_SPEC.md and FEAT_FACETS_TESTING.md to reflect the new API structure and usage examples.
- Improved the implementation in TextIndexLucene and related classes to support the new faceting methods.
- Removed obsolete classes and methods related to the previous faceting implementation.
This commit enhances the faceting capabilities in the jena-text module, ensuring a more intuitive API and comprehensive documentation.

- Introduced new methods for faceting in the jena-text module, replacing outdated functions for better clarity.
- Updated FEAT_FACETS_SPEC.md and FEAT_FACETS_TESTING.md to include new usage examples and specifications.
- Improved implementation in TextIndexLucene to align with the updated API.
- Removed deprecated classes and methods to streamline the codebase.
This commit refines the faceting capabilities, ensuring a more intuitive API and comprehensive documentation for users.

…mentation
- Introduced a new section in PHASE2_DESIGN.md detailing how to construct JSON filter arguments dynamically with Jena's Composite Datatype (CDT) extension.
- Provided a SPARQL example demonstrating the creation of a JSON filter from `VALUES` clauses, enhancing user understanding of programmatic filter construction.
- Clarified that the CDT `FOLD` function is a Jena extension, emphasizing its utility for users seeking to build filters without hardcoding values.
This update improves the documentation by offering practical usage patterns for dynamic JSON filter creation in SPARQL queries.

- Deleted outdated documentation files related to previous implementations, including `2026-01-23-david-review.md`, `2026-01-27-david-recommendation.md`, `2026-02-09-next-steps.md`, and others.
- Consolidated and updated the faceting API, replacing `text:queryWithFacets` and `text:facetCounts` with `text:query` and `text:facet` for improved clarity and usability.
- Enhanced documentation to reflect the new API structure and added examples for the updated faceting functionality.
- Ensured that the implementation aligns with the latest design decisions and user requirements.
This commit streamlines the documentation and finalizes the faceting API changes, enhancing usability and clarity for users.

David/review

- Revised user guide to clarify the differences between Classic and SHACL modes, including updated examples for `text:query` and `luc:query`.
- Enhanced SPARQL API reference to include detailed syntax and examples for `luc:query` and `luc:facet`, reflecting the new functionality.
- Improved configuration documentation to outline properties for both indexing modes, emphasizing the use of `text:shapes` and `text:entityMap`.
- Added sections on deploying with Fuseki and testing, ensuring comprehensive guidance for users.
This commit enhances the clarity and usability of the documentation, aligning it with recent API changes and user needs.

- Updated the Lucene Classic Query Parser documentation link to the latest version.
- Corrected the number of pre-existing tests from 327 to 303 in the testing documentation, ensuring accuracy in test coverage reporting.
These changes enhance the clarity and accuracy of the documentation, reflecting recent updates and maintaining consistency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- New docs/08-use-cases.md: building-block format showing each feature with mermaid diagrams and practical application examples
- Reorder proposed features: inverse/sequence paths, spatial, then deferrable group (DrillSideways, hierarchical, range, grouping, suggest)
- Add inverse/sequence paths to feature status table and timeline
- Group deferrable extensions in future work with API impact notes
- Fix diagram readability: transparent subgraph backgrounds, theme-aware text
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Explicit instruction to only create issues/PRs on the fork (aiworkerjohns/jena), never on upstream (apache/jena). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Keeps README high-level with component overview and roadmap. Detailed sequence and flowchart diagrams now live in 04-architecture.md where they complement the existing technical content. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

update example

When luc:query and luc:facet appear in the same BGP, SPARQL cross-joins the results (N×M rows). Updated docs to recommend separate queries or UNION, and added a design decision documenting the analysis of handle-based alternatives. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Closes #15. Includes Fuseki SHACL-mode config with three shapes (MiningReport, Borehole, Site), 21 hand-crafted entities, 5 demo queries covering luc:query and luc:facet patterns, a Python synthetic data generator, and a go-task Taskfile for build/serve/load/query workflows. All queries verified end-to-end against a running Fuseki instance. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Quote desc values containing colons to fix YAML parse error
- Add Apache License headers to Taskfile.yml and all .rq query files (required by RAT plugin)
- Add task stop for clean server shutdown (SIGTERM)
- Add git commit style rule to CLAUDE.md

Add Dockerfile, docker-compose.yml, and Taskfile tasks for building and pushing the demo image to GitHub CR or Azure CR. Includes DockerReadme.md documenting the full workflow.

Extend SHACL entity-per-document indexing to support complex SHACL property paths (sequence, inverse, alternative) in addition to simple predicate paths. This enables indexing values that are reachable via multi-hop traversals or reverse relationships in the RDF graph.
Implementation:
- ShaclIndexAssembler: parse full SHACL path syntax into jena-arq Path objects, extract leaf predicates for change listener compatibility
- ShaclTextDocProducer: use PathEval for complex paths, keep fast direct triple match for simple predicates
- ShaclIndexMapping: add Path field to FieldDef with backward-compatible constructor
Demo: add authors with sequence path (ex:authoredBy/ex:name) and inverse path (^ex:authored) fields, new queries, and README with expected results.
Tests: 8 new tests (6 in TestShaclPathSupport, 2 in TestShaclAssembler).
Closes #7
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Quote desc values containing colons to fix YAML parse error
- Add Apache License headers to Taskfile.yml and all .rq query files (required by RAT plugin)
- Add task stop for clean server shutdown (SIGTERM)
- Update README with stop step
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge DockerReadme.md content into README.md covering image build, GHCR/ACR push, and docker compose workflows. Add missing Apache license header to .dockerignore.

- Add stats.html page with live dataset statistics from Fuseki
- Add project-overview/ interactive presentation (slide deck)
- Move mining test data into demo/test/ subfolder
- Add drillhole config (GSWA geochemistry shapes)
- Update Taskfile with drillhole serve/load and test-* tasks
- Add Stats nav tab to all pages
- Update .gitignore for new folder structure

New fuseki-loader image for bulk data operations: loads N-Quads/Turtle/NT files into TDB2 via tdb2.tdbloader, then builds the SHACL Lucene index via shacltextindexer. Configured via environment variables and volume mounts, sharing volumes with the fuseki-ai server image.

Loader entrypoint now supports MODE=all|load|index to run steps independently. Added Apache license headers to loader files. Documented GHCR auth account requirement in CLAUDE.md.

Add SHACL bulk reindexer and offline loader Docker image

Replace the flat Map<String, List<String>> filter format in SHACL mode with OGC CQL2-JSON as the single filter contract. This enables arbitrary boolean trees (and/or/not), comparisons, set membership (in), range queries (between), pattern matching (like), and spatial stubs.
Key changes:
- CQL AST model (sealed interface with records) and JSON parser
- CQL-to-Lucene compiler with pushdown/residual split
- TextIndexRegistry for named multi-index support
- CompositeTextDocProducer for multi-index change routing
- SortSpec/SortSpecParser for sort pushdown to Lucene
- luc:query and luc:facet now require indexId as first argument
- SearchExecution updated for CQL filter + sort + indexId in cache key
- TextDatasetAssembler supports text:indexes (RDF list) config
- Demo queries and app updated for new CQL2-JSON syntax
Closes #22, closes #24. Progress on #21, #8 (spatial stub only). 430 tests pass (0 failures).
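
To make the "sealed interface with records" shape of a boolean filter tree concrete, here is a much-reduced sketch. It is not the AST from this commit: `CqlNode` and its record names are hypothetical, only `and`/`or`/`not`, equality and `in` are modelled, and the tree is evaluated in memory against a map of single-valued fields rather than compiled to a Lucene query.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical, reduced CQL2-style AST; names are not taken from the PR. */
sealed interface CqlNode {
    record And(List<CqlNode> args) implements CqlNode {}
    record Or(List<CqlNode> args) implements CqlNode {}
    record Not(CqlNode arg) implements CqlNode {}
    record Eq(String property, String value) implements CqlNode {}
    record In(String property, List<String> values) implements CqlNode {}

    /** Evaluates the tree against one document's single-valued fields. */
    static boolean eval(CqlNode node, Map<String, String> doc) {
        if (node instanceof And and) {
            return and.args().stream().allMatch(n -> eval(n, doc));
        } else if (node instanceof Or or) {
            return or.args().stream().anyMatch(n -> eval(n, doc));
        } else if (node instanceof Not not) {
            return !eval(not.arg(), doc);
        } else if (node instanceof Eq eq) {
            return eq.value().equals(doc.get(eq.property()));
        } else if (node instanceof In in) {
            // Guard against a missing field: immutable lists reject null lookups.
            String actual = doc.get(in.property());
            return actual != null && in.values().contains(actual);
        }
        throw new IllegalStateException("Unhandled node type: " + node);
    }
}
```

For example, `new CqlNode.And(List.of(new CqlNode.Eq("state", "WA"), new CqlNode.Not(new CqlNode.In("commodity", List.of("Gold")))))` accepts a document with `state=WA` and `commodity=Copper` and rejects one with `commodity=Gold`. The sketch needs Java 17 or later for sealed types and records.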

Add text:TextIndexShacl RDF type with dedicated ShaclTextIndexAssembler, removing SHACL branching from TextIndexLuceneAssembler. Each assembler now handles exactly one mode — classic entityMap or SHACL shapes.

Move all SHACL entity-per-document logic (faceting, CQL filters, sort pushdown, document building) into ShaclTextIndexLucene subclass. TextIndexLucene now contains only classic triple-per-document code with zero isShaclMode() branching.

Reorganise demo task runners so each dataset has consistent command names (serve, load, app, clean) in its own Taskfile:
- demo/Taskfile.yml → mining dataset (port 3031) + shared infra
- demo/drillhole/Taskfile.yml → drillhole dataset (port 3030)

… entities
Backend: Add LatLonField type to SHACL index with LatLonShape triangle indexing. CqlToLuceneCompiler handles s_intersects for both bbox and GeoJSON Polygon geometries. WKT parsing supports EPSG:4326 and CRS84 axis orders.
Frontend: Add interactive bbox and polygon drawing tools on the map panel. Polygon draw uses click-to-place vertices with Done button. Both spatial filters integrate with CQL2-JSON query pipeline.
Demo: Scale generate.py to produce 500 entities across 14 weighted Australian mining regions with realistic coordinates, commodity distributions, and mixed point/polygon geometries.

CQL2-JSON filters, multi-index registry, and sort pushdown

Data-driven from tests.json — 34 test cases across 11 groups covering full-text search, faceted filtering, spatial bbox, and combined queries. Generates a Markdown report with embedded screenshots.

… binding (#29, #30) Replace index ID + predicate URI arguments with Lucene field name literals. First argument is now a field spec: "default", a field name, or a JSON array. Single-field queries bind ?field to the searched field name as xsd:string.

Add spatial filtering with bbox and polygon support

* Return typed values and field IRIs from luc:query and luc:facet (#31, #34)
  - ?field binding returns a URI (auto-generated urn:jena:lucene:index#field/{name} for blank node fields, or the config resource IRI for named fields)
  - ?literal binding populated from Lucene stored values with type determined by field type: KEYWORD with URI values → IRI node, TEXT → string literal, numeric → typed literal. Non-URI KEYWORD values fall back to string literals.
  - luc:facet ?value follows the same type mapping
  - Demo data migrated from string literals to IRIs for KEYWORD fields (commodity, state, operator, status)
  - Demo app updated to handle URI-typed facet values and field bindings
* Fix invalid Turtle syntax in demo data and add parsing tests
  Use dedicated prefixes (commodity:, state:, operator:, status:) instead of slashes in prefixed local names (ex:commodity/Copper) which is invalid Turtle syntax. Add TestDemoDataParsing to validate demo data files parse correctly as a regression guard.
* Change field IRI prefix to urn:jena:lucene:field#{name}
  The previous prefix urn:jena:lucene:index#field/{name} caused shortName() extraction to return "field/{name}" instead of just "{name}" because # is parsed before /. The new scheme allows standard URI local-name extraction to work correctly.
* Add minResults assertions to demo test cases
  Update tests.json with expected minimum result counts for each test case and IRI-encoded facet values. Playwright spec now asserts resultCount >= minResults when specified.
* Fix race condition in search and improve error handling
  Add AbortController to cancel in-flight SPARQL requests when a new search starts, preventing stale responses from causing errors during rapid facet toggling. Fix error catch to only show "Cannot connect" for actual network errors, not TypeErrors from JS code bugs.
* Use CQL filter param for URL state instead of individual facet params
  Replace per-field URL params (commodity=X&state=Y&bbox=...) with a single ?filter= param containing the CQL2-JSON string. Simplifies URL handling: buildCqlFilter serializes state to URL, parseCqlFilter deserializes back. Also fix WKT map marker parsing (was passing property value object instead of raw string) and remove debug logging.
* Fix Leaflet fitBounds animation error on hidden map
  Disable animation on fitBounds to prevent TypeError when the map container is not fully rendered during search updates.
* Add SPARQL editor, CQL viewer, named field IRIs, and UI improvements
  - SPARQL editor popup: click log entries to open, editable endpoint, table/JSON result views, drag/resize, click-outside-to-close
  - CQL viewer popup: collapsible JSON tree with expand/collapse all, object/raw toggle
  - Named field IRIs: config uses named resources instead of blank nodes, field IRI column added to config page, raw TTL view toggle
  - entityType field changed from sequence path (rdf:type rdfs:label) to direct rdf:type path for IRI-based faceting
  - Spatial overlays set to interactive:false so map markers stay clickable
  - Log renamed from "SPARQL Log" to "Log", CQL filter entries added
  - Active pill color changed to steel blue
  - Test dropdown URLs properly percent-encoded
  - Screenshot baselines updated for all UI changes
* Hide non-facetable literal values from result card pills
  Skip properties that are not facet-mapped and contain only literal values (e.g., depth, year) from rendering as tag pills on result cards.
* Fix stats page facet breakdown with named field IRIs
  Extract resolveFieldName as shared utility function and use it in statsApp to resolve field URIs back to field names.
* Show short names for facet values on stats page
* Add short/full name toggle on stats page facet breakdown
* Rename stats toggle buttons to Short Name / Full Name
* Use field IRIs instead of field names in luc:query and luc:facet APIs
  Resolve field IRIs to Lucene field names in ShaclTextIndexLucene so that queries and facet requests can use IRIs as field identifiers. ShaclIndexMapping.findField() now accepts both field names and IRIs, matching by local name for IRI lookups.
* Update docs for field IRIs, CQL2-JSON filters, spatial, and paths
  - Move inverse/sequence paths and spatial filtering from Proposed to Done
  - Update all filter examples from old JSON format to CQL2-JSON syntax
  - Add named field resource config examples showing field IRI preservation
  - Add sequence and inverse path examples to configuration reference
  - Add LatLonField to field types table with link to spatial docs
  - Add forward chaining replacement section to use cases
  - Add field IRI and spatial entries to feature status table and roadmap
  - Update Classic vs SHACL comparison table
* Remove classic mode references from documentation
  Focus docs on SHACL mode only. Classic mode (text:entityMap / text:query) is upstream Jena; a brief note points readers to Apache Jena docs. Removes duplicate config examples, comparison tables, and dual-mode framing that added noise without value.
* Enforce field IRIs as sole identifier in SPARQL APIs and docs
  findField() now matches exact field IRIs only; findFieldByName() added for internal Lucene field name lookups. All external-facing APIs (luc:query field specs, luc:facet facet arrays, CQL2-JSON property values, sort specs) require field IRIs. Demo config migrated to absolute PREFIX field: <urn:jena:lucene:field#>. All documentation updated to use field IRIs consistently.
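
The field IRI prefix change above is easier to follow with a small local-name splitter. This is a stand-in and not the project's shortName() implementation: `FieldIriLocalName.localName` is a hypothetical helper that assumes, as the commit message describes, that the split happens at `#` first and at the last `/` only when there is no `#`.

```java
/** Hypothetical helper; mimics a "split at # first, then /" local-name rule. */
final class FieldIriLocalName {

    static String localName(String iri) {
        int hash = iri.lastIndexOf('#');
        if (hash >= 0) {
            // Everything after '#' is the local name, slashes included.
            return iri.substring(hash + 1);
        }
        int slash = iri.lastIndexOf('/');
        return slash >= 0 ? iri.substring(slash + 1) : iri;
    }
}
```

Under that rule, `localName("urn:jena:lucene:index#field/identifier")` yields `field/identifier`, while the new scheme `localName("urn:jena:lucene:field#identifier")` yields `identifier`.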

* refactor(demo): unify on port 3030 and simplify Docker workflow
  - Update Taskfile and config to use port 3030 for all mining tasks
  - Switch Docker build to multi-stage in-container Maven builds
  - Add docker-start-ghcr and docker-serve-ghcr for remote images
  - Improve Taskfile dependency management with task dependencies
  - Update documentation and add .dockerignore for cleaner builds
* Add GitHub Actions workflow for Docker image builds
  Builds and pushes fuseki-ai and fuseki-loader images to GHCR on push to main and on manual dispatch. Builds multi-arch (amd64 + arm64) with GHA build cache. Uses GITHUB_TOKEN for auth.
* Make loader image extend fuseki-ai instead of duplicating build
  Loader Dockerfile now uses fuseki-ai as base image and just adds entrypoint.sh. Eliminates duplicate Maven build: loader-build depends on image-build and layers on top. Updated GH Actions workflow to pass base image args. Updated entrypoint.sh jar path to match fuseki image layout.
---------
Co-authored-by: aiworkerjohns <aiworker.johns@gmail.com>

* Support wildcard ["*"] in luc:facet to request all facetable fields
  When ["*"] is passed as the facet field list, resolveFacetFieldNames() expands it to all configured facet fields, avoiding the need for clients to enumerate every field IRI explicitly.
* Add integration tests for demo mining scenarios
  22 tests covering multi-entity-type indexing, facet wildcard, CQL filters, sequence paths, and combined query+facet patterns modelled on the demo mining dataset.
---------
Co-authored-by: aiworkerjohns <aiworker.johns@gmail.com>
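
The wildcard expansion can be sketched as follows. This does not reproduce resolveFacetFieldNames(): `FacetWildcard.expand` is a hypothetical helper, and treating `*` as a wildcard only when it is the sole entry of the list is an assumption.

```java
import java.util.List;

/** Hypothetical helper illustrating ["*"] expansion for facet requests. */
final class FacetWildcard {

    /**
     * Returns every configured facet field when the request is exactly ["*"],
     * otherwise returns the requested fields unchanged.
     */
    static List<String> expand(List<String> requested,
                               List<String> configuredFacetFields) {
        if (requested.size() == 1 && "*".equals(requested.get(0))) {
            return List.copyOf(configuredFacetFields);
        }
        return List.copyOf(requested);
    }
}
```

A client can then send `["*"]` instead of listing every field IRI, for example `urn:jena:lucene:field#commodity` and `urn:jena:lucene:field#state` in the demo config.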

The loader Docker build needs demo/loader/entrypoint.sh but it was excluded by the blanket demo/ ignore rule.

- Make FUSEKI_PORT and APP_PORT configurable (defaults: 3030, 8000)
- Add serve_app.py reverse proxy for CORS-free frontend development
- Add task refresh: one-command clean/restart/reload/serve workflow
- Update task app to use proxy instead of simple HTTP server

* Add per-field query analyzer support (idx:queryAnalyzer)
  Allow fields to specify separate analyzers for indexing and querying. This enables patterns like edge n-gram indexing with keyword querying for prefix/typeahead search on identifier fields.
* Add EdgeNGramAnalyzer and demo identifier field with prefix search
  - Add text:EdgeNGramAnalyzer assembler type for prefix/typeahead indexing
  - Add ex:identifier field to demo config with idx:queryAnalyzer example
  - Add identifier data to all mining.ttl entities (sites, boreholes, reports)
  - Add assembler test verifying idx:queryAnalyzer wiring
* Improve identifier prefix search demo
* Fix review issues and add coverage tests
  - Fix LowerCaseFilter import (Lucene 10 compatibility)
  - Add log.warn when global text:queryAnalyzer shadows per-field overrides
  - Revert default ports to 3030/8000 (keep parameterization)
  - Add 4 tests: normal field alongside prefix field, field isolation, FieldDef without queryAnalyzer fallback, EntityDefinition wiring
---------
Co-authored-by: aiworkerjohns <aiworker.johns@gmail.com>
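
Why separate index-time and query-time analysis gives prefix search can be shown without Lucene. The sketch below is a plain-Java approximation, not text:EdgeNGramAnalyzer: `EdgeNGramSketch` is a hypothetical class, and the gram bounds (1 to 20) and lower-casing are assumptions. The index side emits every leading prefix of the value; the query side keeps the typed text as one keyword term.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for edge n-gram indexing with keyword querying. */
final class EdgeNGramSketch {

    /** Index side: lower-cased leading prefixes between minGram and maxGram chars. */
    static Set<String> indexTerms(String value, int minGram, int maxGram) {
        String lower = value.toLowerCase(Locale.ROOT);
        Set<String> terms = new LinkedHashSet<>();
        int upper = Math.min(maxGram, lower.length());
        for (int len = minGram; len <= upper; len++) {
            terms.add(lower.substring(0, len));
        }
        return terms;
    }

    /** Query side: the whole input as a single lower-cased term, no n-grams. */
    static String queryTerm(String typed) {
        return typed.toLowerCase(Locale.ROOT);
    }

    /** A stored value matches when one of its index terms equals the query term. */
    static boolean matches(String storedValue, String typed) {
        return indexTerms(storedValue, 1, 20).contains(queryTerm(typed));
    }
}
```

Here `matches("BH-0042", "bh-0")` is true because `bh-0` is one of the indexed prefixes, while `matches("BH-0042", "0042")` is false because only leading prefixes are indexed. If the query text were also split into n-grams, a one-character gram such as `b` would match almost every identifier, which is the reason for using a different analyzer at query time.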

recalcitrantsupplant · 2026-03-31T01:17:15Z

apologies, wrong branch again

afs · 2026-03-31T07:56:26Z

Would you mind unlinking your repo so that it isn't a fork? You can still have Jena as an upstream repo and pull changes.

recalcitrantsupplant · 2026-04-01T07:06:53Z

the repo has been unlinked now

aiworkerjohns and others added 30 commits January 15, 2026 13:11

Add comprehensive build fix guide for faceting implementation testing

223ac9d

add project status

268ab0a

Add completion of initial faceting functionality

529445b

review

bbbf72a

review

bae039d

Merge branch 'apache:main' into main

c8227b7

Merge pull request #1 from aiworkerjohns/david/review

c68e05d


Merge branch 'apache:main' into main

92cd78f

Add change summary to docs README and CLAUDE.md for repo guidance

bbb452e


Add repository safeguard to CLAUDE.md

b76b7bf


Update 08-use-cases.md

7a65b95


Add Docker build, push, and compose for demo Fuseki image

0ece5c0


Consolidate Docker docs into README, fix license header

9040a63


aiworkerjohns and others added 26 commits March 3, 2026 18:49

Add MODE support to loader and document GHCR auth requirements

8233b33


Merge pull request #25 from aiworkerjohns/bulk-reindexer

b317876


Log warning when CQL filter has non-pushable residual expressions

3bbc31a

Separate SHACL assembler from classic TextIndexLucene assembler

99e4df2


Extract ShaclTextIndexLucene from TextIndexLucene

b0ad512


Merge pull request #27 from aiworkerjohns/cql-faceting

8f19425


Add Playwright screenshot tests for mining demo

7a55762


Add Playwright screenshot test results (34 test cases)

4f6b4ec

Update SPARQL API docs for field spec syntax and ?field binding

6d4f376

Keep map square on resize with aspect-ratio instead of fixed height

dc8d0ed

Update screenshot baselines after map aspect-ratio change

5a2f61a

Build Docker images for linux/amd64 platform

e8f1e8b

Merge pull request #33 from aiworkerjohns/spatial-filtering

5d070ed


fix: whitelist loader entrypoint in .dockerignore (#41)

766707e


Return a matching raw value for multi-valued luc:query fields

2d1122d

Add multi-value identifier demo query

92e5dee

recalcitrantsupplant closed this Mar 31, 2026

recalcitrantsupplant wants to merge 64 commits into apache:main from aiworkerjohns:fix/lucene-matchraw-multivalue


5 participants
