feat: Add New core integration tests #1194

Draft
edwinjosechittilappilly wants to merge 11 commits into main from new-core-integration-tests

@edwinjosechittilappilly (Collaborator)

This pull request adds comprehensive integration test sample data and utilities to support format ingestion testing. It introduces a script to generate minimal sample files for a variety of document formats (both binary and text), commits these sample files to the repository, and provides new shared helpers for integration tests. These changes enable robust, automated testing of document ingestion pipelines across multiple formats.

Test Sample Data Generation and Files:

  • Added create_samples.py script to tests/data/ that generates minimal, valid sample files for binary formats (PDF, DOCX, XLSX, PPTX) and writes standard sample files for text formats (Markdown, AsciiDoc, LaTeX, HTML, XHTML, CSV). This script uses only the Python standard library and ensures consistent, up-to-date test data.
  • Committed generated sample files for the following formats in tests/data/samples/:
    • AsciiDoc (sample.adoc)
    • Markdown (sample.md)
    • LaTeX (sample.tex)
    • HTML (sample.html)
    • XHTML (sample.xhtml)
    • CSV (sample.csv)
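
As an illustration of how a binary sample can be produced with only the standard library, here is a minimal sketch that writes a DOCX package with `zipfile`. It is not the code from create_samples.py; the function name `make_minimal_docx` and the exact set of parts are assumptions chosen to keep the file as small as possible while remaining a well-formed OOXML package.

```python
import io
import zipfile
from xml.sax.saxutils import escape

# Declares the content type of every part in the package.
CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

# Points the package root at the main document part.
ROOT_RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships '
    'xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" '
    'Target="word/document.xml"/>'
    '</Relationships>'
)


def make_minimal_docx(text: str) -> bytes:
    """Return the bytes of a one-paragraph DOCX file (hypothetical helper)."""
    document = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        '<w:document xmlns:w='
        '"http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        f'<w:body><w:p><w:r><w:t>{escape(text)}</w:t></w:r></w:p></w:body>'
        '</w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", CONTENT_TYPES)
        archive.writestr("_rels/.rels", ROOT_RELS)
        archive.writestr("word/document.xml", document)
    return buffer.getvalue()
```

Returning bytes rather than writing to disk keeps the generator easy to test; the caller decides where the file goes, for example `Path("sample.docx").write_bytes(make_minimal_docx("Hello"))`.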

Integration Test Utilities:

  • Added helpers.py in tests/integration/core/ with shared async helpers for integration tests, including:
    • boot_app: Boots a fresh in-process FastAPI app with configurable settings and index cleanup.
    • wait_for_task_completion: Polls for async task completion.
    • wait_for_indexed: Waits until a search query returns results.
    • is_docling_available: Checks if the docling-serve service is reachable (required for binary format tests).
      These utilities improve test reliability and reduce boilerplate in integration tests.
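
The polling helpers above share one pattern: call something repeatedly until it reports a terminal state or a deadline passes. The sketch below shows that pattern in isolation. It is not the body of wait_for_task_completion; the name `poll_until_terminal`, the `fetch_status` callable, and the status strings are assumptions made for the example.

```python
import asyncio
import time
from typing import Awaitable, Callable

# Assumed status names; the real task API may use different ones.
TERMINAL_STATES = {"completed", "failed"}


async def poll_until_terminal(
    fetch_status: Callable[[], Awaitable[str]],
    timeout: float = 30.0,
    interval: float = 0.5,
) -> str:
    """Await fetch_status() until it returns a terminal state or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = await fetch_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task still {status!r} after {timeout}s")
        await asyncio.sleep(interval)
```

In a test, `fetch_status` would typically wrap an HTTP GET on the task status endpoint through the ASGI client. Using `time.monotonic()` for the deadline avoids surprises if the wall clock changes during a long run.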

Add integration test suite and supporting sample files/utilities.

- tests/data/create_samples.py: script to generate minimal PDF/DOCX/XLSX/PPTX samples (Python stdlib) and write to tests/data/samples/.
- tests/data/samples/*: pre-generated binary sample files used by tests.
- tests/integration/core/helpers.py: shared test helpers (boot_app, HTTPX ASGI client, wait_for_task_completion, wait_for_indexed, is_docling_available).
- tests/integration/core/test_document_lifecycle.py: tests for document endpoints (check-filename, delete-by-filename, upload_path) and full upload/delete lifecycles.
- tests/integration/core/test_file_format_ingestion.py: parametrized ingestion tests across text and binary formats; skips docling-dependent cases when docling-serve is unavailable.
- tests/integration/core/test_settings_and_tasks.py: tests for settings endpoints and task lifecycle (list, status, cancel, upload_path-created tasks).

Tests run against an in-process FastAPI app with live OpenSearch; create_samples.py can be re-run to regenerate sample files.
Add several committed text sample files (md, adoc, html, xhtml, tex, csv) under tests/data/samples and update binary samples. Refactor tests/data/create_samples.py to separate binary_formats and text_formats: generate/write binary content as bytes and write textual samples using write_text for consistency. Update tests/integration/core/test_file_format_ingestion.py to reference committed sample files (SAMPLES_DIR/...) for text/docling-served formats instead of embedding inline content, and keep binary formats as pre-generated samples. This centralizes sample data and makes ingestion tests consistently use filesystem fixtures.
@github-actions github-actions bot added tests enhancement 🔵 New feature or request labels Mar 19, 2026
Replace calls to clients.close() with clients.cleanup() in test teardown and helpers to use the updated clients API and ensure proper cleanup of global clients (avoids aiohttp warnings). Updates docstring in helpers to instruct callers to call clients.cleanup(). Affected files: tests/conftest.py, tests/integration/core/helpers.py, and multiple test modules under tests/integration/core/*.
@github-actions github-actions bot added enhancement 🔵 New feature or request and removed enhancement 🔵 New feature or request labels Mar 19, 2026
Create config dir in CI startup and update integration tests to use the production Langflow ingestion path.

- Makefile: ensure config/ exists (mkdir -p + chmod 777) before bringing up infra in test-ci and test-ci-local.
- tests/integration/core/helpers.py: add is_langflow_available() helper to detect Langflow health.
- tests/integration/core/test_file_format_ingestion.py: switch tests to the Langflow ingestion flow (boot app with Langflow enabled), skip entire test when Langflow is not running, assert uploads return 202 with a task_id, poll tasks until completion (longer timeout), adjust search/skip logic for docling-dependent formats, and update docstrings and messages to reflect the Langflow path.

These changes make the integration tests exercise the real production upload pipeline and avoid false failures when Langflow or docling-serve are unavailable.
@github-actions github-actions bot added enhancement 🔵 New feature or request and removed enhancement 🔵 New feature or request labels Mar 19, 2026
Pre-build a fallback search body that omits `num_candidates` for OpenSearch versions that don't support that field and use it when a RequestError occurs instead of relying on fragile error-string matching. Update logging messages to be clearer and less dependent on specific error text, and ensure disk-space errors still raise OpenSearchDiskSpaceError on both initial and retry attempts. Remove verbose inclusion of the search body in error logs and tidy related log messages.
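
A rough sketch of the pre-built fallback idea follows. The query shape, the `embedding` field name, the `RequestError` stand-in class, and the `search` callable are all assumptions for illustration and are not taken from the actual search code; the disk-space handling mentioned above is left out.

```python
import copy
from typing import Any, Callable


class RequestError(Exception):
    """Stand-in for the search client's request error (hypothetical)."""


def build_search_bodies(
    vector: list[float], k: int = 10, num_candidates: int = 100
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Build the primary body and a fallback without num_candidates."""
    primary = {
        "size": k,
        "query": {
            "knn": {
                # "embedding" is an assumed vector field name.
                "embedding": {
                    "vector": vector,
                    "k": k,
                    "num_candidates": num_candidates,
                }
            }
        },
    }
    fallback = copy.deepcopy(primary)
    del fallback["query"]["knn"]["embedding"]["num_candidates"]
    return primary, fallback


def search_with_fallback(
    search: Callable[[dict[str, Any]], Any], vector: list[float]
) -> Any:
    """Try the primary body; on RequestError retry once with the fallback."""
    primary, fallback = build_search_bodies(vector)
    try:
        return search(primary)
    except RequestError:
        # No inspection of the error text: any request error triggers the retry.
        return search(fallback)
```

Building both bodies up front means the retry path does not depend on parsing the error message, which is the fragility the commit removes.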
@github-actions github-actions bot added backend 🔷 Issues related to backend services (OpenSearch, Langflow, APIs) enhancement 🔵 New feature or request and removed enhancement 🔵 New feature or request labels Mar 19, 2026
Replace search-based verification with a direct OpenSearch index check (/documents/check-filename) in the file ingestion integration test. This avoids relying on search/KNN/embedding behavior, removes the special-case fallback for binary formats, tightens assertions and error messages, and updates docstrings and logging to reflect that the test asserts indexing rather than searchability.
@github-actions github-actions bot added enhancement 🔵 New feature or request and removed enhancement 🔵 New feature or request labels Mar 19, 2026
Add a session-scoped ingestion_report fixture and a pytest_terminal_summary hook (tests/integration/core/conftest.py) to accumulate and print a formatted per-format ingestion report. Update test_file_format_ingestion.py to record PASSED/FAILED/SKIPPED outcomes into the shared fixture instead of hard-failing, add explicit stepwise error handling for file preparation, upload, task polling, and index verification, and include helpers (_record_failure, _print_result). Priority formats (pdf, docx, html) are highlighted and skip reasons are recorded so all formats run and a summary is shown at the end of the test session.
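
The accumulate-then-summarize part of this change can be sketched without pytest. The class below is a simplified assumption of what the shared report object might look like; the names `IngestionReport`, `record`, and `render` and the output layout are hypothetical. In the real suite a session-scoped fixture would hand out one such object and the pytest_terminal_summary hook would print it.

```python
from dataclasses import dataclass, field

# Formats highlighted in the summary, as listed in the commit message.
PRIORITY_FORMATS = ("pdf", "docx", "html")


@dataclass
class IngestionReport:
    """Collects one outcome per format and renders a plain-text summary."""

    results: dict[str, tuple[str, str]] = field(default_factory=dict)

    def record(self, fmt: str, outcome: str, reason: str = "") -> None:
        """Store PASSED/FAILED/SKIPPED for a format, with an optional reason."""
        self.results[fmt] = (outcome, reason)

    def render(self) -> str:
        """Return the summary, priority formats first and marked with '*'."""
        lines = ["Ingestion report"]
        ordered = sorted(
            self.results, key=lambda fmt: (fmt not in PRIORITY_FORMATS, fmt)
        )
        for fmt in ordered:
            outcome, reason = self.results[fmt]
            marker = "*" if fmt in PRIORITY_FORMATS else " "
            suffix = f" ({reason})" if reason else ""
            lines.append(f"{marker} {fmt:<6} {outcome}{suffix}")
        return "\n".join(lines)
```

Recording outcomes instead of asserting immediately lets every format run even when an early one fails, at the cost that the test itself no longer signals failure unless the summary is checked.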
@github-actions github-actions bot added enhancement 🔵 New feature or request and removed enhancement 🔵 New feature or request labels Mar 20, 2026

Remove the old core integration file-format tests and conftest, and add an SDK-based ingestion test suite. Deleted tests/integration/core/conftest.py and tests/integration/core/test_file_format_ingestion.py. Added a session-scoped ingestion_report and pytest_terminal_summary to tests/integration/sdk/conftest.py, and introduced tests/integration/sdk/test_file_format_ingestion.py which exercises ingestion via the SDK against a running OpenRAG instance. The new tests use client.documents.ingest(wait=True) as the ground truth (successful_files > 0), honor skip conditions (OpenRAG or docling-serve unavailable), and record per-format PASSED/FAILED/SKIPPED results. Priority formats (pdf, docx, html) are highlighted in the final report.
@github-actions github-actions bot added enhancement 🔵 New feature or request and removed enhancement 🔵 New feature or request labels Mar 20, 2026
Add a session-scoped require_openrag fixture that probes OpenRAG /health and skips the entire test session if the service is not reachable. Make the existing ensure_onboarding fixture depend on this check. Remove duplicated availability helper, environment URL, and related imports from test_file_format_ingestion.py so tests no longer perform redundant reachability checks and configuration is centralized in conftest.py.
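
A reachability probe of this kind can be written with the standard library alone. The function below is a sketch under assumptions: the name `is_healthy`, the default path, and the injectable `opener` parameter are illustrative and not taken from conftest.py. A session-scoped fixture could call it once and skip the session when it returns False.

```python
import urllib.error
import urllib.request
from typing import Any, Callable


def is_healthy(
    base_url: str,
    path: str = "/health",
    timeout: float = 2.0,
    opener: Callable[..., Any] = urllib.request.urlopen,
) -> bool:
    """Return True if GET base_url + path answers with HTTP 200."""
    url = base_url.rstrip("/") + path
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or a non-2xx response.
        return False
```

The `opener` parameter exists only so the probe can be exercised without a running service; production callers would leave it at its default.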
@github-actions github-actions bot added enhancement 🔵 New feature or request and removed enhancement 🔵 New feature or request labels Mar 20, 2026
Makefile: wait for OpenRAG API readiness at /api/health (both test-ci and test-ci-local), add a retry loop that verifies the health response, print backend/frontend logs and stop services/tear down when readiness fails to aid debugging. tests/data/create_samples.py: rewrite PDF generator to produce correct xref offsets and return bytes; expand PPTX generator to accept a title and multiple body paragraphs, add missing relationships, slide layout, and proper XML namespaces so produced PPTX is more robustly parsable. tests/integration/sdk/conftest.py: probe /api/health instead of the root URL. Binary sample fixtures (docx, pdf, pptx, xlsx) were regenerated to match the updated generators.
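
To make the xref fix concrete: each entry in a PDF cross-reference table must hold the exact byte offset of its object, so the offsets have to be recorded while the file is assembled. The sketch below shows one way to do that. It is not the generator from create_samples.py; the function name `make_minimal_pdf` and the object layout are assumptions.

```python
def make_minimal_pdf(text: str = "Hello") -> bytes:
    """Return a one-page PDF whose xref table holds real byte offsets."""
    # Escape the characters that are special inside a PDF literal string.
    safe = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    stream = f"BT /F1 24 Tf 72 720 Td ({safe}) Tj ET".encode("latin-1")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte position where "N 0 obj" begins
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)
```

Measuring `len(out)` just before each object is appended keeps the table correct even if the content changes, which a hard-coded offset table would not.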
@github-actions github-actions bot added enhancement 🔵 New feature or request and removed enhancement 🔵 New feature or request labels Mar 20, 2026
Labels

backend 🔷 Issues related to backend services (OpenSearch, Langflow, APIs) enhancement 🔵 New feature or request tests
