feat: Add New core integration tests #1194
Draft
edwinjosechittilappilly wants to merge 11 commits into main from
Add integration test suite and supporting sample files/utilities.
- tests/data/create_samples.py: script to generate minimal PDF/DOCX/XLSX/PPTX samples (Python stdlib) and write to tests/data/samples/.
- tests/data/samples/*: pre-generated binary sample files used by tests.
- tests/integration/core/helpers.py: shared test helpers (boot_app, HTTPX ASGI client, wait_for_task_completion, wait_for_indexed, is_docling_available).
- tests/integration/core/test_document_lifecycle.py: tests for document endpoints (check-filename, delete-by-filename, upload_path) and full upload/delete lifecycles.
- tests/integration/core/test_file_format_ingestion.py: parametrized ingestion tests across text and binary formats; skips docling-dependent cases when docling-serve is unavailable.
- tests/integration/core/test_settings_and_tasks.py: tests for settings endpoints and task lifecycle (list, status, cancel, upload_path-created tasks).
Tests run against an in-process FastAPI app with live OpenSearch; create_samples.py can be re-run to regenerate sample files.
Add several committed text sample files (md, adoc, html, xhtml, tex, csv) under tests/data/samples and update binary samples. Refactor tests/data/create_samples.py to separate binary_formats and text_formats: generate/write binary content as bytes and write textual samples using write_text for consistency. Update tests/integration/core/test_file_format_ingestion.py to reference committed sample files (SAMPLES_DIR/...) for text/docling-served formats instead of embedding inline content, and keep binary formats as pre-generated samples. This centralizes sample data and makes ingestion tests consistently use filesystem fixtures.
Replace calls to clients.close() with clients.cleanup() in test teardown and helpers to use the updated clients API and ensure proper cleanup of global clients (avoids aiohttp warnings). Updates docstring in helpers to instruct callers to call clients.cleanup(). Affected files: tests/conftest.py, tests/integration/core/helpers.py, and multiple test modules under tests/integration/core/*.
Create config dir in CI startup and update integration tests to use the production Langflow ingestion path.
- Makefile: ensure config/ exists (mkdir -p + chmod 777) before bringing up infra in test-ci and test-ci-local.
- tests/integration/core/helpers.py: add is_langflow_available() helper to detect Langflow health.
- tests/integration/core/test_file_format_ingestion.py: switch tests to the Langflow ingestion flow (boot app with Langflow enabled), skip the entire test when Langflow is not running, assert uploads return 202 with a task_id, poll tasks until completion (longer timeout), adjust search/skip logic for docling-dependent formats, and update docstrings and messages to reflect the Langflow path.
These changes make the integration tests exercise the real production upload pipeline and avoid false failures when Langflow or docling-serve are unavailable.
Pre-build a fallback search body that omits `num_candidates` for OpenSearch versions that don't support that field and use it when a RequestError occurs instead of relying on fragile error-string matching. Update logging messages to be clearer and less dependent on specific error text, and ensure disk-space errors still raise OpenSearchDiskSpaceError on both initial and retry attempts. Remove verbose inclusion of the search body in error logs and tidy related log messages.
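A minimal sketch of the pre-built fallback idea described in this commit, not the code from the PR. The query layout, the `embedding` field name, the `RequestError` stand-in class, and the `search` callable are all assumptions made for illustration; only the `num_candidates` key comes from the commit message. Disk-space error handling is left out.

```python
import copy


class RequestError(Exception):
    """Hypothetical stand-in for the search client's request error."""


def build_search_bodies(query_vector, k=10, num_candidates=100):
    """Return (primary, fallback) bodies; the fallback omits num_candidates."""
    # The body layout below is illustrative, not the project's actual query.
    primary = {
        "size": k,
        "query": {
            "knn": {
                "embedding": {
                    "vector": query_vector,
                    "k": k,
                    "num_candidates": num_candidates,
                }
            }
        },
    }
    # Build the fallback up front so no error-string parsing is needed later.
    fallback = copy.deepcopy(primary)
    del fallback["query"]["knn"]["embedding"]["num_candidates"]
    return primary, fallback


def search_with_fallback(search, primary, fallback):
    """Try the primary body; on RequestError retry once with the fallback."""
    try:
        return search(primary)
    except RequestError:
        return search(fallback)
```

Because the fallback body already exists before the first request, the retry decision depends only on the exception type rather than on the wording of the error.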
Replace search-based verification with a direct OpenSearch index check (/documents/check-filename) in the file ingestion integration test. This avoids relying on search/KNN/embedding behavior, removes the special-case fallback for binary formats, tightens assertions and error messages, and updates docstrings and logging to reflect that the test asserts indexing rather than searchability.
Add a session-scoped ingestion_report fixture and a pytest_terminal_summary hook (tests/integration/core/conftest.py) to accumulate and print a formatted per-format ingestion report. Update test_file_format_ingestion.py to record PASSED/FAILED/SKIPPED outcomes into the shared fixture instead of hard-failing, add explicit stepwise error handling for file preparation, upload, task polling, and index verification, and include helpers (_record_failure, _print_result). Priority formats (pdf, docx, html) are highlighted and skip reasons are recorded so all formats run and a summary is shown at the end of the test session.
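The accumulate-then-summarize pattern in this commit can be sketched without pytest as a small report object. The class name, method names, and output layout here are hypothetical; only the outcome labels and the priority formats (pdf, docx, html) come from the commit message. In a pytest setup, a session-scoped fixture would presumably hand one such object to every test, and a `pytest_terminal_summary` hook would print the rendered text.

```python
from dataclasses import dataclass, field

# Priority formats as named in the commit message.
PRIORITY_FORMATS = ("pdf", "docx", "html")


@dataclass
class IngestionReport:
    """Collects one outcome per file format instead of failing fast."""

    results: dict = field(default_factory=dict)

    def record(self, fmt, outcome, reason=""):
        # outcome is expected to be "PASSED", "FAILED", or "SKIPPED".
        self.results[fmt] = (outcome, reason)

    def render(self):
        """Return a plain-text summary; priority formats are marked with '*'."""
        lines = ["Ingestion report"]
        for fmt in sorted(self.results):
            outcome, reason = self.results[fmt]
            marker = "*" if fmt in PRIORITY_FORMATS else " "
            suffix = f" ({reason})" if reason else ""
            lines.append(f"{marker} {fmt:<6} {outcome}{suffix}")
        return "\n".join(lines)
```

Recording outcomes rather than asserting immediately is what lets every format run and appear in the final summary.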
Remove the old core integration file-format tests and conftest, and add an SDK-based ingestion test suite. Deleted tests/integration/core/conftest.py and tests/integration/core/test_file_format_ingestion.py. Added a session-scoped ingestion_report and pytest_terminal_summary to tests/integration/sdk/conftest.py, and introduced tests/integration/sdk/test_file_format_ingestion.py which exercises ingestion via the SDK against a running OpenRAG instance. The new tests use client.documents.ingest(wait=True) as the ground truth (successful_files > 0), honor skip conditions (OpenRAG or docling-serve unavailable), and record per-format PASSED/FAILED/SKIPPED results. Priority formats (pdf, docx, html) are highlighted in the final report.
Add a session-scoped require_openrag fixture that probes OpenRAG /health and skips the entire test session if the service is not reachable. Make the existing ensure_onboarding fixture depend on this check. Remove duplicated availability helper, environment URL, and related imports from test_file_format_ingestion.py so tests no longer perform redundant reachability checks and configuration is centralized in conftest.py.
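A reachability probe of the kind this fixture relies on might look like the sketch below. It uses only the standard library; the function name, parameters, and the choice to treat anything other than HTTP 200 as unhealthy are assumptions, not the PR's implementation. A session-scoped fixture could call it once and skip the session when it returns False.

```python
import http.client
import urllib.error
import urllib.request


def is_service_healthy(base_url, path="/health", timeout=2.0):
    """Return True only if GET base_url + path answers with HTTP 200."""
    url = base_url.rstrip("/") + path
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, non-2xx status, or a malformed URL
        # all count as "not reachable" for skip purposes.
        return False
```

Keeping the path as a parameter means the same probe works if the health endpoint moves.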
Makefile: wait for OpenRAG API readiness at /api/health (both test-ci and test-ci-local), add a retry loop that verifies the health response, print backend/frontend logs and stop services/tear down when readiness fails to aid debugging. tests/data/create_samples.py: rewrite PDF generator to produce correct xref offsets and return bytes; expand PPTX generator to accept a title and multiple body paragraphs, add missing relationships, slide layout, and proper XML namespaces so produced PPTX is more robustly parsable. tests/integration/sdk/conftest.py: probe /api/health instead of the root URL. Binary sample fixtures (docx, pdf, pptx, xlsx) were regenerated to match the updated generators.
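To illustrate what "correct xref offsets" means for a hand-built PDF, here is a minimal standard-library sketch. It is not the generator from create_samples.py; the function name, page size, and font choice are assumptions. The key point is that each cross-reference entry stores the byte offset at which its object starts, so offsets are recorded while the output is being assembled.

```python
def build_minimal_pdf(text="Hello"):
    """Build a one-page PDF as bytes, computing xref offsets while writing.

    Assumes `text` is ASCII without parentheses or backslashes.
    """
    stream = f"BT /F1 12 Tf 72 720 Td ({text}) Tj ET".encode("ascii")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte position where "N 0 obj" begins
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)
```

Writing the result with `Path.write_bytes` keeps the offsets valid, since no newline translation is applied.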
This pull request adds comprehensive integration test sample data and utilities to support format ingestion testing. It introduces a script to generate minimal sample files for a variety of document formats (both binary and text), commits these sample files to the repository, and provides new shared helpers for integration tests. These changes enable robust, automated testing of document ingestion pipelines across multiple formats.
Test Sample Data Generation and Files:
- Added create_samples.py script to tests/data/ that generates minimal, valid sample files for binary formats (PDF, DOCX, XLSX, PPTX) and writes standard sample files for text formats (Markdown, AsciiDoc, LaTeX, HTML, XHTML, CSV). This script uses only the Python standard library and ensures consistent, up-to-date test data.
- Added sample files for text formats to tests/data/samples/:
  - AsciiDoc (sample.adoc)
  - Markdown (sample.md)
  - LaTeX (sample.tex)
  - HTML (sample.html)
  - XHTML (sample.xhtml)
  - CSV (sample.csv)
Integration Test Utilities:
- Added helpers.py in tests/integration/core/ with shared async helpers for integration tests, including:
  - boot_app: boots a fresh in-process FastAPI app with configurable settings and index cleanup.
  - wait_for_task_completion: polls for async task completion.
  - wait_for_indexed: waits until a search query returns results.
  - is_docling_available: checks whether the docling-serve service is reachable (required for binary format tests).
These utilities improve test reliability and reduce boilerplate in integration tests.
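The polling behavior attributed to wait_for_task_completion can be sketched roughly as follows. This is an illustrative stand-in rather than the helper from helpers.py: the name `poll_until_done`, the `fetch_status` callable, and the status strings "completed" and "failed" are all assumptions.

```python
import asyncio
import time


async def poll_until_done(fetch_status, timeout=30.0, interval=0.5):
    """Await fetch_status() repeatedly until it reports a terminal state.

    Raises TimeoutError if no terminal state is seen before the deadline.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = await fetch_status()
        if status in ("completed", "failed"):  # hypothetical terminal states
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task still {status!r} after {timeout}s")
        await asyncio.sleep(interval)
```

A test would typically pass a small coroutine that requests the task's status from the app and returns the status string.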