Conversation

@aaronsteers
Contributor

@aaronsteers aaronsteers commented Oct 21, 2025

feat: Add metadata models package with dynamic schema download

Summary

This PR adds a new airbyte_cdk.metadata_models package containing auto-generated Pydantic models for validating connector metadata.yaml files. The models are generated from JSON Schema YAML files maintained in the airbytehq/airbyte repository.

Key changes:

  • Added airbyte_cdk/metadata_models/ package with generated Pydantic models (~3400 lines)
  • Extended bin/generate_component_manifest_files.py to download schemas from GitHub on-demand during build
  • Models use pydantic.v1 compatibility layer for consistency with existing declarative component models
  • Includes comprehensive README with usage examples
  • Added py.typed marker for type hint compliance

Usage example:

from pathlib import Path

import yaml

from airbyte_cdk.metadata_models import ConnectorMetadataDefinitionV0

metadata_yaml = Path("metadata.yaml").read_text()
metadata = ConnectorMetadataDefinitionV0(**yaml.safe_load(metadata_yaml))
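
Since the models target the pydantic.v1 compatibility layer, validation failures can be caught explicitly; a minimal handling sketch:

from pydantic.v1 import ValidationError

try:
    metadata = ConnectorMetadataDefinitionV0(**yaml.safe_load(metadata_yaml))
except ValidationError as exc:
    print(exc)  # enumerates each offending field and the failed constraint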

Review & Testing Checklist for Human

This is a yellow-risk PR: the implementation is straightforward, but it has network dependencies and generated code that need verification:

  • Test end-to-end validation: Try importing and validating a real connector's metadata.yaml file to ensure the generated models work correctly
  • Verify build in CI: Run poetry run poe build in CI to ensure it works in that environment (may hit GitHub rate limits)
  • Consider GitHub authentication: The current implementation makes unauthenticated requests to the GitHub API, which hit rate limits during development. Consider adding authentication or a caching mechanism for production use (a sketch follows below)
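
A hedged sketch of authenticated downloads, assuming a GITHUB_TOKEN environment variable (the variable name is an assumption) and the httpx dependency the build script already uses:

import os

import httpx

def download_schema(url: str) -> str:
    headers = {}
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        # Authenticated requests get a much higher GitHub API rate limit.
        headers["Authorization"] = f"Bearer {token}"
    response = httpx.get(url, headers=headers, follow_redirects=True)
    response.raise_for_status()
    return response.text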

Notes

Requested by: AJ Steers (aj@airbyte.io) / @aaronsteers
Devin session: https://app.devin.ai/sessions/9b487ed33f5842c087a1de30d33db888

Known limitations:

  • GitHub rate limiting: Build process downloads schemas from GitHub without authentication, which can hit rate limits if run frequently (we encountered this during testing). Consider adding caching or GitHub token authentication for CI.
  • Network dependency: Build requires network access to download schemas. No fallback mechanism if GitHub is unavailable (a caching sketch that would double as a fallback follows below).
  • Generated code testing: The generated models have only been tested with a basic import. Comprehensive validation testing with real metadata.yaml files is recommended.
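
One possible shape for the caching/fallback idea; a minimal sketch, assuming the download_schema helper from the checklist sketch above and a hypothetical cache directory:

from pathlib import Path

import httpx

CACHE_DIR = Path(".schema_cache")  # hypothetical location

def fetch_schema_with_cache(url: str) -> str:
    cached = CACHE_DIR / url.rsplit("/", 1)[-1]
    try:
        text = download_schema(url)  # from the sketch above
        CACHE_DIR.mkdir(exist_ok=True)
        cached.write_text(text)
        return text
    except httpx.HTTPError:
        if cached.exists():
            # Fall back to the last successful download when GitHub is unreachable.
            return cached.read_text()
        raise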

Why this approach:

  • Leverages existing datamodel-code-generator infrastructure used for declarative component models
  • Keeps schemas in airbytehq/airbyte repo as the single source of truth
  • No need for separate package management or CI/CD workflows
  • Models automatically stay in sync with schema updates during builds

Summary by CodeRabbit

  • New Features

    • Connector metadata models and two public model types are now exposed for integration.
    • Added a utility to generate connector metadata artifacts as part of the build.
  • Documentation

    • Guidance on regenerating connector metadata models from schema sources.
  • Chores

    • Automated generation of connector metadata artifacts and declarative manifests in the build.
    • Expanded linter, pre-commit, and type-checker exclusions for generated artifacts; updated build tasks.
  • Style

    • Formatting adjustments to declarative model declarations.

- Add airbyte_cdk.metadata_models package with auto-generated Pydantic models
- Models are generated from JSON schemas in airbytehq/airbyte repository
- Schemas are downloaded on-demand during build process (no submodules)
- Uses pydantic.v1 compatibility layer for consistency with declarative models
- Includes comprehensive README with usage examples
- Adds py.typed marker for type hint compliance

The metadata models enable validation of connector metadata.yaml files
using the same schemas maintained in the main Airbyte repository.

Co-Authored-By: AJ Steers <aj@airbyte.io>
Copilot AI review requested due to automatic review settings October 21, 2025 00:39
@devin-ai-integration
Contributor

Original prompt from AJ Steers
Received message in Slack channel #ask-devin-ai:

@Devin Can you research to see if anyone has started or worked on a json schema file or other validation mechanism (pydantic?) for metadata.yaml files?
Thread URL: https://airbytehq-team.slack.com/archives/C08BHPUMEPJ/p1760999875625349?thread_ts=1760999875.625349

Quote of conversation (https://airbytehq-team.slack.com/archives/C02U9R3AF37/p1760999798826039?thread_ts=1760978308.303779&cid=C02U9R3AF37):
> From AJ Steers
> What would give me more confidence on the above personally would be to have a pydantic model and/or JSON schema we could use to validate metadata.yaml files. Then we could assert via static analysis of all files that the new config is "valid" across hundreds of connectors very quickly.
> Posted on October 20, 2025 at 10:36 PM
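
A hedged sketch of that static-analysis idea, assuming the monorepo's conventional connectors layout:

from pathlib import Path

import yaml
from pydantic.v1 import ValidationError

from airbyte_cdk.metadata_models import ConnectorMetadataDefinitionV0

failures = {}
for metadata_file in Path("airbyte-integrations/connectors").glob("*/metadata.yaml"):
    try:
        ConnectorMetadataDefinitionV0(**yaml.safe_load(metadata_file.read_text()))
    except ValidationError as exc:
        failures[metadata_file] = exc

print(f"{len(failures)} invalid metadata.yaml files")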

@devin-ai-integration
Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces a new airbyte_cdk.metadata_models package containing auto-generated Pydantic models for validating connector metadata.yaml files. The models are dynamically generated during build by downloading JSON schemas from the airbytehq/airbyte repository.

Key changes:

  • New metadata_models package with auto-generated Pydantic validation models
  • Build script extension to download schemas from GitHub during build process
  • Comprehensive README with usage examples and documentation

Reviewed Changes

Copilot reviewed 3 out of 35 changed files in this pull request and generated no comments.

File Description
airbyte_cdk/metadata_models/py.typed Marker file for PEP 561 type hint compliance
airbyte_cdk/metadata_models/__init__.py Package initialization re-exporting generated models
airbyte_cdk/metadata_models/README.md Documentation covering usage, available models, and regeneration process

@github-actions github-actions bot added the enhancement New feature or request label Oct 21, 2025
@github-actions

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1760999875-add-metadata-models#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1760999875-add-metadata-models

Helpful Resources

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment

@coderabbitai
Contributor

coderabbitai bot commented Oct 21, 2025

📝 Walkthrough

Walkthrough

Adds tooling and packaging to generate and expose Airbyte connector metadata Pydantic models: new generation script, consolidated JSON schema output, package re-exports for generated models, docs for regeneration, build-task updates, and lint/type-ignore rules for generated artifacts.

Changes

Cohort / File(s) Summary
Package Exports
airbyte_cdk/test/models/connector_metadata/__init__.py, airbyte_cdk/test/models/__init__.py
New package init re-exporting ConnectorMetadataDefinitionV0 and ConnectorTestSuiteOptions from generated models; parent models module updated to surface them publicly.
Generation Script
bin/generate_connector_metadata_files.py
New script to sparse-clone Airbyte schema YAMLs, consolidate YAMLs into a unified metadata_schema.json, run datamodel-codegen to produce a single models.py, post-process imports to pydantic.v1, and write artifacts to output dir.
Build Tasks
pyproject.toml
assemble replaced by a sequence of new tasks assemble-declarative and assemble-metadata; assemble now runs both tasks and invokes the new metadata generator.
Docs
docs/CONTRIBUTING.md
Added "Regenerating Connector Metadata Models" section describing source schemas, generation steps, outputs, and a usage example.
Lint / Type Exclusions
.pre-commit-config.yaml, ruff.toml, mypy.ini
Added exclude patterns and mypy per-target ignore settings to bypass linting/type-checking for generated files (declarative_component_schema.py and connector_metadata/generated/models.py).
Formatting
airbyte_cdk/sources/declarative/models/declarative_component_schema.py
Non-functional formatting changes to Field(...) calls (parenthesization/line-wrapping) only; no behavioral changes.
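
To make the Generation Script row above concrete, here is a hedged sketch of the consolidation step, assuming hypothetical function names and a simple filename-based $ref convention (the real logic lives in bin/generate_connector_metadata_files.py and may differ):

import json
from pathlib import Path

import yaml

def consolidate(schema_dir: Path, out_file: Path) -> None:
    # Load every per-model YAML schema under a definitions key.
    definitions = {
        path.stem: yaml.safe_load(path.read_text())
        for path in sorted(schema_dir.glob("*.yaml"))
    }

    def rewrite_refs(node):
        # Rewrite cross-file refs like "GitInfo.yaml" to "#/definitions/GitInfo".
        if isinstance(node, dict):
            ref = node.get("$ref")
            if isinstance(ref, str) and ref.endswith(".yaml"):
                node["$ref"] = f"#/definitions/{Path(ref).stem}"
            for value in node.values():
                rewrite_refs(value)
        elif isinstance(node, list):
            for item in node:
                rewrite_refs(item)

    rewrite_refs(definitions)
    out_file.write_text(json.dumps({"definitions": definitions}, indent=2))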

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Script as generate_connector_metadata_files.py
    participant GitHub as Remote Schema Repo
    participant Generator as datamodel-codegen (container)
    participant FS as Output Files

    User->>Script: run main()
    Script->>GitHub: clone_schemas_from_github (sparse clone)
    GitHub-->>Script: YAML schema files
    Script->>Script: consolidate_yaml_schemas_to_json
    Script->>Generator: generate_models_from_json_schema (datamodel-codegen)
    Generator-->>Script: generated Python content
    Script->>Script: post-process imports (pydantic.v1) & merge
    Script->>FS: write generated/models.py and metadata_schema.json
    FS-->>User: artifacts ready

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45–50 minutes

  • Areas to review closely:
    • schema consolidation and $ref rewrite logic in bin/generate_connector_metadata_files.py
    • datamodel-codegen invocation and post-processing to ensure valid pydantic.v1 models
    • mypy/ruff/pre-commit exclude patterns for precision (avoid overbroad matches)
    • new public exports for potential import cycles or packaging issues
    • build task changes in pyproject.toml and script invocation correctness

Would you like me to call out any of these files in more detail for review? wdyt?

Suggested reviewers

  • maxi297
  • pnilan
  • dbgold17
  • bazarnov

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 64.71% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title "feat: Add metadata models package with dynamic schema download" directly describes the primary change in the changeset. The title clearly identifies the main feature being added—a new metadata models package—and the key mechanism enabling it (dynamic schema download from GitHub). The title is concise, uses conventional commit format, and contains no vague or noisy terms. While the changeset includes supporting changes like configuration updates and documentation, the title appropriately focuses on the core feature, which aligns with the principle that titles don't need to cover every detail. A teammate scanning the history would immediately understand that this PR introduces a new models package with automated schema fetching capabilities.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch devin/1760999875-add-metadata-models

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (4)
airbyte_cdk/metadata_models/__init__.py (1)

1-1: Consider adding an explicit __all__ for API clarity, wdyt?

While wildcard imports are acceptable for package __init__.py files, defining an explicit __all__ list would make the public API surface more obvious and give you finer control over what gets exported. This is especially useful for generated code where you might want to filter out internal types or helpers.

Example approach:

+__all__ = [
+    "ConnectorMetadataDefinitionV0",
+    # Add other public models as needed
+]
+
 from .generated import *

Alternatively, you could generate the __all__ list in generated.py itself during the build process.

airbyte_cdk/metadata_models/README.md (1)

37-47: Second example assumes undefined metadata_dict, wdyt?

The second usage example references metadata_dict on line 40, but this variable isn't defined in that code block. While readers might understand it comes from the first example, making this example standalone (or explicitly noting it continues from the first) would improve clarity.

Consider either:

  1. Making it self-contained:
from airbyte_cdk.metadata_models import ConnectorMetadataDefinitionV0
import yaml

metadata_dict = yaml.safe_load(Path("path/to/metadata.yaml").read_text())
metadata = ConnectorMetadataDefinitionV0(**metadata_dict)

# Access fields with full type safety
print(f"Connector: {metadata.data.name}")
# ...
  2. Or adding a comment linking to the first example:
# Continuing from the validation example above...
from airbyte_cdk.metadata_models import ConnectorMetadataDefinitionV0

metadata = ConnectorMetadataDefinitionV0(**metadata_dict)
# ...
bin/generate_component_manifest_files.py (2)

26-27: Consider consolidating the YAML file discovery functions, wdyt?

Both get_all_yaml_files_without_ext() and get_all_yaml_files_from_dir() perform the same operation, with the new one being more parameterized. The old function is still used on line 237 for declarative models.

You could simplify by using the new parameterized version everywhere:

-def get_all_yaml_files_without_ext() -> list[str]:
-    return [Path(f).stem for f in glob(f"{LOCAL_YAML_DIR_PATH}/*.yaml")]
-
-
 def get_all_yaml_files_from_dir(directory: str) -> list[str]:
     return [Path(f).stem for f in glob(f"{directory}/*.yaml")]

Then update line 237:

-        declarative_yaml_files = get_all_yaml_files_without_ext()
+        declarative_yaml_files = get_all_yaml_files_from_dir(LOCAL_YAML_DIR_PATH)

Also applies to: 30-32


160-179: Consider extracting common post-processing logic, wdyt?

There's code duplication between post_process_metadata_models and post_process_codegen (lines 136-157). Both functions:

  • Create /generated_post_processed directory
  • Iterate through .py files
  • Replace pydantic imports
  • Write to new files

You could extract the common pattern:

from typing import Callable

async def apply_post_processing(
    codegen_container: dagger.Container,
    transform_fn: Callable[[str], str]
) -> dagger.Container:
    """Apply a transformation function to all generated Python files."""
    codegen_container = codegen_container.with_exec(
        ["mkdir", "/generated_post_processed"], use_entrypoint=True
    )
    for generated_file in await codegen_container.directory("/generated").entries():
        if generated_file.endswith(".py"):
            original_content = await codegen_container.file(
                f"/generated/{generated_file}"
            ).contents()
            
            post_processed_content = transform_fn(original_content)
            
            codegen_container = codegen_container.with_new_file(
                f"/generated_post_processed/{generated_file}", 
                contents=post_processed_content
            )
    return codegen_container

async def post_process_metadata_models(codegen_container: dagger.Container):
    """Post-process metadata models to use pydantic.v1 compatibility layer."""
    def transform(content: str) -> str:
        return content.replace("from pydantic", "from pydantic.v1")
    return await apply_post_processing(codegen_container, transform)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6b747fe and ff01ea0.

⛔ Files ignored due to path filters (31)
  • airbyte_cdk/metadata_models/generated/ActorDefinitionResourceRequirements.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/AirbyteInternal.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/AllowedHosts.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ConnectorBreakingChanges.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ConnectorBuildOptions.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ConnectorIPCOptions.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ConnectorMetadataDefinitionV0.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ConnectorMetrics.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ConnectorPackageInfo.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ConnectorRegistryDestinationDefinition.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ConnectorRegistryReleases.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ConnectorRegistrySourceDefinition.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ConnectorRegistryV0.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ConnectorReleases.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ConnectorTestSuiteOptions.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/GeneratedFields.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/GitInfo.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/JobType.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/NormalizationDestinationDefinitionConfig.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/RegistryOverrides.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ReleaseStage.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/RemoteRegistries.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/ResourceRequirements.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/RolloutConfiguration.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/Secret.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/SecretStore.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/SourceFileInfo.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/SuggestedStreams.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/SupportLevel.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/TestConnections.py is excluded by !**/generated/**
  • airbyte_cdk/metadata_models/generated/__init__.py is excluded by !**/generated/**
📒 Files selected for processing (4)
  • airbyte_cdk/metadata_models/README.md (1 hunks)
  • airbyte_cdk/metadata_models/__init__.py (1 hunks)
  • airbyte_cdk/metadata_models/py.typed (1 hunks)
  • bin/generate_component_manifest_files.py (3 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
airbyte_cdk/metadata_models/README.md

8-8: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (3)
airbyte_cdk/metadata_models/py.typed (1)

1-1: Looks good! Quick packaging question though.

The py.typed marker is correctly placed and will signal to type checkers that the package is fully typed. Just to confirm—have you verified that py.typed will be included in the package distribution? This is sometimes missed if the build configuration (pyproject.toml, setup.py, or MANIFEST.in) doesn't explicitly include package marker files. Wdyt?

bin/generate_component_manifest_files.py (2)

5-5: LGTM on the new imports!

The addition of tempfile for temporary schema storage and httpx for async HTTP requests are appropriate choices for this functionality.

Also applies to: 11-11


224-231: Clarify mutual exclusivity of post_process and metadata_models flags, wdyt?

The conditional logic here suggests post_process and metadata_models should be mutually exclusive, but the function signature allows both to be True. If both are True, the elif on line 227 means metadata_models takes precedence and post_process is silently ignored.

Consider either:

  1. Adding validation to ensure only one is True:
 ) -> None:
     """Generate Pydantic models from YAML schemas using datamodel-codegen."""
+    if post_process and metadata_models:
+        raise ValueError("post_process and metadata_models cannot both be True")
+    
     init_module_content = generate_init_module_content(yaml_files)
  2. Or simplifying to a single enum parameter:
async def generate_models_from_schemas(
    dagger_client: dagger.Client,
    yaml_dir_path: str,
    output_dir_path: str,
    yaml_files: list[str],
    processing_mode: Literal["none", "declarative", "metadata"] = "none",
) -> None:

This would make the intent clearer and prevent misuse.

Comment on lines 8 to 10
```
airbyte-ci/connectors/metadata_service/lib/metadata_service/models/src/
```
Contributor


⚠️ Potential issue | 🟡 Minor

Add language specifier to fenced code block, wdyt?

The fenced code block showing the repository path should have a language specifier for proper syntax highlighting and markdown lint compliance.

Apply this diff:

-```
+```text
 airbyte-ci/connectors/metadata_service/lib/metadata_service/models/src/

🧰 Tools

🪛 markdownlint-cli2 (0.18.1)

8-8: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents

In airbyte_cdk/metadata_models/README.md around lines 8 to 10, the fenced code block lacks a language specifier; update the opening fence from ``` to ```text so the block becomes a plain-text fenced code block (keep the content and closing fence unchanged).

@github-actions

github-actions bot commented Oct 21, 2025

PyTest Results (Fast)

3 817 tests  ±0   3 805 ✅ ±0   6m 31s ⏱️ +2s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 3c2a4f8. ± Comparison against base commit c0ae1c0.

♻️ This comment has been updated with latest results.

@github-actions

github-actions bot commented Oct 21, 2025

PyTest Results (Full)

3 820 tests  ±0   3 808 ✅ ±0   10m 47s ⏱️ -26s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 3c2a4f8. ± Comparison against base commit c0ae1c0.

♻️ This comment has been updated with latest results.

Contributor Author

@aaronsteers aaronsteers left a comment


Devin, let's move the generated models module into this module instead: airbyte_cdk.test.models.connector_metadata

…etadata

- Moved from airbyte_cdk.metadata_models.generated to airbyte_cdk.test.models.connector_metadata
- Updated build script to output to new location
- Updated README with new import paths

Co-Authored-By: AJ Steers <aj@airbyte.io>
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 15

♻️ Duplicate comments (13)
airbyte_cdk/test/models/connector_metadata/README.md (1)

8-10: Address the markdown lint issue.

A previous review flagged this fenced code block for missing a language specifier. While this is a minor formatting issue, it's good practice to fix for markdown compliance.

Apply this diff:

-```
+```text
 airbyte-ci/connectors/metadata_service/lib/metadata_service/models/src/

airbyte_cdk/test/models/connector_metadata/ReleaseStage.py (1)

9-13: Same duplication issue as SupportLevel, wdyt?

The ReleaseStage enum is also defined identically in at least 5 other generated files (ConnectorMetadataDefinitionV0.py, ConnectorRegistryDestinationDefinition.py, ConnectorRegistryReleases.py, ConnectorRegistrySourceDefinition.py, ConnectorRegistryV0.py), causing the same type identity and import conflicts described in the SupportLevel comment.

This is part of the same systemic code generation issue where shared types are being duplicated rather than imported.

airbyte_cdk/test/models/connector_metadata/SourceFileInfo.py (1)

11-16: Same duplication issue affects SourceFileInfo, wdyt?

The SourceFileInfo model is defined identically in at least 6 other files (ConnectorMetadataDefinitionV0.py, ConnectorRegistryDestinationDefinition.py, ConnectorRegistryReleases.py, ConnectorRegistrySourceDefinition.py, ConnectorRegistryV0.py, GeneratedFields.py). This is part of the same systemic code generation issue.

Since this is a Pydantic model rather than just an enum, the type conflicts are even more severe—instances won't validate against models from different modules.

airbyte_cdk/test/models/connector_metadata/ResourceRequirements.py (1)

11-18: ResourceRequirements also duplicated across 5+ files, wdyt?

The ResourceRequirements model is defined identically in at least 5 other files (ActorDefinitionResourceRequirements.py, ConnectorMetadataDefinitionV0.py, ConnectorRegistryDestinationDefinition.py, ConnectorRegistryReleases.py, RegistryOverrides.py), part of the same code generation issue.

airbyte_cdk/test/models/connector_metadata/TestConnections.py (1)

9-14: TestConnections duplicated in 2 files, wdyt?

The TestConnections model is defined identically in ConnectorMetadataDefinitionV0.py (lines 56-61) and ConnectorTestSuiteOptions.py (lines 31-36), part of the same code generation issue.

airbyte_cdk/test/models/connector_metadata/RemoteRegistries.py (1)

11-23: Both PyPi and RemoteRegistries duplicated, wdyt?

Both PyPi (lines 11-16) and RemoteRegistries (lines 19-23) are defined identically in ConnectorMetadataDefinitionV0.py (PyPi at lines 203-208, RemoteRegistries at lines 322-326), part of the same code generation issue.

airbyte_cdk/test/models/connector_metadata/GeneratedFields.py (1)

13-32: GitInfo duplication already flagged.

This GitInfo definition is identical to the one in GitInfo.py. See the earlier comment on that file regarding the broader duplication pattern.

airbyte_cdk/test/models/connector_metadata/GitInfo.py (1)

12-31: Duplicate model definition - part of broader pattern.

The GitInfo model is duplicated across at least 6 files in this package (ConnectorMetadataDefinitionV0.py, ConnectorRegistryDestinationDefinition.py, ConnectorRegistryReleases.py, ConnectorRegistrySourceDefinition.py, ConnectorRegistryV0.py, and GeneratedFields.py). This is the most widely duplicated model in the codebase.

Consider this as part of the broader duplication issue flagged in Secret.py. The generation script should be configured to create these shared models once and use imports. Wdyt?

airbyte_cdk/test/models/connector_metadata/RolloutConfiguration.py (1)

11-29: Model structure looks good; duplication noted.

The RolloutConfiguration model has well-chosen defaults and appropriate validation constraints (percentage ranges, minimum delay). However, it's duplicated in ConnectorMetadataDefinitionV0.py, ConnectorRegistryReleases.py, and ConnectorRegistrySourceDefinition.py.

This is part of the broader duplication pattern. Wdyt about consolidating these shared models?

airbyte_cdk/test/models/connector_metadata/AirbyteInternal.py (1)

29-39: Model structure is appropriate; duplication noted.

The AirbyteInternal model correctly uses Extra.allow (unlike most other models in this package that use Extra.forbid), which is appropriate for internal/extensible metadata. The defaults are sensible (isEnterprise=False, requireVersionIncrementsInPullRequests=True).

However, this model along with the Sl and Ql enums is duplicated in ConnectorMetadataDefinitionV0.py, ConnectorRegistryDestinationDefinition.py, and ConnectorRegistrySourceDefinition.py—part of the broader duplication pattern.

airbyte_cdk/test/models/connector_metadata/ConnectorBreakingChanges.py (1)

13-70: Breaking change models are well-structured; duplication noted.

The breaking change models use appropriate Pydantic v1 patterns:

  • const=True for type discrimination in StreamBreakingChangeScope
  • __root__ wrappers for dict-based schemas
  • date and AnyUrl types for proper validation

However, all five models are duplicated in ConnectorMetadataDefinitionV0.py—part of the broader duplication pattern discussed in earlier files.

airbyte_cdk/test/models/connector_metadata/ConnectorTestSuiteOptions.py (1)

19-49: Duplicate definitions: Should import from Secret.py instead.

SecretStore (lines 19-29) and Secret (lines 40-49) are identical to the models defined in Secret.py within the same PR. Since both files are in the same package (airbyte_cdk/test/models/connector_metadata/), ConnectorTestSuiteOptions.py should import these models rather than redefining them:

from airbyte_cdk.test.models.connector_metadata.Secret import Secret, SecretStore

This would eliminate the duplication and ensure consistency. Wdyt?

airbyte_cdk/test/models/connector_metadata/ActorDefinitionResourceRequirements.py (1)

22-48: Resource requirements models are well-structured; duplication noted.

The models provide good flexibility with global defaults and per-job overrides. The JobType enum comprehensively covers connector operations.

However, all four models are duplicated in ConnectorMetadataDefinitionV0.py—part of the broader duplication pattern.

🧹 Nitpick comments (22)
airbyte_cdk/test/models/connector_metadata/ConnectorMetrics.py (1)

12-15: Consider stronger typing for metrics fields, wdyt?

The ConnectorMetrics class uses Optional[Any] for all three fields (all, cloud, oss), which sacrifices type safety. This might be intentional for flexible metric schemas, but if the metrics structure is known, consider using more specific types (e.g., Optional[ConnectorMetric] or a typed dictionary).

That said, the flexibility might be needed if metrics schemas vary. What's the rationale here?
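
If the metrics shape is in fact known, a typed alternative could look like this; a hedged sketch with hypothetical field names, not the upstream schema:

from typing import Optional

from pydantic.v1 import BaseModel

class ConnectorMetric(BaseModel):
    # Hypothetical fields; the real metric shape would come from the schema.
    usage: Optional[str] = None
    sync_success_rate: Optional[str] = None

class ConnectorMetrics(BaseModel):
    all: Optional[ConnectorMetric] = None
    cloud: Optional[ConnectorMetric] = None
    oss: Optional[ConnectorMetric] = None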

bin/generate_component_manifest_files.py (1)

160-179: Consider extracting the shared post-processing logic, wdyt?

The post_process_metadata_models function has nearly identical logic to post_process_codegen (lines 136-157), both iterating through generated files and applying string replacements. The main difference is that post_process_metadata_models only replaces pydantic imports while post_process_codegen does additional replacements.

Consider extracting the common pattern into a shared helper:

+async def post_process_generated_files(
+    codegen_container: dagger.Container,
+    replacements: list[tuple[str, str]],
+    apply_deprecation_fix: bool = False
+) -> dagger.Container:
+    """Apply string replacements to generated Python files."""
+    codegen_container = codegen_container.with_exec(
+        ["mkdir", "/generated_post_processed"], use_entrypoint=True
+    )
+    for generated_file in await codegen_container.directory("/generated").entries():
+        if generated_file.endswith(".py"):
+            original_content = await codegen_container.file(
+                f"/generated/{generated_file}"
+            ).contents()
+            
+            post_processed_content = original_content
+            for old, new in replacements:
+                post_processed_content = post_processed_content.replace(old, new)
+            
+            if apply_deprecation_fix:
+                post_processed_content = replace_base_model_for_classes_with_deprecated_fields(
+                    post_processed_content
+                )
+            
+            codegen_container = codegen_container.with_new_file(
+                f"/generated_post_processed/{generated_file}", contents=post_processed_content
+            )
+    return codegen_container

 async def post_process_codegen(codegen_container: dagger.Container):
-    codegen_container = codegen_container.with_exec(
-        ["mkdir", "/generated_post_processed"], use_entrypoint=True
-    )
-    for generated_file in await codegen_container.directory("/generated").entries():
-        if generated_file.endswith(".py"):
-            original_content = await codegen_container.file(
-                f"/generated/{generated_file}"
-            ).contents()
-            post_processed_content = original_content.replace(
-                " _parameters:", " parameters:"
-            ).replace("from pydantic", "from pydantic.v1")
-
-            post_processed_content = replace_base_model_for_classes_with_deprecated_fields(
-                post_processed_content
-            )
-
-            codegen_container = codegen_container.with_new_file(
-                f"/generated_post_processed/{generated_file}", contents=post_processed_content
-            )
-    return codegen_container
+    return await post_process_generated_files(
+        codegen_container,
+        replacements=[(" _parameters:", " parameters:"), ("from pydantic", "from pydantic.v1")],
+        apply_deprecation_fix=True
+    )

 async def post_process_metadata_models(codegen_container: dagger.Container):
     """Post-process metadata models to use pydantic.v1 compatibility layer."""
-    codegen_container = codegen_container.with_exec(
-        ["mkdir", "/generated_post_processed"], use_entrypoint=True
-    )
-    for generated_file in await codegen_container.directory("/generated").entries():
-        if generated_file.endswith(".py"):
-            original_content = await codegen_container.file(
-                f"/generated/{generated_file}"
-            ).contents()
-            
-            post_processed_content = original_content.replace(
-                "from pydantic", "from pydantic.v1"
-            )
-            
-            codegen_container = codegen_container.with_new_file(
-                f"/generated_post_processed/{generated_file}", contents=post_processed_content
-            )
-    return codegen_container
+    return await post_process_generated_files(
+        codegen_container,
+        replacements=[("from pydantic", "from pydantic.v1")],
+        apply_deprecation_fix=False
+    )
airbyte_cdk/test/models/connector_metadata/ActorDefinitionResourceRequirements.py (1)

12-19: Consider adding validation for resource requirement formats.

The ResourceRequirements fields are Optional[str] with no format validation. While this flexibility allows Kubernetes-style values like "100m" or "1Gi", it also accepts invalid values like "abc" or "100xyz" that would fail at runtime.

Since these are auto-generated models, would it make sense to add format validation in the upstream YAML schema (e.g., regex patterns or custom validators)? Wdyt?
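
For illustration, an approximate Kubernetes-quantity pattern via pydantic.v1's constr; the regex is an assumption, not the upstream schema's rule:

from typing import Optional

from pydantic.v1 import BaseModel, constr

# Approximation of Kubernetes resource quantities like "100m", "1Gi", "0.5".
K8sQuantity = constr(regex=r"^\d+(\.\d+)?(m|k|M|G|T|Ki|Mi|Gi|Ti)?$")

class ResourceRequirements(BaseModel):
    cpu_request: Optional[K8sQuantity] = None
    cpu_limit: Optional[K8sQuantity] = None
    memory_request: Optional[K8sQuantity] = None
    memory_limit: Optional[K8sQuantity] = None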

airbyte_cdk/test/models/connector_metadata/GeneratedFields.py (1)

49-58: Consider consolidating identical enums.

Usage and SyncSuccessRate have identical members (low, medium, high). If these represent the same conceptual scale, they could be consolidated into a single enum like MetricLevel to reduce duplication. However, keeping them separate provides better semantic clarity if they represent distinct dimensions.

Wdyt—worth consolidating or better to keep separate for clarity?
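
The consolidated form would be small; a sketch using the hypothetical MetricLevel name from the comment above:

from enum import Enum

class MetricLevel(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"

# Both fields would then share one type, e.g. (illustrative only):
# usage: Optional[MetricLevel]
# sync_success_rate: Optional[MetricLevel]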

airbyte_cdk/test/models/connector_metadata/RegistryOverrides.py (1)

12-18: Should Extra be forbid on simple leaf models to catch typos?

AllowedHosts/SuggestedStreams/NormalizationDestinationDefinitionConfig use Extra.allow, which can mask misspelled fields on these small shapes. Would switching to Extra.forbid for these specific leaves help catch config mistakes, or is “additionalProperties” intentionally allowed in the upstream schema, wdyt?

Also applies to: 22-37
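
For reference, the Extra.forbid variant of one leaf model; a sketch mirroring the pydantic.v1 style of the generated files:

from typing import List, Optional

from pydantic.v1 import BaseModel, Extra

class AllowedHosts(BaseModel):
    class Config:
        # Reject unknown keys so misspelled fields fail fast.
        extra = Extra.forbid

    hosts: Optional[List[str]] = None  # field name assumed from the model's purpose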

airbyte_cdk/test/models/connector_metadata/ConnectorReleases.py (2)

43-44: Prefer Literal for const fields for clearer runtime checks

Using Any with const works but is looser than necessary. Shall we make this a Literal for better type clarity, wdyt?

-from typing import Any, Dict, List, Optional
+from typing import Any, Dict, List, Optional, Literal
@@ class StreamBreakingChangeScope(BaseModel):
-    scopeType: Any = Field("stream", const=True)
+    scopeType: Literal["stream"] = "stream"

17-32: Add cross-field check: initialPercentage ≤ maxPercentage

A simple validator will prevent inconsistent rollout configs. Should we add it, wdyt?

 class RolloutConfiguration(BaseModel):
@@
     advanceDelayMinutes: Optional[conint(ge=10)] = Field(
         10,
         description="The number of minutes to wait before advancing the rollout percentage.",
     )
+
+    from pydantic.v1 import root_validator
+
+    @root_validator
+    def _validate_rollout_bounds(cls, values):
+        init = values.get("initialPercentage")
+        maxp = values.get("maxPercentage")
+        if init is not None and maxp is not None and init > maxp:
+            raise ValueError("initialPercentage must be ≤ maxPercentage")
+        return values
airbyte_cdk/test/models/connector_metadata/ConnectorRegistrySourceDefinition.py (1)

298-301: Tighten URL/time/number types and add a small integrity check

Would you be open to these stricter types and a light data check, wdyt?

  • Use AnyUrl where fields are URLs:
-    documentationUrl: str
+    documentationUrl: AnyUrl
@@
-    sbomUrl: Optional[str] = Field(None, description="URL to the SBOM file")
+    sbomUrl: Optional[AnyUrl] = Field(None, description="URL to the SBOM file")
  • Use datetime for timestamps in SourceFileInfo:
-    metadata_last_modified: Optional[str] = None
-    registry_entry_generated_at: Optional[str] = None
+    metadata_last_modified: Optional[datetime] = None
+    registry_entry_generated_at: Optional[datetime] = None
  • Enforce positive values for maxSecondsBetweenMessages:
-from typing import Any, Dict, List, Optional, Union
+from typing import Any, Dict, List, Optional, Union
@@
-    maxSecondsBetweenMessages: Optional[int] = Field(
+    maxSecondsBetweenMessages: Optional[conint(ge=1)] = Field(
         None,
         description="Number of seconds allowed between 2 airbyte protocol messages. The source will timeout if this delay is reach",
     )
  • Ensure unique jobType in jobSpecific (same pattern as suggested elsewhere):
 class ActorDefinitionResourceRequirements(BaseModel):
@@
     jobSpecific: Optional[List[JobTypeResourceLimit]] = None
+    from pydantic.v1 import validator
+    @validator("jobSpecific")
+    def _unique_job_types(cls, v):
+        if not v:
+            return v
+        seen = set()
+        for item in v:
+            if item.jobType in seen:
+                raise ValueError(f"duplicate jobType '{item.jobType.value}' in jobSpecific")
+            seen.add(item.jobType)
+        return v

Also applies to: 331-334, 240-241, 165-170, 186-188, 326-329, 247-252

airbyte_cdk/test/models/connector_metadata/ConnectorRegistryV0.py (1)

303-304: Optional tightenings: URLs, datetimes, and rollout bounds

Would you consider these small improvements, wdyt?

  • Use AnyUrl for documentationUrl and sbomUrl:
-    documentationUrl: str
+    documentationUrl: AnyUrl
@@
-    sbomUrl: Optional[str] = Field(None, description="URL to the SBOM file")
+    sbomUrl: Optional[AnyUrl] = Field(None, description="URL to the SBOM file")
  • Use datetime for SourceFileInfo timestamps:
-    metadata_last_modified: Optional[str] = None
-    registry_entry_generated_at: Optional[str] = None
+    metadata_last_modified: Optional[datetime] = None
+    registry_entry_generated_at: Optional[datetime] = None
  • Add rollout bound check:
 class RolloutConfiguration(BaseModel):
@@
     advanceDelayMinutes: Optional[conint(ge=10)] = Field(
         10,
         description="The number of minutes to wait before advancing the rollout percentage.",
     )
+    from pydantic.v1 import root_validator
+    @root_validator
+    def _validate_rollout_bounds(cls, values):
+        init, maxp = values.get("initialPercentage"), values.get("maxPercentage")
+        if init is not None and maxp is not None and init > maxp:
+            raise ValueError("initialPercentage must be ≤ maxPercentage")
+        return values

Also applies to: 395-399, 240-241, 183-189, 79-94

airbyte_cdk/test/models/connector_metadata/ConnectorRegistryReleases.py (1)

236-241: Minor hardening: URLs, datetimes, Literal scope type, rollout bound, uniqueness

Would you consider these small improvements, wdyt?

  • Use AnyUrl for sbomUrl:
-    sbomUrl: Optional[str] = Field(None, description="URL to the SBOM file")
+    sbomUrl: Optional[AnyUrl] = Field(None, description="URL to the SBOM file")
  • Use datetime for SourceFileInfo timestamps:
-    metadata_last_modified: Optional[str] = None
-    registry_entry_generated_at: Optional[str] = None
+    metadata_last_modified: Optional[datetime] = None
+    registry_entry_generated_at: Optional[datetime] = None
  • Prefer Literal for scopeType:
-from typing import Any, Dict, List, Optional, Union
+from typing import Any, Dict, List, Optional, Union, Literal
@@ class StreamBreakingChangeScope(BaseModel):
-    scopeType: Any = Field("stream", const=True)
+    scopeType: Literal["stream"] = "stream"
  • Add rollout bounds validator:
 class RolloutConfiguration(BaseModel):
@@
     advanceDelayMinutes: Optional[conint(ge=10)] = Field(
         10,
         description="The number of minutes to wait before advancing the rollout percentage.",
     )
+    from pydantic.v1 import root_validator
+    @root_validator
+    def _validate_rollout_bounds(cls, values):
+        init, maxp = values.get("initialPercentage"), values.get("maxPercentage")
+        if init is not None and maxp is not None and init > maxp:
+            raise ValueError("initialPercentage must be ≤ maxPercentage")
+        return values
  • Ensure unique jobType in jobSpecific:
 class ActorDefinitionResourceRequirements(BaseModel):
@@
     jobSpecific: Optional[List[JobTypeResourceLimit]] = None
+    from pydantic.v1 import validator
+    @validator("jobSpecific")
+    def _unique_job_types(cls, v):
+        if not v:
+            return v
+        seen = set()
+        for item in v:
+            if item.jobType in seen:
+                raise ValueError(f"duplicate jobType '{item.jobType.value}' in jobSpecific")
+            seen.add(item.jobType)
+        return v

Also applies to: 165-170, 146-161, 221-225, 35-37, 18-32

airbyte_cdk/test/models/connector_metadata/ConnectorRegistryDestinationDefinition.py (6)

1-3: Reproducibility: embed generator + schema provenance?

Since builds fetch schemas at build time, could we stamp the generator version, source schema URL, and commit SHA in this header to make artifacts reproducible and debuggable, wdyt?


101-111: Use Literal for const field to tighten validation and typing

scopeType is effectively a constant. Shall we use Literal["stream"] instead of Any+const for stronger type-checking and clearer errors, wdyt?

Apply this diff:

-from typing import Any, Dict, List, Optional, Union
+from typing import Any, Dict, List, Optional, Union, Literal
@@
-class StreamBreakingChangeScope(BaseModel):
+class StreamBreakingChangeScope(BaseModel):
@@
-    scopeType: Any = Field("stream", const=True)
+    scopeType: Literal["stream"] = Field("stream", const=True)

236-241: URL typing: prefer AnyUrl for sbomUrl

sbomUrl looks like a URL. Would you switch to AnyUrl for stricter validation, wdyt?

-    sbomUrl: Optional[str] = Field(None, description="URL to the SBOM file")
+    sbomUrl: Optional[AnyUrl] = Field(None, description="URL to the SBOM file")

323-327: Semver: add a regex constraint on protocolVersion

If protocolVersion is semver, shall we add a constraint to avoid accidental non-semver strings, wdyt?

-    protocolVersion: Optional[str] = Field(
+    protocolVersion: Optional[constr(regex=r'^\d+\.\d+\.\d+(?:-[0-9A-Za-z-.]+)?$')] = Field(
         None, description="the Airbyte Protocol version supported by the connector"
     )

294-301: documentationUrl/iconUrl: widen refactor scope—apply across all registry models

The suggestion to use AnyUrl for validation is sound. However, the verification found this pattern repeated in 8 locations across 4 files:

  • ConnectorRegistryV0.py (lines 303, 395)
  • ConnectorRegistrySourceDefinition.py (lines 298, 389)
  • ConnectorRegistryReleases.py (lines 335, 389)
  • ConnectorRegistryDestinationDefinition.py (lines 298, 390)

Rather than fixing just the target file, consider coordinating this into a single refactor pass across all registry models to maintain consistency. With CDK 2.0, Pydantic V2 is available, so AnyUrl is a valid choice. Would updating all four files at once work for your workflow, or should we proceed incrementally?


436-439: Consider adding locals() and including SourceDefinition for robustness and consistency

Your suggestion is spot on. The file defines ConnectorRegistrySourceDefinition at line 382, yet it's not receiving an update_forward_refs() call while the other three registry classes are. Passing locals() also aligns with Pydantic best practices for ensuring forward references resolve against all types in the current module scope.

The refactored version would look like:

_ns = locals()
ConnectorRegistryDestinationDefinition.update_forward_refs(**_ns)
ConnectorRegistryReleases.update_forward_refs(**_ns)
ConnectorReleaseCandidates.update_forward_refs(**_ns)
VersionReleaseCandidate.update_forward_refs(**_ns)
ConnectorRegistrySourceDefinition.update_forward_refs(**_ns)

This makes the handling symmetric and more resilient. Worth adopting?

airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (6)

1-3: Reproducibility: capture generator + schema commit in header

To help future debugging (and rate-limit workarounds), could we include the datamodel-codegen version, source schema URL(s), and pinned commit SHA here, wdyt?


166-172: Use Literal for const field

Same as the other module: would you switch scopeType to Literal["stream"] for stricter typing, wdyt?

-    scopeType: Any = Field("stream", const=True)
+    scopeType: Literal["stream"] = Field("stream", const=True)

330-335: URL typing: sbomUrl could be AnyUrl

Align with other URL fields to validate early. Shall we switch, wdyt?

-    sbomUrl: Optional[str] = Field(None, description="URL to the SBOM file")
+    sbomUrl: Optional[AnyUrl] = Field(None, description="URL to the SBOM file")

397-404: connectorSubtype override: align type with enum while keeping flexibility

For RegistryOverrides.connectorSubtype, would you type it as Optional[Union[str, ConnectorSubtype]] so we validate known values but still accept future strings, wdyt?

-    connectorSubtype: Optional[str] = None
+    connectorSubtype: Optional[Union[str, ConnectorSubtype]] = None

441-488: Standardize documentationUrl type across metadata and registry definitions

The inconsistency is confirmed: ConnectorMetadataDefinitionV0.py uses AnyUrl for documentationUrl (line 452), while all registry definition files (ConnectorRegistry*) use str (6 instances across multiple files). This divergence could introduce parsing surprises when the same field is handled in different contexts. Would standardizing on AnyUrl across both codegen flows make sense for your use cases?


471-474: Consider refactoring tags field to use default_factory=list with non-optional type

The current implementation uses a mutable default argument with an Optional type, which is a Python anti-pattern. The suggested change to List[str] = Field(default_factory=list, ...) aligns with Pydantic best practices and eliminates the mutable default concern. A broader search of the codebase didn't surface other similar patterns, so this appears to be an isolated instance. Wdyt about making this change?

-    tags: Optional[List[str]] = Field(
-        [],
-        description="An array of tags that describe the connector. E.g: language:python, keyword:rds, etc.",
-    )
+    tags: List[str] = Field(
+        default_factory=list,
+        description="An array of tags that describe the connector. E.g: language:python, keyword:rds, etc.",
+    )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ff01ea0 and 63930c6.

📒 Files selected for processing (34)
  • airbyte_cdk/test/models/connector_metadata/ActorDefinitionResourceRequirements.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/AirbyteInternal.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/AllowedHosts.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ConnectorBreakingChanges.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ConnectorBuildOptions.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ConnectorIPCOptions.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ConnectorMetrics.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ConnectorPackageInfo.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ConnectorRegistryDestinationDefinition.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ConnectorRegistryReleases.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ConnectorRegistrySourceDefinition.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ConnectorRegistryV0.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ConnectorReleases.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ConnectorTestSuiteOptions.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/GeneratedFields.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/GitInfo.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/JobType.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/NormalizationDestinationDefinitionConfig.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/README.md (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/RegistryOverrides.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ReleaseStage.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/RemoteRegistries.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/ResourceRequirements.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/RolloutConfiguration.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/Secret.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/SecretStore.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/SourceFileInfo.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/SuggestedStreams.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/SupportLevel.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/TestConnections.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/__init__.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/py.typed (1 hunks)
  • bin/generate_component_manifest_files.py (3 hunks)
✅ Files skipped from review due to trivial changes (3)
  • airbyte_cdk/test/models/connector_metadata/py.typed
  • airbyte_cdk/test/models/connector_metadata/__init__.py
  • airbyte_cdk/test/models/connector_metadata/JobType.py
🧰 Additional context used
🧬 Code graph analysis (21)
airbyte_cdk/test/models/connector_metadata/SupportLevel.py (5)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (1)
  • SupportLevel (72-75)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryDestinationDefinition.py (1)
  • SupportLevel (21-24)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryReleases.py (1)
  • SupportLevel (66-69)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistrySourceDefinition.py (1)
  • SupportLevel (28-31)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryV0.py (1)
  • SupportLevel (21-24)
airbyte_cdk/test/models/connector_metadata/AllowedHosts.py (6)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (7)
  • AllowedHosts (78-85)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
  • Config (107-108)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryDestinationDefinition.py (1)
  • AllowedHosts (65-72)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryReleases.py (1)
  • AllowedHosts (92-99)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistrySourceDefinition.py (1)
  • AllowedHosts (54-61)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryV0.py (1)
  • AllowedHosts (65-72)
airbyte_cdk/test/models/connector_metadata/RegistryOverrides.py (1)
  • AllowedHosts (12-19)
airbyte_cdk/test/models/connector_metadata/ResourceRequirements.py (5)
airbyte_cdk/test/models/connector_metadata/ActorDefinitionResourceRequirements.py (4)
  • ResourceRequirements (12-19)
  • Config (13-14)
  • Config (33-34)
  • Config (41-42)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (6)
  • ResourceRequirements (116-123)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryDestinationDefinition.py (1)
  • ResourceRequirements (27-34)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryReleases.py (1)
  • ResourceRequirements (72-79)
airbyte_cdk/test/models/connector_metadata/RegistryOverrides.py (1)
  • ResourceRequirements (50-57)
airbyte_cdk/test/models/connector_metadata/Secret.py (3)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (7)
  • SecretStore (44-54)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
  • Secret (296-305)
airbyte_cdk/test/models/connector_metadata/ConnectorTestSuiteOptions.py (2)
  • SecretStore (19-29)
  • Secret (40-49)
airbyte_cdk/test/models/connector_metadata/SecretStore.py (1)
  • SecretStore (11-21)
airbyte_cdk/test/models/connector_metadata/ConnectorIPCOptions.py (1)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (11)
  • SupportedSerializationEnum (269-272)
  • SupportedTransportEnum (275-277)
  • DataChannel (280-286)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
  • Config (107-108)
  • Config (117-118)
  • ConnectorIPCOptions (289-293)
airbyte_cdk/test/models/connector_metadata/NormalizationDestinationDefinitionConfig.py (2)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (6)
  • NormalizationDestinationDefinitionConfig (88-103)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryDestinationDefinition.py (1)
  • NormalizationDestinationDefinitionConfig (47-62)
airbyte_cdk/test/models/connector_metadata/ConnectorBreakingChanges.py (1)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (13)
  • DeadlineAction (157-159)
  • StreamBreakingChangeScope (162-171)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
  • Config (107-108)
  • Config (117-118)
  • Config (137-138)
  • BreakingChangeScope (316-320)
  • VersionBreakingChange (362-384)
  • ConnectorBreakingChanges (406-414)
airbyte_cdk/test/models/connector_metadata/RegistryOverrides.py (4)
airbyte_cdk/test/models/connector_metadata/ActorDefinitionResourceRequirements.py (7)
  • Config (13-14)
  • Config (33-34)
  • Config (41-42)
  • ResourceRequirements (12-19)
  • JobType (22-29)
  • JobTypeResourceLimit (32-37)
  • ActorDefinitionResourceRequirements (40-48)
airbyte_cdk/test/models/connector_metadata/NormalizationDestinationDefinitionConfig.py (1)
  • NormalizationDestinationDefinitionConfig (9-24)
airbyte_cdk/test/models/connector_metadata/ResourceRequirements.py (1)
  • ResourceRequirements (11-18)
airbyte_cdk/test/models/connector_metadata/JobType.py (1)
  • JobType (9-16)
airbyte_cdk/test/models/connector_metadata/RemoteRegistries.py (1)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (7)
  • PyPi (204-209)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
  • RemoteRegistries (323-327)
airbyte_cdk/test/models/connector_metadata/ConnectorPackageInfo.py (4)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryDestinationDefinition.py (1)
  • ConnectorPackageInfo (217-218)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryReleases.py (1)
  • ConnectorPackageInfo (199-200)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistrySourceDefinition.py (1)
  • ConnectorPackageInfo (217-218)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryV0.py (1)
  • ConnectorPackageInfo (217-218)
airbyte_cdk/test/models/connector_metadata/TestConnections.py (2)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (6)
  • TestConnections (57-62)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
airbyte_cdk/test/models/connector_metadata/ConnectorTestSuiteOptions.py (1)
  • TestConnections (32-37)
airbyte_cdk/test/models/connector_metadata/ConnectorBuildOptions.py (1)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (7)
  • ConnectorBuildOptions (30-34)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
  • Config (107-108)
airbyte_cdk/test/models/connector_metadata/SourceFileInfo.py (6)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (1)
  • SourceFileInfo (234-239)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryDestinationDefinition.py (1)
  • SourceFileInfo (182-187)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryReleases.py (1)
  • SourceFileInfo (164-169)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistrySourceDefinition.py (1)
  • SourceFileInfo (182-187)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryV0.py (1)
  • SourceFileInfo (182-187)
airbyte_cdk/test/models/connector_metadata/GeneratedFields.py (1)
  • SourceFileInfo (35-40)
airbyte_cdk/test/models/connector_metadata/GitInfo.py (6)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (6)
  • GitInfo (212-231)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryDestinationDefinition.py (1)
  • GitInfo (160-179)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryReleases.py (1)
  • GitInfo (142-161)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistrySourceDefinition.py (1)
  • GitInfo (160-179)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryV0.py (1)
  • GitInfo (160-179)
airbyte_cdk/test/models/connector_metadata/GeneratedFields.py (1)
  • GitInfo (13-32)
airbyte_cdk/test/models/connector_metadata/ActorDefinitionResourceRequirements.py (1)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (12)
  • ResourceRequirements (116-123)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
  • Config (107-108)
  • Config (117-118)
  • Config (137-138)
  • JobType (126-133)
  • JobTypeResourceLimit (308-313)
  • ActorDefinitionResourceRequirements (351-359)
airbyte_cdk/test/models/connector_metadata/ReleaseStage.py (5)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (1)
  • ReleaseStage (65-69)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryDestinationDefinition.py (1)
  • ReleaseStage (14-18)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryReleases.py (1)
  • ReleaseStage (59-63)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistrySourceDefinition.py (1)
  • ReleaseStage (21-25)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryV0.py (1)
  • ReleaseStage (14-18)
airbyte_cdk/test/models/connector_metadata/RolloutConfiguration.py (3)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (6)
  • RolloutConfiguration (136-154)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryReleases.py (1)
  • RolloutConfiguration (14-32)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistrySourceDefinition.py (1)
  • RolloutConfiguration (74-92)
airbyte_cdk/test/models/connector_metadata/ConnectorMetrics.py (2)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (9)
  • ConnectorMetrics (242-245)
  • Usage (248-251)
  • SyncSuccessRate (254-257)
  • ConnectorMetric (260-266)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
airbyte_cdk/test/models/connector_metadata/GeneratedFields.py (4)
  • ConnectorMetrics (43-46)
  • Usage (49-52)
  • SyncSuccessRate (55-58)
  • ConnectorMetric (61-67)
airbyte_cdk/test/models/connector_metadata/GeneratedFields.py (1)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (12)
  • GitInfo (212-231)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
  • SourceFileInfo (234-239)
  • ConnectorMetrics (242-245)
  • Usage (248-251)
  • SyncSuccessRate (254-257)
  • ConnectorMetric (260-266)
  • GeneratedFields (330-334)
airbyte_cdk/test/models/connector_metadata/AirbyteInternal.py (3)
airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (9)
  • Sl (174-178)
  • Ql (181-188)
  • AirbyteInternal (191-201)
  • Config (31-32)
  • Config (45-46)
  • Config (58-59)
  • Config (79-80)
  • Config (89-90)
  • Config (107-108)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryDestinationDefinition.py (3)
  • Sl (130-134)
  • Ql (137-144)
  • AirbyteInternal (147-157)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistrySourceDefinition.py (3)
  • Sl (130-134)
  • Ql (137-144)
  • AirbyteInternal (147-157)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistryReleases.py (11)
airbyte_cdk/test/models/connector_metadata/ConnectorRegistrySourceDefinition.py (28)
  • RolloutConfiguration (74-92)
  • DeadlineAction (95-97)
  • StreamBreakingChangeScope (100-109)
  • SourceType (14-18)
  • ReleaseStage (21-25)
  • SupportLevel (28-31)
  • ResourceRequirements (34-41)
  • JobType (44-51)
  • AllowedHosts (54-61)
  • SuggestedStreams (64-71)
  • Sl (130-134)
  • Ql (137-144)
  • AirbyteInternal (147-157)
  • GitInfo (160-179)
  • SourceFileInfo (182-187)
  • ConnectorMetrics (190-193)
  • ConnectorMetric (208-214)
  • ConnectorPackageInfo (217-218)
  • NormalizationDestinationDefinitionConfig (112-127)
  • BreakingChangeScope (229-233)
  • JobTypeResourceLimit (221-226)
  • GeneratedFields (236-240)
  • VersionBreakingChange (254-276)
  • ActorDefinitionResourceRequirements (243-251)
  • ConnectorBreakingChanges (279-287)
  • ConnectorRegistryReleases (344-354)
  • ConnectorRegistrySourceDefinition (290-341)
  • ConnectorRegistryDestinationDefinition (381-433)
airbyte_cdk/test/models/connector_metadata/RolloutConfiguration.py (1)
  • RolloutConfiguration (11-29)
airbyte_cdk/test/models/connector_metadata/ActorDefinitionResourceRequirements.py (7)
  • Config (13-14)
  • Config (33-34)
  • Config (41-42)
  • ResourceRequirements (12-19)
  • JobType (22-29)
  • JobTypeResourceLimit (32-37)
  • ActorDefinitionResourceRequirements (40-48)
airbyte_cdk/test/models/connector_metadata/AirbyteInternal.py (4)
  • Config (30-31)
  • Sl (12-16)
  • Ql (19-26)
  • AirbyteInternal (29-39)
airbyte_cdk/test/models/connector_metadata/AllowedHosts.py (2)
  • Config (12-13)
  • AllowedHosts (11-18)
airbyte_cdk/test/models/connector_metadata/ConnectorBreakingChanges.py (8)
  • Config (19-20)
  • Config (38-39)
  • Config (63-64)
  • DeadlineAction (13-15)
  • StreamBreakingChangeScope (18-27)
  • BreakingChangeScope (30-34)
  • VersionBreakingChange (37-59)
  • ConnectorBreakingChanges (62-70)
airbyte_cdk/test/models/connector_metadata/SuggestedStreams.py (1)
  • SuggestedStreams (11-18)
airbyte_cdk/test/models/connector_metadata/GeneratedFields.py (5)
  • GitInfo (13-32)
  • SourceFileInfo (35-40)
  • ConnectorMetrics (43-46)
  • ConnectorMetric (61-67)
  • GeneratedFields (70-74)
airbyte_cdk/test/models/connector_metadata/GitInfo.py (1)
  • GitInfo (12-31)
airbyte_cdk/test/models/connector_metadata/ConnectorPackageInfo.py (1)
  • ConnectorPackageInfo (11-12)
airbyte_cdk/test/models/connector_metadata/NormalizationDestinationDefinitionConfig.py (1)
  • NormalizationDestinationDefinitionConfig (9-24)
🪛 markdownlint-cli2 (0.18.1)
airbyte_cdk/test/models/connector_metadata/README.md

8-8: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (15)
airbyte_cdk/test/models/connector_metadata/README.md (2)

1-79: The documentation structure and examples are excellent!

The README provides clear usage examples, describes the model generation workflow, and properly documents the schema source. The validation examples and field access patterns will be helpful for users.


21-21: The import path in the README is correct—no changes needed.

The script confirms the package exists at airbyte_cdk/test/models/connector_metadata/, which matches exactly what the README imports on line 21. The filesystem structure validates the documentation.

The original concern about a path mismatch appears to have been based on PR objectives that differ from the actual implementation. The README accurately reflects the real package structure.

Likely an incorrect or invalid review comment.

airbyte_cdk/test/models/connector_metadata/SecretStore.py (1)

11-21: The model structure looks good!

The use of Extra.forbid is appropriate for security-sensitive secret store configuration, preventing accidental inclusion of unvalidated fields. The Literal["GSM"] type for the secret store ensures type safety.

Quick check: Is the restriction to only GSM (Google Secret Manager) intentional for the current scope? Just want to confirm this isn't limiting support for other secret stores (AWS Secrets Manager, Azure Key Vault, etc.) that might be needed.
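For readers unfamiliar with the pattern, a minimal sketch of such a model under the pydantic.v1 compatibility layer (the alias field is an illustrative assumption, not the exact generated code):

from typing import Literal

from pydantic.v1 import BaseModel, Extra


class SecretStoreSketch(BaseModel):
    class Config:
        extra = Extra.forbid  # unknown keys raise a ValidationError

    type: Literal["GSM"]  # only Google Secret Manager is accepted today
    alias: str  # assumed field: a human-readable name for the store


# A valid configuration passes:
SecretStoreSketch(type="GSM", alias="ci-secrets")
# A typo'd or unsupported key fails fast instead of being silently kept:
# SecretStoreSketch(type="GSM", alias="x", regiom="eu")  -> ValidationError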

airbyte_cdk/test/models/connector_metadata/ConnectorMetrics.py (1)

30-36: Nice design: Union types provide flexibility!

The use of Union[str, Usage] and Union[str, SyncSuccessRate] is a good pattern—it allows both string literals and typed enums, providing backwards compatibility while enabling type-safe code when using the enums directly.
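As a rough sketch of the pattern (enum members here are placeholders, not the real values):

from enum import Enum
from typing import Union

from pydantic.v1 import BaseModel


class Usage(str, Enum):  # placeholder members for illustration
    low = "low"
    high = "high"


class MetricSketch(BaseModel):
    usage: Union[str, Usage]


MetricSketch(usage=Usage.high)  # enum input validates
MetricSketch(usage="moderate")  # arbitrary strings validate too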

airbyte_cdk/test/models/connector_metadata/ConnectorIPCOptions.py (2)

12-20: Good IPC serialization and transport options!

The enum choices make sense:

  • Serialization: JSONL (human-readable), PROTOBUF (efficient), FLATBUFFERS (zero-copy)
  • Transport: STDIO (standard streams), SOCKET (network/IPC sockets)

This provides a solid foundation for connector IPC configuration.


23-36: Strict validation is appropriate for IPC config.

Using Extra.forbid in both DataChannel and ConnectorIPCOptions is the right choice for IPC configuration—it prevents misconfiguration from typos or unsupported fields. Making all DataChannel fields required ensures complete IPC specifications.
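A hedged sketch of what this shape implies (the DataChannel field names are assumptions based on the class names quoted above):

from enum import Enum
from typing import List

from pydantic.v1 import BaseModel, Extra


class SupportedSerializationEnum(str, Enum):
    JSONL = "JSONL"
    PROTOBUF = "PROTOBUF"
    FLATBUFFERS = "FLATBUFFERS"


class SupportedTransportEnum(str, Enum):
    STDIO = "STDIO"
    SOCKET = "SOCKET"


class DataChannelSketch(BaseModel):
    class Config:
        extra = Extra.forbid  # misspelled config keys fail validation

    # All fields required, so an IPC spec can never be half-declared:
    version: str
    supportedSerialization: List[SupportedSerializationEnum]
    supportedTransport: List[SupportedTransportEnum]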

bin/generate_component_manifest_files.py (1)

181-232: Nice refactoring to extract the common generation flow!

The generate_models_from_schemas function nicely consolidates the common logic for generating models from YAML schemas, making it reusable for both declarative components and metadata models. The post_process and metadata_models flags provide clear control over the different post-processing paths.
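A self-contained sketch of the flag dispatch (names are illustrative; the real function drives a Dagger container). Note how the precedence between the two flags becomes visible:

from enum import Enum


class PostProcessing(str, Enum):
    NONE = "none"
    DECLARATIVE = "declarative"  # deprecation handling, parameter fixes
    METADATA = "metadata"  # pydantic.v1 import rewrite


def select_post_processing(post_process: bool, metadata_models: bool) -> PostProcessing:
    """Mirror the if/elif/else dispatch: post_process wins when both flags are set."""
    if post_process:
        return PostProcessing.DECLARATIVE
    elif metadata_models:
        return PostProcessing.METADATA
    return PostProcessing.NONE


assert select_post_processing(True, True) is PostProcessing.DECLARATIVE  # precedence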

airbyte_cdk/test/models/connector_metadata/SuggestedStreams.py (1)

11-18: Model structure looks good!

The SuggestedStreams model is well-defined with a clear description explaining the semantics of missing vs empty arrays. The use of Extra.allow provides forward compatibility for additional fields.

airbyte_cdk/test/models/connector_metadata/ConnectorTestSuiteOptions.py (1)

52-63: ConnectorTestSuiteOptions model structure looks good.

The model appropriately aggregates test suite configuration with required suite field and optional lists of secrets and test connections. The use of Extra.forbid is appropriate for a configuration model.

airbyte_cdk/test/models/connector_metadata/GeneratedFields.py (2)

43-46: Flexible typing with Any fields.

The ConnectorMetrics model uses Any for the all, cloud, and oss fields, which sacrifices type safety for schema flexibility. This is likely intentional to accommodate varying metric structures across environments, but it means invalid data won't be caught at validation time.

If metric schemas are known and stable, consider using more specific types. Otherwise, this trade-off seems reasonable for extensibility.
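To make the trade-off concrete, a tiny sketch (field names taken from the comment above):

from typing import Any, Optional

from pydantic.v1 import BaseModel


class ConnectorMetricsSketch(BaseModel):
    all: Optional[Any] = None
    cloud: Optional[Any] = None
    oss: Optional[Any] = None


# Both of these validate, including the second, likely-malformed one:
ConnectorMetricsSketch(cloud={"usage": "high"})
ConnectorMetricsSketch(cloud=42)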


70-74: GeneratedFields aggregation looks good.

The model appropriately aggregates various generated metadata fields (git info, source file info, metrics, SBOM URL). The structure makes sense for generated/computed connector metadata.

airbyte_cdk/test/models/connector_metadata/Secret.py (1)

11-33: Confirmed: Systematic duplication of models across auto-generated files. Consider centralizing generation configuration.

The duplication you identified is real and extends far beyond SecretStore and Secret: GitInfo appears in 8 files, RolloutConfiguration in 7, and the same pattern repeats across the generated models. This suggests the code generator is inlining these shared schemas rather than generating them once and importing them.

Your suggestion about configuring datamodel-codegen to generate shared models is the right direction. Since these are auto-generated files, the fix should be in the generation configuration or source schema (not by modifying the generated output directly).

Could you clarify: Is the generation configuration set up to inline these common models intentionally, or would this be worth investigating as part of the code generation pipeline improvement?

airbyte_cdk/test/models/connector_metadata/ConnectorBreakingChanges.py (1)

66-66: Regex pattern is intentional, not a limitation.

The regex ^\d+\.\d+\.\d+$ reflects a deliberate versioning strategy rather than an oversight. The project's publish workflow checks version format with the same pattern: versions matching the strict MAJOR.MINOR.PATCH format are marked as stable releases (IS_PRERELEASE=false), while versions with additional identifiers are handled separately as pre-releases. Breaking changes in this schema are scoped to production releases only, not pre-releases. Since the file is auto-generated from a YAML schema (datamodel-codegen), the pattern is maintained by the build process rather than manual editing anyway.

No further action needed here—this is working as designed.

Likely an incorrect or invalid review comment.

airbyte_cdk/test/models/connector_metadata/ConnectorReleases.py (1)

87-91: Confirm semver rule: stable versions only for breakingChanges?

Regex ^\d+\.\d+\.\d+$ excludes pre-releases/build metadata. Is this intentional (i.e., breaking changes only tracked on stable releases), or should RCs be allowed here too, wdyt?
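For reference, the stricter pattern behaves like this:

import re

pattern = re.compile(r"^\d+\.\d+\.\d+$")

assert pattern.match("1.2.3")  # stable release: tracked
assert not pattern.match("1.2.3-rc.1")  # release candidate: excluded
assert not pattern.match("1.2.3+build.5")  # build metadata: excluded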

airbyte_cdk/test/models/connector_metadata/ConnectorMetadataDefinitionV0.py (1)

65-76: Verified: Extensive duplication of ReleaseStage/SupportLevel/JobType enums across the connector_metadata module

Your concern is spot-on. The script confirms these enums are redefined across 8+ files instead of being reused:

  • ReleaseStage: 6 definitions (standalone module + ConnectorMetadataDefinitionV0, ConnectorRegistryV0, ConnectorRegistryDestinationDefinition, ConnectorRegistryReleases, ConnectorRegistrySourceDefinition)
  • SupportLevel: 6 definitions (same pattern)
  • JobType: 8 definitions (including ActorDefinitionResourceRequirements and RegistryOverrides)

The original suggestion to enable datamodel-codegen's --reuse-model flag to reuse models when a module has the model with the same content, or consolidate into a single shared module, remains a solid approach to eliminate drift and type-checking friction.

Consider whether the code generation pipeline should enforce the --reuse-model option or whether refactoring toward explicit imports from the dedicated enum modules (ReleaseStage.py, SupportLevel.py, JobType.py) fits your architecture better—which path appeals to you?
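If the generation-config route is chosen, the invocation could look roughly like this (flags other than --reuse-model are assumptions about the existing pipeline):

import subprocess

subprocess.run(
    [
        "datamodel-codegen",
        "--input", "schemas/",
        "--input-file-type", "jsonschema",
        "--output", "models/",
        "--reuse-model",  # reuse an existing model when a module defines identical content
    ],
    check=True,
)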

Contributor Author

@aaronsteers aaronsteers left a comment

Devin, great, that's much better, I think. Now, let's once again put the module under a 'generated' submodule breadcrumb. And import into airbyte_cdk/test/models/__init__.py:

  • the top level manifest file model (I don't know the name)
  • the test suites definition model

This gives us most of what we would need to import for our use cases. Imports through the "generated" breadcrumb will still be possible but we can reduce surface area to maintain by only importing the models we need to call or instantiate directly.
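The resulting __init__.py re-export surface, sketched from the commit below:

# airbyte_cdk/test/models/connector_metadata/__init__.py (sketch)
from airbyte_cdk.test.models.connector_metadata.generated.ConnectorMetadataDefinitionV0 import (
    ConnectorMetadataDefinitionV0,
)
from airbyte_cdk.test.models.connector_metadata.generated.ConnectorTestSuiteOptions import (
    ConnectorTestSuiteOptions,
)

__all__ = [
    "ConnectorMetadataDefinitionV0",
    "ConnectorTestSuiteOptions",
]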

…mports

- Moved all generated models to connector_metadata/generated/
- Added __init__.py in connector_metadata to export key models
- Updated airbyte_cdk/test/models/__init__.py to import:
  - ConnectorMetadataDefinitionV0 (top-level manifest model)
  - ConnectorTestSuiteOptions (test suites definition model)
- Updated README with new import paths and examples
- Updated build script to output to new location

This reduces the API surface area while keeping all models accessible
through the generated submodule.

Co-Authored-By: AJ Steers <aj@airbyte.io>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (4)
bin/generate_component_manifest_files.py (3)

17-18: Master branch dependency affects build reproducibility.

These URLs point to the master branch, meaning every build fetches the latest schemas. If upstream introduces breaking changes, CDK builds will fail unpredictably.

This concern was previously raised. The recommendation was to pin to a specific commit SHA or tag for reproducible builds.


41-77: Multiple reliability concerns remain unaddressed.

Previous reviews identified several critical issues:

  1. Misleading error message (lines 55-59): Claims "Using cached schemas if available" but immediately re-raises, causing build failure with no actual fallback
  2. No GitHub authentication: Unauthenticated requests hit 60/hour rate limit vs. 5000/hour with auth
  3. No retry logic: Transient network errors cause immediate build failure
  4. No fallback mechanism: If GitHub is unavailable, builds always fail

These concerns were raised in earlier reviews and would significantly impact build reliability.


246-257: Missing error handling for schema download.

If download_metadata_schemas fails at line 249 (due to rate limits, network issues, etc.), users will see a cryptic stack trace without actionable guidance.

A previous review suggested wrapping this call in try/except to catch specific exceptions and provide helpful error messages (e.g., "Set GITHUB_TOKEN environment variable to avoid rate limits" for 403 errors, "Check network connectivity" for network errors).

airbyte_cdk/test/models/connector_metadata/README.md (1)

8-10: Add language specifier to fenced code block.

The code block showing the repository path should specify a language (e.g., text) for proper syntax highlighting and markdown lint compliance.

This was flagged in a previous review.

🧹 Nitpick comments (1)
bin/generate_component_manifest_files.py (1)

224-231: Clarify mutually exclusive flags, wdyt?

The if/elif/else structure means post_process and metadata_models are mutually exclusive—if both are True, only post-processing runs and metadata post-processing is skipped.

Is this intentional? If so, consider adding a docstring note or an assertion at the function start to make the constraint explicit, or using an enum parameter instead of two booleans.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 63930c6 and 62902f6.

⛔ Files ignored due to path filters (31)
  • airbyte_cdk/test/models/connector_metadata/generated/ActorDefinitionResourceRequirements.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/AirbyteInternal.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/AllowedHosts.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ConnectorBreakingChanges.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ConnectorBuildOptions.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ConnectorIPCOptions.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ConnectorMetadataDefinitionV0.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ConnectorMetrics.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ConnectorPackageInfo.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ConnectorRegistryDestinationDefinition.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ConnectorRegistryReleases.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ConnectorRegistrySourceDefinition.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ConnectorRegistryV0.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ConnectorReleases.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ConnectorTestSuiteOptions.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/GeneratedFields.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/GitInfo.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/JobType.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/NormalizationDestinationDefinitionConfig.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/RegistryOverrides.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ReleaseStage.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/RemoteRegistries.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/ResourceRequirements.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/RolloutConfiguration.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/Secret.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/SecretStore.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/SourceFileInfo.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/SuggestedStreams.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/SupportLevel.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/TestConnections.py is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/__init__.py is excluded by !**/generated/**
📒 Files selected for processing (4)
  • airbyte_cdk/test/models/__init__.py (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/README.md (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/__init__.py (1 hunks)
  • bin/generate_component_manifest_files.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
airbyte_cdk/test/models/__init__.py (2)
airbyte_cdk/test/models/connector_metadata/generated/ConnectorMetadataDefinitionV0.py (2)
  • ConnectorMetadataDefinitionV0 (490-495)
  • ConnectorTestSuiteOptions (337-348)
airbyte_cdk/test/models/connector_metadata/generated/ConnectorTestSuiteOptions.py (1)
  • ConnectorTestSuiteOptions (52-63)
airbyte_cdk/test/models/connector_metadata/__init__.py (2)
airbyte_cdk/test/models/connector_metadata/generated/ConnectorMetadataDefinitionV0.py (2)
  • ConnectorMetadataDefinitionV0 (490-495)
  • ConnectorTestSuiteOptions (337-348)
airbyte_cdk/test/models/connector_metadata/generated/ConnectorTestSuiteOptions.py (1)
  • ConnectorTestSuiteOptions (52-63)
🪛 markdownlint-cli2 (0.18.1)
airbyte_cdk/test/models/connector_metadata/README.md

8-8: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (7)
bin/generate_component_manifest_files.py (3)

30-32: Good refactoring for reusability!

Parameterizing the directory path makes this function reusable for both local and downloaded schemas.


34-38: LGTM!

Accepting yaml_files as a parameter improves testability and makes the function more flexible for different use cases.


160-179: LGTM!

The post-processing logic correctly applies the pydantic.v1 compatibility layer to generated metadata models, consistent with the approach used for declarative component models.
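A sketch of the string transformation this amounts to (the real step runs inside the codegen container; the function name is illustrative, and the naive replace assumes freshly generated files that have not already been rewritten):

def apply_pydantic_v1_compat(source: str) -> str:
    """Rewrite pydantic imports so generated models target the v1 shim."""
    return source.replace("from pydantic", "from pydantic.v1")


assert (
    apply_pydantic_v1_compat("from pydantic import BaseModel, Extra")
    == "from pydantic.v1 import BaseModel, Extra"
)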

airbyte_cdk/test/models/__init__.py (2)

4-7: LGTM!

The imports correctly expose the new metadata models from the connector_metadata package.


11-16: LGTM!

The __all__ list is correctly updated to include the new metadata models, maintaining alphabetical ordering.

airbyte_cdk/test/models/connector_metadata/README.md (1)

1-92: Excellent documentation!

The README is comprehensive and well-structured, with clear usage examples covering validation, field access, and model imports. The examples accurately reflect the package structure.

airbyte_cdk/test/models/connector_metadata/__init__.py (1)

1-7: LGTM!

The package initialization correctly re-exports the primary metadata models with proper __all__ definition, making them accessible via the connector_metadata namespace.

devin-ai-integration bot and others added 2 commits October 21, 2025 01:37
…tput

- Modified build script to generate all models into a single models.py file
- Added consolidated JSON schema generation (metadata_schema.json)
- Updated imports to reference generated.models module
- Removed 30+ individual model files in favor of single-file approach
- Updated README to document new structure and outputs
- Added GitHub token support to avoid rate limiting

Co-Authored-By: AJ Steers <aj@airbyte.io>
Co-Authored-By: AJ Steers <aj@airbyte.io>
- Moved detailed documentation from README to CONTRIBUTING.md
- Simplified README to just show usage and link to CONTRIBUTING.md
- Added docstring to __init__.py pointing to CONTRIBUTING.md
- Removed py.typed from connector_metadata (already exists at root)

Co-Authored-By: AJ Steers <aj@airbyte.io>
@devin-ai-integration
Contributor

Done! I've:

  1. ✅ Moved the detailed documentation to docs/CONTRIBUTING.md in a new section "Regenerating Connector Metadata Models"
  2. ✅ Simplified the README to just show basic usage and link to CONTRIBUTING.md
  3. ✅ Added a docstring to __init__.py that points to the CONTRIBUTING.md section
  4. ✅ Removed py.typed from connector_metadata/ since there's already one at the root (airbyte_cdk/py.typed)

The documentation is now consolidated in CONTRIBUTING.md alongside the existing "Auto-Generating the Declarative Schema File" section.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 933d478 and c89faab.

📒 Files selected for processing (3)
  • airbyte_cdk/test/models/connector_metadata/README.md (1 hunks)
  • airbyte_cdk/test/models/connector_metadata/__init__.py (1 hunks)
  • docs/CONTRIBUTING.md (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • docs/CONTRIBUTING.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/test/models/connector_metadata/README.md
🧰 Additional context used
🪛 GitHub Actions: Generate Docs
airbyte_cdk/test/models/connector_metadata/__init__.py

[error] 8-8: ModuleNotFoundError: No module named 'airbyte_cdk.test.models.connector_metadata.generated.models' during docs generation (caused by importing ConnectorMetadataDefinitionV0, ConnectorTestSuiteOptions in __init__.py).

🪛 GitHub Actions: Pytest (Fast)
airbyte_cdk/test/models/connector_metadata/__init__.py

[error] 1-1: ModuleNotFoundError: No module named 'airbyte_cdk.test.models.connector_metadata.generated.models' during test collection. Ensure the generated models module exists or is generated before tests.


[error] 1-1: Import cascading failure from test module import caused pytest collection to fail.

🪛 GitHub Actions: PyTest Matrix
airbyte_cdk/test/models/connector_metadata/__init__.py

[error] 1-1: ModuleNotFoundError: No module named 'airbyte_cdk.test.models.connector_metadata.generated.models'. This prevents tests from importing ConnectorMetadata definitions during collection.

🪛 GitHub Actions: Test Connectors
airbyte_cdk/test/models/connector_metadata/__init__.py

[error] 8-8: ModuleNotFoundError: No module named 'airbyte_cdk.test.models.connector_metadata.generated.models'

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Manifest Server Docker Image Build
🔇 Additional comments (2)
airbyte_cdk/test/models/connector_metadata/__init__.py (2)

1-6: LGTM!

The docstring clearly documents that models are auto-generated and provides a helpful link to regeneration instructions. Nice documentation!


10-13: LGTM!

The __all__ definition correctly exposes the public API. Once the import timing issue on line 8 is resolved, this will work as intended.

aaronsteers and others added 6 commits October 27, 2025 13:07
- Switch from downloading individual YAML files via GitHub API to using sparse git clone
- Eliminates rate limiting issues (60 req/hour -> no API calls)
- Fix single-file model generation to properly filter out relative imports
- Add multi-line import block detection and filtering
- Generate and commit metadata models and consolidated JSON schema artifacts

Co-Authored-By: AJ Steers <aj@airbyte.io>
…schema.py

Co-Authored-By: AJ Steers <aj@airbyte.io>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
bin/generate_component_manifest_files.py (3)

40-73: Consider pinning to a stable ref for reproducible builds, wdyt?

The git clone approach nicely avoids GitHub API rate limits from the previous HTTP-based implementation. However, cloning without a ref pin will fetch the default branch (likely main or master), which means builds could break if upstream introduces schema changes or becomes temporarily unstable.

Consider adding a ref parameter (defaulting to a known-good commit SHA or tag):

-def clone_metadata_schemas(temp_dir: Path) -> Path:
+def clone_metadata_schemas(temp_dir: Path, ref: str = "a1b2c3d4e5f6") -> Path:
     """Clone metadata schema YAML files from GitHub using sparse checkout."""
     repo_url = "https://github.com/airbytehq/airbyte.git"
     schema_path = "airbyte-ci/connectors/metadata_service/lib/metadata_service/models/src"
     
     clone_dir = temp_dir / "airbyte"
     
     print("Cloning metadata schemas from airbyte repo...", file=sys.stderr)
     
     subprocess.run(
         [
             "git",
             "clone",
             "--depth",
             "1",
             "--filter=blob:none",
             "--sparse",
+            "--branch",
+            ref,
             repo_url,
             str(clone_dir),
         ],
         check=True,
         capture_output=True,
     )

Also note: this requires git to be available in the build environment.


176-226: Should post_process and metadata_models be mutually exclusive, wdyt?

The branching logic at lines 219-226 gives post_process precedence over metadata_models if both are True, but it's not immediately clear whether this is intentional or just defensive coding.

Consider making the intent explicit:

-    if post_process:
+    if post_process and metadata_models:
+        raise ValueError("post_process and metadata_models are mutually exclusive")
+    elif post_process:
         codegen_container = await post_process_codegen(codegen_container)
         await codegen_container.directory("/generated_post_processed").export(output_dir_path)
     elif metadata_models:

Or if the precedence is intentional, a docstring comment would help clarify the behavior.


399-418: Consider adding error handling for common clone failures, wdyt?

The clone_metadata_schemas call can fail in several ways (missing git, network issues, repository access problems), and users would benefit from actionable error messages.

Consider wrapping the metadata generation phase:

         print("\nGenerating metadata models...", file=sys.stderr)
         with tempfile.TemporaryDirectory() as temp_dir:
             temp_path = Path(temp_dir)
-            schemas_dir = clone_metadata_schemas(temp_path)
+            try:
+                schemas_dir = clone_metadata_schemas(temp_path)
+            except subprocess.CalledProcessError as e:
+                print(f"\nError: Failed to clone metadata schemas from GitHub.", file=sys.stderr)
+                print(f"Details: {e}", file=sys.stderr)
+                print("Tip: Ensure git is installed and you have network connectivity.", file=sys.stderr)
+                raise
+            except Exception as e:
+                print(f"\nError: Failed to clone metadata schemas: {e}", file=sys.stderr)
+                raise

This provides clearer guidance when builds fail.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 07d7014 and c63223a.

⛔ Files ignored due to path filters (2)
  • airbyte_cdk/test/models/connector_metadata/generated/metadata_schema.json is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/models.py is excluded by !**/generated/**
📒 Files selected for processing (2)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (13 hunks)
  • bin/generate_component_manifest_files.py (3 hunks)
✅ Files skipped from review due to trivial changes (1)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Manifest Server Docker Image Build
🔇 Additional comments (6)
bin/generate_component_manifest_files.py (6)

3-18: LGTM on the new imports and constant!

The additions support the new metadata models generation flow—subprocess for git operations, json/yaml for schema processing, and tempfile for temporary directories. All are well-chosen for the task.


29-30: Nice abstraction!

Parameterizing the directory lookup is a clean way to support both declarative and metadata schemas.


33-37: Good refactor for flexibility!

Passing in yaml_files rather than hardcoding the discovery makes this function more reusable across different schema sets.


157-173: Clean post-processing implementation!

The metadata models don't need the deprecation handling or parameter name fixes, so this simplified version is appropriate.


229-264: Schema consolidation looks good!

The approach of treating ConnectorMetadataDefinitionV0 as the root and other schemas as definitions is sensible, and the fallback for missing main schema is a nice safety net.


267-384: The generated file passes syntax validation, though full import verification is limited by the sandbox environment.

Good news: the syntax check passed successfully, confirming the consolidated Python file is well-formed. However, the full import test couldn't complete in this environment due to missing dependencies (dunamai), which is expected in the sandbox and not a reflection on your generated code.

The original concerns still hold merit though: the import consolidation logic is complex and depends on datamodel-codegen's output format staying stable. The multi-line import detection (checking for commas/parens on lines 334-335) should handle most cases, but if the tool's output format changes, the parsing logic may need adjustments.

Would it make sense to add a brief comment in the code noting this dependency on datamodel-codegen's format conventions, wdyt? And consider running a manual import test in your local environment to verify the full pipeline works end-to-end with your actual schemas?
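For illustration, a minimal version of such relative-import filtering might look like this (helper name and exact heuristics are assumptions; the real implementation lives in the build script):

def strip_relative_imports(source: str) -> str:
    """Drop 'from .X import ...' lines, including multi-line parenthesized blocks."""
    kept = []
    in_block = False
    for line in source.splitlines():
        stripped = line.strip()
        if in_block:
            if stripped.endswith(")"):  # end of a multi-line import block
                in_block = False
            continue
        if stripped.startswith("from ."):
            if "(" in stripped and ")" not in stripped:
                in_block = True  # continuation lines follow
            continue
        kept.append(line)
    return "\n".join(kept)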

devin-ai-integration bot and others added 3 commits October 27, 2025 20:29
- Exclude declarative_component_schema.py from formatting/linting
- Exclude connector_metadata generated models from formatting/linting
- Prevents accidental reformatting of auto-generated code

Co-Authored-By: AJ Steers <aj@airbyte.io>
- Remove $id fields from definitions to ensure refs resolve to root
- Fix .yaml refs to use #/definitions/ format
- Replace custom type references with $ref
- Skip invalid 'type: const' entries
- Merge nested definitions into top-level definitions

Co-Authored-By: AJ Steers <aj@airbyte.io>
- IDE validators expect definitions to not have $schema or $id fields
- These fields should only exist at the root level
- Fixes validation errors in IDEs when using the consolidated schema

Co-Authored-By: AJ Steers <aj@airbyte.io>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
mypy.ini (1)

24-26: Plugins configured under [tool.mypy] won’t load from mypy.ini; move them under [mypy], wdyt?

In mypy.ini, the plugins line should live in the [mypy] section. Otherwise pydantic and pytest-mypy-plugins won’t be active.

-[tool.mypy]
-plugins = ["pydantic.mypy", "pytest-mypy-plugins"]
+; Load plugins from the main mypy section
+[mypy]
+plugins = pydantic.mypy, pytest-mypy-plugins

Also, since you’re excluding generated code already, this won’t impact those paths. Want me to open a follow-up to tighten ignores after verifying plugin behavior? Wdyt?

♻️ Duplicate comments (1)
bin/generate_component_manifest_files.py (1)

40-74: Pin the schema ref for reproducible builds and allow override via env, wdyt?

Cloning the default branch makes builds non-deterministic. Could we pin to a commit/tag and support an env override (AIRBYTE_METADATA_SCHEMAS_REF) while keeping sparse checkout? Example:

+import os
@@
-def clone_metadata_schemas(temp_dir: Path) -> Path:
+def clone_metadata_schemas(temp_dir: Path) -> Path:
     """Clone metadata schema YAML files from GitHub using sparse checkout."""
-    repo_url = "https://github.com/airbytehq/airbyte.git"
+    repo_url = "https://github.com/airbytehq/airbyte.git"
+    # Pin to a specific ref (commit SHA or tag) for reproducibility; allow override
+    ref = os.environ.get("AIRBYTE_METADATA_SCHEMAS_REF", "master")
@@
-    subprocess.run(
-        [
-            "git",
-            "clone",
-            "--depth",
-            "1",
-            "--filter=blob:none",
-            "--sparse",
-            repo_url,
-            str(clone_dir),
-        ],
-        check=True,
-        capture_output=True,
-    )
+    subprocess.run(
+        [
+            "git",
+            "clone",
+            "--depth",
+            "1",
+            "--filter=blob:none",
+            "--sparse",
+            "--branch",
+            ref,
+            repo_url,
+            str(clone_dir),
+        ],
+        check=True,
+        capture_output=True,
+        text=True,
+    )
@@
-    subprocess.run(
-        ["git", "-C", str(clone_dir), "sparse-checkout", "set", schema_path],
-        check=True,
-        capture_output=True,
-    )
+    subprocess.run(
+        ["git", "-C", str(clone_dir), "sparse-checkout", "set", schema_path],
+        check=True,
+        capture_output=True,
+        text=True,
+    )

Optionally, would you want to support authenticated clones if GITHUB_TOKEN is set (to avoid anonymous fetch limits), e.g., by rewriting repo_url at runtime? Wdyt?

🧹 Nitpick comments (5)
bin/generate_component_manifest_files.py (4)

192-196: Include nested YAML files to avoid missing schemas, wdyt?

If schemas are nested in subfolders, the current include pattern will skip them. Shall we switch to a recursive glob?

-        .with_mounted_directory(
-            "/yaml", dagger_client.host().directory(yaml_dir_path, include=["*.yaml"])
-        )
+        .with_mounted_directory(
+            "/yaml", dagger_client.host().directory(yaml_dir_path, include=["**/*.yaml"])
+        )

Could you confirm whether upstream metadata schemas ever appear in subdirectories?


317-319: Match the recursive YAML include here too for consistency, wdyt?

-        .with_mounted_directory(
-            "/yaml", dagger_client.host().directory(yaml_dir_path, include=["*.yaml"])
-        )
+        .with_mounted_directory(
+            "/yaml", dagger_client.host().directory(yaml_dir_path, include=["**/*.yaml"])
+        )

18-18: Is the metadata output path under test/ intentional, or should this live under the new runtime package, wdyt?

Currently set to airbyte_cdk/test/models/connector_metadata/generated. If the intent is to ship models in airbyte_cdk.metadata_models, should the generator write there instead (and keep tests importing from that package)? Or is this path just for generation/testing? Can we clarify in README and possibly make it configurable?


49-62: Minor: avoid swallowing useful git stderr on failure, wdyt?

You capture output but never surface it unless an exception is caught upstream. If we keep capture_output=True, shall we print e.stderr in the except path (as in the earlier suggestion) so users see the git error?

mypy.ini (1)

15-16: Exclude regex looks fine; consider anchoring and normalizing separators, wdyt?

Optional nit: to reduce accidental matches, you could anchor and allow either slash:

-exclude = (unit_tests/|airbyte_cdk/sources/declarative/models/declarative_component_schema\.py|airbyte_cdk/test/models/connector_metadata/generated/)
+exclude = (^unit_tests/|^airbyte_cdk[\\/]sources[\\/]declarative[\\/]models[\\/]declarative_component_schema\.py|^airbyte_cdk[\\/]test[\\/]models[\\/]connector_metadata[\\/]generated/)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5373480 and 015a60e.

⛔ Files ignored due to path filters (2)
  • airbyte_cdk/test/models/connector_metadata/generated/metadata_schema.json is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/models.py is excluded by !**/generated/**
📒 Files selected for processing (3)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (13 hunks)
  • bin/generate_component_manifest_files.py (3 hunks)
  • mypy.ini (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2024-12-11T16:34:46.319Z
Learnt from: pnilan
PR: airbytehq/airbyte-python-cdk#0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, the `declarative_component_schema.py` file is auto-generated from `declarative_component_schema.yaml` and should be ignored in the recommended reviewing order.

Applied to files:

  • mypy.ini
🪛 GitHub Actions: Linters
bin/generate_component_manifest_files.py

[error] 237-285: Ruff formatting check produced diffs. 1 file would be reformatted. Exit code 1.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Manifest Server Docker Image Build
🔇 Additional comments (1)
bin/generate_component_manifest_files.py (1)

426-461: Ruff formatting verified and applied successfully.

The verification confirms that ruff found and fixed formatting issues in the file. The output shows "1 file reformatted" and "All checks passed!" — the formatting concern you raised has been addressed and the code now meets all style requirements.

The file is ready to commit (the git commit command failed only due to a sandbox environment limitation, not a code issue). Are you good to push this update, wdyt?

Comment on lines 239 to 304
all_schema_names = set(schemas.keys())

for schema_content in schemas.values():
    if isinstance(schema_content, dict) and "definitions" in schema_content:
        all_schema_names.update(schema_content["definitions"].keys())

def fix_refs(obj, in_definition=False):
    """Recursively fix $ref and type references in schema objects."""
    if isinstance(obj, dict):
        new_obj = {}
        for key, value in obj.items():
            if (key == "$id" or key == "$schema") and in_definition:
                continue
            elif key == "$ref" and isinstance(value, str):
                if value.endswith(".yaml"):
                    schema_name = value.replace(".yaml", "")
                    new_obj[key] = f"#/definitions/{schema_name}"
                else:
                    new_obj[key] = value
            elif key == "type" and isinstance(value, str) and value in all_schema_names:
                new_obj["$ref"] = f"#/definitions/{value}"
            elif key == "type" and value == "const":
                pass
            else:
                new_obj[key] = fix_refs(value, in_definition=in_definition)
        return new_obj
    elif isinstance(obj, list):
        return [fix_refs(item, in_definition=in_definition) for item in obj]
    else:
        return obj

# Find the main schema (ConnectorMetadataDefinitionV0)
main_schema = schemas.get("ConnectorMetadataDefinitionV0")

if main_schema:
    # Create a consolidated schema with definitions
    consolidated = {
        "$schema": main_schema.get("$schema", "http://json-schema.org/draft-07/schema#"),
        "title": "Connector Metadata Schema",
        "description": "Consolidated JSON schema for Airbyte connector metadata validation",
        **main_schema,
        "definitions": {},
    }

    # Add all schemas (including their internal definitions) as top-level definitions
    for schema_name, schema_content in schemas.items():
        if schema_name != "ConnectorMetadataDefinitionV0":
            if isinstance(schema_content, dict) and "definitions" in schema_content:
                for def_name, def_content in schema_content["definitions"].items():
                    consolidated["definitions"][def_name] = fix_refs(def_content, in_definition=True)
                schema_without_defs = {k: v for k, v in schema_content.items() if k != "definitions"}
                consolidated["definitions"][schema_name] = fix_refs(schema_without_defs, in_definition=True)
            else:
                consolidated["definitions"][schema_name] = fix_refs(schema_content, in_definition=True)

    consolidated = fix_refs(consolidated, in_definition=False)

    Path(output_json_path).write_text(json.dumps(consolidated, indent=2))
    print(f"Generated consolidated JSON schema: {output_json_path}", file=sys.stderr)
else:
    print(
        "Warning: ConnectorMetadataDefinitionV0 not found, generating simple consolidation",
        file=sys.stderr,
    )
    Path(output_json_path).write_text(json.dumps(schemas, indent=2))

Contributor
⚠️ Potential issue | 🔴 Critical

Fix consolidation correctness: preserve main definitions, avoid clobbering JSON primitives, and support .yaml refs with fragments, wdyt?

Two issues:

  • Main schema’s internal definitions are overwritten, then never re-added.
  • Converting any type that matches a schema name into a $ref can break primitives like object/string.

Patch proposal:

@@
-    all_schema_names = set(schemas.keys())
-    
+    all_schema_names = set(schemas.keys())
+    json_primitives = {"string", "number", "integer", "boolean", "object", "array", "null"}
@@
-    def fix_refs(obj, in_definition=False):
+    def fix_refs(obj, in_definition=False):
         """Recursively fix $ref and type references in schema objects."""
         if isinstance(obj, dict):
             new_obj = {}
             for key, value in obj.items():
                 if (key == "$id" or key == "$schema") and in_definition:
                     continue
                 elif key == "$ref" and isinstance(value, str):
-                    if value.endswith(".yaml"):
-                        schema_name = value.replace(".yaml", "")
-                        new_obj[key] = f"#/definitions/{schema_name}"
+                    # Map file refs like Foo.yaml or Foo.yaml#/Bar to #/definitions/Foo[#/Bar]
+                    m = re.match(r"(?:.*/)?(?P<name>[^/#]+)\.yaml(?P<frag>#.*)?$", value)
+                    if m:
+                        schema_name = m.group("name")
+                        frag = m.group("frag") or ""
+                        new_obj[key] = f"#/definitions/{schema_name}{frag}"
                     else:
                         new_obj[key] = value
-                elif key == "type" and isinstance(value, str) and value in all_schema_names:
-                    new_obj["$ref"] = f"#/definitions/{value}"
+                elif key == "type" and isinstance(value, str):
+                    # Only rewrite to $ref if it's not a JSON primitive
+                    if value in all_schema_names and value not in json_primitives:
+                        new_obj["$ref"] = f"#/definitions/{value}"
+                    else:
+                        new_obj[key] = value
                 elif key == "type" and value == "const":
                     pass
                 else:
                     new_obj[key] = fix_refs(value, in_definition=in_definition)
             return new_obj
@@
-    if main_schema:
+    if main_schema:
         # Create a consolidated schema with definitions
-        consolidated = {
-            "$schema": main_schema.get("$schema", "http://json-schema.org/draft-07/schema#"),
-            "title": "Connector Metadata Schema",
-            "description": "Consolidated JSON schema for Airbyte connector metadata validation",
-            **main_schema,
-            "definitions": {},
-        }
+        consolidated = dict(main_schema)  # shallow copy
+        consolidated.setdefault("$schema", "http://json-schema.org/draft-07/schema#")
+        consolidated.setdefault("title", "Connector Metadata Schema")
+        consolidated.setdefault("description", "Consolidated JSON schema for Airbyte connector metadata validation")
+        # Preserve existing main-schema definitions
+        consolidated_definitions = dict(consolidated.get("definitions", {}))
@@
-        for schema_name, schema_content in schemas.items():
-            if schema_name != "ConnectorMetadataDefinitionV0":
+        for schema_name, schema_content in schemas.items():
+            if schema_name != "ConnectorMetadataDefinitionV0":
                 if isinstance(schema_content, dict) and "definitions" in schema_content:
                     for def_name, def_content in schema_content["definitions"].items():
-                        consolidated["definitions"][def_name] = fix_refs(def_content, in_definition=True)
+                        consolidated_definitions[def_name] = fix_refs(def_content, in_definition=True)
                     schema_without_defs = {k: v for k, v in schema_content.items() if k != "definitions"}
-                    consolidated["definitions"][schema_name] = fix_refs(schema_without_defs, in_definition=True)
+                    consolidated_definitions[schema_name] = fix_refs(schema_without_defs, in_definition=True)
                 else:
-                    consolidated["definitions"][schema_name] = fix_refs(schema_content, in_definition=True)
+                    consolidated_definitions[schema_name] = fix_refs(schema_content, in_definition=True)
+
+        consolidated["definitions"] = consolidated_definitions
@@
         consolidated = fix_refs(consolidated, in_definition=False)

Would you like me to add a small sanity check that ensures $ref targets exist in definitions and warns if any are missing, to aid debugging? Wdyt?
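For reference, such a check might look like the sketch below (warn_on_missing_refs is a hypothetical helper, assuming the consolidated dict shape above):

import sys


def warn_on_missing_refs(consolidated: dict) -> None:
    """Warn about $ref targets that are missing from the consolidated definitions."""
    definitions = consolidated.get("definitions", {})

    def walk(obj) -> None:
        if isinstance(obj, dict):
            ref = obj.get("$ref")
            if isinstance(ref, str) and ref.startswith("#/definitions/"):
                # Compare only the top-level definition name, ignoring nested pointers
                target = ref.removeprefix("#/definitions/").split("/", 1)[0]
                if target not in definitions:
                    print(f"Warning: unresolved $ref target: {ref}", file=sys.stderr)
            for value in obj.values():
                walk(value)
        elif isinstance(obj, list):
            for item in obj:
                walk(item)

    walk(consolidated)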

🤖 Prompt for AI Agents
In bin/generate_component_manifest_files.py around lines 239-304, the
consolidation currently overwrites the main schema's internal definitions and
naively converts any "type" string matching a schema name into a $ref (breaking
primitives), and it only handles .yaml refs without fragments. Preserve the main
schema's original "definitions" by merging them into consolidated["definitions"]
before adding other schemas; when converting "type" to "$ref", only do so if the
type is not a JSON primitive (object, array, string, number, integer, boolean,
null) and the target definition actually exists; support .yaml refs with
fragments by parsing values like "other.yaml#/definitions/X" into
"#/definitions/X" (and strip just ".yaml" when there is no fragment); and add an
optional sanity check that warns if a $ref target is missing from
consolidated["definitions"] to aid debugging.

Comment on lines 439 to 457
with tempfile.TemporaryDirectory() as temp_dir:
    temp_path = Path(temp_dir)
    schemas_dir = clone_metadata_schemas(temp_path)

    output_dir = Path(LOCAL_METADATA_OUTPUT_DIR_PATH)
    output_dir.mkdir(parents=True, exist_ok=True)

    print("Generating single Python file with all models...", file=sys.stderr)
    output_file = str(output_dir / "models.py")
    await generate_metadata_models_single_file(
        dagger_client=dagger_client,
        yaml_dir_path=str(schemas_dir),
        output_file_path=output_file,
    )

    print("Generating consolidated JSON schema...", file=sys.stderr)
    json_schema_file = str(output_dir / "metadata_schema.json")
    consolidate_yaml_schemas_to_json(schemas_dir, json_schema_file)

Contributor

⚠️ Potential issue | 🟠 Major

Wrap clone and codegen in helpful error handling with clear remediation tips, wdyt?

A failed git clone/sparse-checkout currently raises without guidance. Could we catch errors and print actionable hints (set AIRBYTE_METADATA_SCHEMAS_REF, check network, optional GITHUB_TOKEN), then re-raise?

         print("\nGenerating metadata models...", file=sys.stderr)
         with tempfile.TemporaryDirectory() as temp_dir:
             temp_path = Path(temp_dir)
-            schemas_dir = clone_metadata_schemas(temp_path)
+            try:
+                schemas_dir = clone_metadata_schemas(temp_path)
+            except subprocess.CalledProcessError as e:
+                print("\nError: Failed to clone metadata schemas.", file=sys.stderr)
+                if e.stderr:
+                    print(e.stderr, file=sys.stderr)
+                print("Tips:", file=sys.stderr)
+                print(" - Ensure network access from CI.", file=sys.stderr)
+                print(" - Pin a stable ref via AIRBYTE_METADATA_SCHEMAS_REF=<commit|tag>.", file=sys.stderr)
+                print(" - Optionally use GITHUB_TOKEN for higher reliability.", file=sys.stderr)
+                raise
🤖 Prompt for AI Agents
In bin/generate_component_manifest_files.py around lines 439 to 457, wrap the
clone_metadata_schemas call and the subsequent codegen steps
(generate_metadata_models_single_file and consolidate_yaml_schemas_to_json) in a
try/except block that catches Exception, logs a clear, actionable error message
that suggests checking/setting AIRBYTE_METADATA_SCHEMAS_REF, verifying network
access, and optionally providing GITHUB_TOKEN for authenticated checkout, then
re-raises the exception; ensure the TemporaryDirectory context is preserved so
cleanup still occurs and keep the original exception chained when re-raising for
debugging.

- Created new bin/generate_connector_metadata_files.py for metadata models
- Reverted bin/generate_component_manifest_files.py to main (no changes)
- Updated build script to call both generators
- Keeps concerns separated for easier review and maintenance

Co-Authored-By: AJ Steers <aj@airbyte.io>
set -e

python bin/generate_component_manifest_files.py
python bin/generate_connector_metadata_files.py
Contributor Author

Devin, let's revert this and move it upstream - maybe into the poe task?

- Added assemble-declarative and assemble-metadata poe tasks
- Reverted shell script to main (no changes)
- Updated assemble task to call both generators
- Addresses PR feedback to move script call upstream

Co-Authored-By: AJ Steers <aj@airbyte.io>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (5)
bin/generate_connector_metadata_files.py (4)

120-132: Import bucketing misses typing_extensions.

Generated code often uses typing_extensions (e.g., Literal, TypedDict). Those imports currently fall through into the class-content bucket. Should we treat them as third-party imports to keep imports together? wdyt?

Apply this diff:

-                    elif line.startswith("from pydantic") or line.startswith("import "):
+                    elif (
+                        line.startswith("from pydantic")
+                        or line.startswith("from typing_extensions")
+                        or line.startswith("import ")
+                    ):
                         third_party_imports.add(line)

156-159: Also rewrite “import pydantic” to v1 for completeness.

If any generated file uses “import pydantic”, it won’t be redirected to pydantic.v1. Shall we handle both forms with a regex? wdyt?

Apply this diff (and add the import):

@@
-import json
+import json
+import re
@@
-    post_processed_content = final_content.replace("from pydantic", "from pydantic.v1")
+    post_processed_content = re.sub(r"\bfrom pydantic\b", "from pydantic.v1", final_content)
+    post_processed_content = re.sub(
+        r"(?m)^\s*import pydantic\b", "import pydantic.v1 as pydantic", post_processed_content
+    )

85-99: Optionally pin target Python for codegen.

Since we run on python:3.10, would adding “--target-python-version 3.10” reduce surprises from default inference? wdyt?

Apply this diff:

             "datamodel-codegen",
             "--input",
             "/yaml",
             "--output",
             "/generated_temp",
+            "--target-python-version",
+            "3.10",

33-65: Support local schema overrides and ref pinning to improve CI determinism and resilience.

Currently, the schema clone always fetches from the airbyte monorepo at depth 1 without any control over the git ref, which can be unpredictable in CI and vulnerable to GitHub rate limits. Adding environment variable support would help:

  1. Local override (AIRBYTE_SCHEMAS_DIR): Useful for testing and offline builds
  2. Ref pinning (AIRBYTE_SCHEMAS_REF): Enables reproducible builds against specific commits/branches
  3. Future token support: Would mitigate 403 rate-limit errors from GitHub API

Here's a minimal path forward—would adding these env var checks help your CI workflow, wdyt?

+import os
 
 def clone_schemas_from_github(temp_dir: Path) -> Path:
     """Clone metadata schema YAML files from GitHub using sparse checkout."""
     clone_dir = temp_dir / "airbyte"
+    local_override = Path(os.getenv("AIRBYTE_SCHEMAS_DIR", "")) if os.getenv("AIRBYTE_SCHEMAS_DIR") else None
+    ref = os.getenv("AIRBYTE_SCHEMAS_REF")  # e.g., a commit SHA or branch
+    if local_override and local_override.exists():
+        print(f"Using local schemas from {local_override}", file=sys.stderr)
+        return local_override
 
     print("Cloning metadata schemas from airbyte repo...", file=sys.stderr)
 
     subprocess.run(
         [
             "git",
             "clone",
             "--depth",
             "1",
             "--filter=blob:none",
             "--sparse",
+            *(["--branch", ref] if ref else []),
             AIRBYTE_REPO_URL,
             str(clone_dir),
         ],
         check=True,
         capture_output=True,
     )

Then document AIRBYTE_SCHEMAS_DIR and AIRBYTE_SCHEMAS_REF in the README—and optionally test with AIRBYTE_SCHEMAS_REF=main python bin/generate_connector_metadata_files.py to confirm the behavior is what you expect.

bin/generate-component-manifest-dagger.sh (1)

8-11: Harden the bash script with strict flags.

To fail fast and catch unset vars/pipes, shall we use -Eeuo pipefail? wdyt?

Apply this diff:

-set -e
+set -Eeuo pipefail
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 015a60e and fe4b9cc.

📒 Files selected for processing (2)
  • bin/generate-component-manifest-dagger.sh (1 hunks)
  • bin/generate_connector_metadata_files.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
bin/generate_connector_metadata_files.py (1)
bin/generate_component_manifest_files.py (1)
  • main (113-155)
🪛 GitHub Actions: Linters
bin/generate_connector_metadata_files.py

[error] 243-251: Ruff format check failed: 1 file would be reformatted (bin/generate_connector_metadata_files.py). Ruff exited with code 1.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Manifest Server Docker Image Build
  • GitHub Check: SDM Docker Image Build

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
pyproject.toml (1)

147-149: Consider adding GitHub rate-limit mitigation to build configuration.

Per the PR description, assemble-metadata downloads JSON Schema YAML files from GitHub on-demand without authentication, which can hit rate limits (~60 requests/hour for unauthenticated requests). This could cause CI builds to fail intermittently, especially in fork workflows or frequent rebuilds.

Would it be worth documenting or implementing one of these mitigations in the build config?

  • Environment variable for GitHub token: GITHUB_TOKEN → pass to the generation script
  • Local caching/fallback mechanism for schema files
  • Retry logic with exponential backoff (already available via backoff package, which is a dependency)

Consider documenting the optional GITHUB_TOKEN environment variable in a build or generation script, and potentially storing schema files in a cache directory that persists across builds (e.g., .cache/schemas/). This is especially important for CI/CD workflows.
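As one illustration, a retry-with-backoff wrapper around the download could look roughly like this (a sketch assuming the backoff and requests packages; fetch_schema and its url parameter are hypothetical):

import os

import backoff
import requests


@backoff.on_exception(backoff.expo, requests.exceptions.RequestException, max_tries=5)
def fetch_schema(url: str) -> str:
    """Fetch one schema file, authenticating with GITHUB_TOKEN when it is set."""
    headers = {}
    token = os.getenv("GITHUB_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # HTTPError subclasses RequestException, so 4xx/5xx also retry
    return response.text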

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fe4b9cc and 66d4eeb.

📒 Files selected for processing (1)
  • pyproject.toml (1 hunks)
🔇 Additional comments (1)
pyproject.toml (1)

147-149: Task sequencing looks correct; confirm implicit dependency is intended.

The assemble task now sequences assemble-declarative → assemble-metadata. This order makes sense logically, but is there an actual data dependency (i.e., does the metadata generation require outputs from declarative generation)? If not, consider documenting the rationale or allowing them to run in parallel for faster builds.

Could you clarify: Does assemble-metadata depend on any outputs from assemble-declarative, or are they independent? Wdyt?

Comment on lines 73 to 81
codegen_container = (
    dagger_client.container()
    .from_(PYTHON_IMAGE)
    .with_exec(["mkdir", "-p", "/generated_temp"], use_entrypoint=True)
    .with_exec(["pip", "install", " ".join(PIP_DEPENDENCIES)], use_entrypoint=True)
    .with_mounted_directory(
        "/yaml", dagger_client.host().directory(yaml_dir_path, include=["*.yaml"])
    )
)
Contributor Author

Devin, we should not use Dagger here (read: never use Dagger in new code). You can use uvx if you need to run something that isn't part of our project's dev dependencies.

Or, you can just use uv script syntax with PEP 723 inline dependency declarations for any libraries this script needs to call directly. (I don't know if that solves for CLI-based invocation from the script, though.)
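For reference, PEP 723 inline script metadata looks roughly like this (a sketch; the dependency list is illustrative, not this script's actual requirements):

# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "pyyaml",
# ]
# ///
import yaml

# `uv run this_script.py` reads the metadata block above and installs
# the declared dependencies into an ephemeral environment before executing.
print(yaml.safe_load("metadata: example"))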



async def main():
    async with dagger.Connection(dagger.Config(log_output=sys.stderr)) as dagger_client:
Contributor Author

Ditto. Dagger disallowed in new code.

devin-ai-integration bot and others added 2 commits October 27, 2025 21:13
- Removed Dagger dependency from new script
- Use uvx to run datamodel-codegen directly
- Simplified script significantly (no async, no container orchestration)
- Script is now ~280 lines vs ~290 lines with Dagger
- Addresses PR feedback to avoid Dagger in new code

Co-Authored-By: AJ Steers <aj@airbyte.io>
- Changed order of operations: YAML → JSON → Python (instead of YAML → Python → merge)
- Removed all manual Python file merging logic (~100 lines of code removed)
- datamodel-codegen now generates single file directly from consolidated JSON
- Much simpler and more maintainable
- Addresses AJ's feedback to avoid manual Python code manipulation

Co-Authored-By: AJ Steers <aj@airbyte.io>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (4)
bin/generate_connector_metadata_files.py (4)

1-281: Run ruff format to fix formatting issues.

The pipeline is blocked because ruff format detected formatting changes. This was previously flagged and needs to be addressed.

Run the following locally to fix:

ruff format bin/generate_connector_metadata_files.py

183-183: Use recursive glob to catch nested YAML schemas.

Only top-level *.yaml files are matched here, which risks missing nested schemas. This was previously flagged—would switching to .rglob("*.yaml") work? wdyt?

Apply this diff:

-    for yaml_file in yaml_dir_path.glob("*.yaml"):
+    for yaml_file in sorted(yaml_dir_path.rglob("*.yaml")):

201-206: Handle $ref fragments like "Foo.yaml#/definitions/Bar".

References with fragments (e.g., "SomeSchema.yaml#/definitions/SomeType") won't be rewritten correctly. This was previously flagged—should we handle the fragment portion explicitly? wdyt?

Consider applying this diff to handle fragments:

                 elif key == "$ref" and isinstance(value, str):
-                    if value.endswith(".yaml"):
-                        schema_name = value.replace(".yaml", "")
-                        new_obj[key] = f"#/definitions/{schema_name}"
+                    if ".yaml" in value:
+                        # Handle both "Foo.yaml" and "Foo.yaml#/definitions/Bar"
+                        base, _sep, fragment = value.partition("#")
+                        schema_name = base.replace(".yaml", "")
+                        if fragment:
+                            # Internal definitions are hoisted to the top level,
+                            # so keep only the fragment as the new ref target
+                            new_obj[key] = f"#{fragment}"
+                        else:
+                            new_obj[key] = f"#/definitions/{schema_name}"
                     else:
                         new_obj[key] = value

224-242: Merge main schema's internal definitions into the consolidated output.

The spread operator on line 228 will overwrite the custom title and description, and the main schema's internal definitions aren't being merged into consolidated["definitions"]. This was previously flagged—should we reorder the spread and explicitly merge the main schema's definitions? wdyt?

Apply this diff:

         # Create a consolidated schema with definitions
         consolidated = {
+            **main_schema,
             "$schema": main_schema.get("$schema", "http://json-schema.org/draft-07/schema#"),
             "title": "Connector Metadata Schema",
             "description": "Consolidated JSON schema for Airbyte connector metadata validation",
-            **main_schema,
             "definitions": {},
         }
+        
+        # Preserve any internal definitions from the main schema
+        if "definitions" in main_schema:
+            for def_name, def_content in main_schema["definitions"].items():
+                consolidated["definitions"][def_name] = fix_refs(def_content, in_definition=True)
 
         # Add all schemas (including their internal definitions) as top-level definitions
         for schema_name, schema_content in schemas.items():
🧹 Nitpick comments (2)
bin/generate_connector_metadata_files.py (2)

255-278: Consider adding error handling for network failures.

The script will crash immediately on any subprocess failure. While check=True ensures failures are caught, there's no retry logic for transient network issues or validation that the cloned directory contains schemas. Would it be worth adding basic error handling and validation, or is fail-fast acceptable for this build script? wdyt?

Example validation you could add:

     with tempfile.TemporaryDirectory() as temp_dir:
         temp_path = Path(temp_dir)
         schemas_dir = clone_schemas_from_github(temp_path)
+        
+        # Validate that we actually got some schemas
+        yaml_files = list(schemas_dir.rglob("*.yaml"))
+        if not yaml_files:
+            print(f"Error: No YAML schemas found in {schemas_dir}", file=sys.stderr)
+            sys.exit(1)
+        print(f"Found {len(yaml_files)} YAML schema file(s)", file=sys.stderr)

99-176: The import filtering logic works but has some duplication.

The logic for detecting and filtering relative import blocks appears twice (lines 114-138 and 155-173). Both sections handle multi-line imports ending with ) or ,. Would consolidating this into a helper function make the code more maintainable? wdyt?

Example helper:

def is_relative_import_line(line: str) -> bool:
    """Check if a line is part of a relative import statement."""
    stripped = line.strip()
    return stripped.startswith("from .") or stripped.startswith("import .")

def skip_relative_import_block(lines: list[str], start_idx: int) -> int:
    """Skip a relative import block and return the index after it."""
    if not lines[start_idx].rstrip().endswith(",") and not lines[start_idx].rstrip().endswith("("):
        return start_idx + 1
    
    idx = start_idx + 1
    while idx < len(lines):
        if lines[idx].strip().endswith(")"):
            return idx + 1
        idx += 1
    return idx
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 66d4eeb and 23837eb.

📒 Files selected for processing (1)
  • bin/generate_connector_metadata_files.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
bin/generate_connector_metadata_files.py (1)
bin/generate_component_manifest_files.py (1)
  • main (113-155)
🪛 GitHub Actions: Linters
bin/generate_connector_metadata_files.py

[error] 1-1: ruff format --diff detected formatting changes required. 1 file would be reformatted. Run 'ruff format' to fix code style issues.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Manifest Server Docker Image Build
🔇 Additional comments (3)
bin/generate_connector_metadata_files.py (3)

76-97: Good use of uvx instead of Dagger!

The implementation correctly uses uvx to run datamodel-codegen with a pinned version, which aligns with the feedback to avoid Dagger in new code. The command-line arguments look appropriate for generating the models.


20-24: Good practice checking for pyyaml availability.

The try/except block with a clear error message is helpful for users who might be missing the dependency.
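For readers unfamiliar with the pattern, the guard described is of this general shape (a sketch; the exact message text in the script may differ):

import sys

try:
    import yaml
except ImportError:
    # Fail early with an actionable hint instead of a bare traceback
    print("Error: pyyaml is required to run this script (pip install pyyaml).", file=sys.stderr)
    sys.exit(1)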


207-210: Verification complete—the type→$ref conversion is correct.

The script found actual examples in Airbyte's schema YAML files: ConnectorMetrics.yaml uses type: ConnectorMetric (a schema name, not a JSON Schema primitive). This confirms the conversion logic is intentional and necessary for handling Airbyte's custom schema structure. The code correctly transforms these schema name references into $ref pointers.
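To make the transformation concrete, a hypothetical before/after for that pattern (the property name is invented for illustration):

# As it might appear in ConnectorMetrics.yaml (custom type name, not a primitive):
before = {"properties": {"usage": {"type": "ConnectorMetric"}}}

# After fix_refs rewrites the custom type into a definitions reference:
after = {"properties": {"usage": {"$ref": "#/definitions/ConnectorMetric"}}}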

devin-ai-integration bot and others added 2 commits October 27, 2025 21:20
- Preserve main schema's internal definitions (don't overwrite them)
- Only convert custom types to $ref, not JSON primitives (string, object, etc.)
- Support YAML refs with fragments (e.g., Foo.yaml#/Bar)
- Use regex to properly parse .yaml references with optional fragments

Addresses CodeRabbit's critical feedback on bin/generate_connector_metadata_files.py

Co-Authored-By: AJ Steers <aj@airbyte.io>
- Added type hints to fix_refs function and main function
- Fixed ruff formatting issues
- All mypy checks now pass for the new script

Co-Authored-By: AJ Steers <aj@airbyte.io>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
bin/generate_connector_metadata_files.py (3)

40-53: Consider adding timeout to subprocess calls.

The git clone operation has no timeout, so it could hang indefinitely if the GitHub server is unresponsive. Would it make sense to add a reasonable timeout (e.g., 300 seconds) to avoid hanging the build? wdyt?

Apply this diff:

     subprocess.run(
         [
             "git",
             "clone",
             "--depth",
             "1",
             "--filter=blob:none",
             "--sparse",
             AIRBYTE_REPO_URL,
             str(clone_dir),
         ],
         check=True,
         capture_output=True,
+        timeout=300,
     )

Similarly for the sparse-checkout command at line 55-59.


163-186: Consider adding timeout to subprocess call.

Similar to the git operations, the datamodel-codegen invocation has no timeout. Would it make sense to add one to prevent hanging builds? wdyt?

Apply this diff:

     subprocess.run(
         [
             "uvx",
             "--from",
             f"datamodel-code-generator=={DATAMODEL_CODEGEN_VERSION}",
             "datamodel-codegen",
             "--input",
             str(json_schema_path),
             "--output",
             str(output_file_path),
             "--input-file-type",
             "jsonschema",
             "--disable-timestamp",
             "--enum-field-as-literal",
             "one",
             "--set-default-enum-member",
             "--use-double-quotes",
             "--remove-special-field-name-prefix",
             "--field-extra-keys",
             "deprecated",
             "deprecation_message",
         ],
         check=True,
+        timeout=600,
     )

188-190: Fragile pydantic import rewriting could break in edge cases.

The simple string replacement content.replace("from pydantic", "from pydantic.v1") will incorrectly modify comments, docstrings, or string literals containing that phrase. While unlikely with generated code, would a regex-based approach targeting actual import statements be more robust? wdyt?

Apply this diff to use a more targeted regex replacement:

     content = output_file_path.read_text()
-    content = content.replace("from pydantic", "from pydantic.v1")
+    # Rewrite pydantic imports to use v1 compatibility layer
+    content = re.sub(
+        r'^(\s*)from pydantic import',
+        r'\1from pydantic.v1 import',
+        content,
+        flags=re.MULTILINE
+    )
+    content = re.sub(
+        r'^(\s*)import pydantic\b',
+        r'\1import pydantic.v1 as pydantic',
+        content,
+        flags=re.MULTILINE
+    )
     output_file_path.write_text(content)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 23837eb and 3c2a4f8.

⛔ Files ignored due to path filters (2)
  • airbyte_cdk/test/models/connector_metadata/generated/metadata_schema.json is excluded by !**/generated/**
  • airbyte_cdk/test/models/connector_metadata/generated/models.py is excluded by !**/generated/**
📒 Files selected for processing (1)
  • bin/generate_connector_metadata_files.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
bin/generate_connector_metadata_files.py (1)
bin/generate_component_manifest_files.py (1)
  • main (113-155)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: Check: source-shopify
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Check: source-intercom
  • GitHub Check: Check: destination-motherduck
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Manifest Server Docker Image Build
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Pytest (All, Python 3.13, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.12, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
🔇 Additional comments (2)
bin/generate_connector_metadata_files.py (2)

195-214: LGTM! Clean orchestration logic.

The main function properly uses a temporary directory context manager and orchestrates the steps logically with helpful progress messages to stderr.


103-104: Remove or document the unused type == "const" branch.

The code silently drops entries where type equals "const" (lines 103-104), but our search found zero instances of this pattern in any source YAML schemas. Either this is dead code that can be removed, or it's defensive handling for a malformed edge case that should be documented with a comment explaining why it exists and how it should be handled. Which is it? If this is intentional, would you mind adding a comment clarifying the edge case and perhaps logging a warning when it occurs? That way future maintainers understand the rationale. wdyt?

Likely an incorrect or invalid review comment.
