Skip to content

🚀 Feat: template structured data expansion #37

@nicofretti

Description

@nicofretti

Description

Given a small sample of structured data (e.g., 100 JSON records), generate a much larger dataset (e.g., 1,000 records) that is statistically and semantically similar to the original.

Constraints

  • Maintaining Statistical Distribution: the LLM must not just copy the types of values, but also their frequency. If 20% of users in the original set are "admin" and 80% are "user," the 1,000-record set should reflect this ratio. This is very difficult for an LLM, which naturally follows linguistic, not statistical, probability.

  • Preserving Correlations: the LLM must learn implicit rules. For example, "if plan_type is 'Free', storage_limit is always '1GB', but if plan_type is 'Pro', storage_limit is '10GB' or '50GB'." It needs to generate new, valid combinations of these correlated fields.

  • Avoiding Repetition: the generated records must be novel and not just slight rephrasings or duplicates of the original 100.

Before start this issue suggest a solution and wait for the approval.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions