-
Notifications
You must be signed in to change notification settings - Fork 9
Description
Description
Given a small sample of structured data (e.g., 100 JSON records), generate a much larger dataset (e.g., 1,000 records) that is statistically and semantically similar to the original.
Constraints
-
Maintaining Statistical Distribution: the LLM must not just copy the types of values, but also their frequency. If 20% of users in the original set are "admin" and 80% are "user," the 1,000-record set should reflect this ratio. This is very difficult for an LLM, which naturally follows linguistic, not statistical, probability.
-
Preserving Correlations: the LLM must learn implicit rules. For example, "if plan_type is 'Free', storage_limit is always '1GB', but if plan_type is 'Pro', storage_limit is '10GB' or '50GB'." It needs to generate new, valid combinations of these correlated fields.
-
Avoiding Repetition: the generated records must be novel and not just slight rephrasings or duplicates of the original 100.
Before start this issue suggest a solution and wait for the approval.