
Feature - Experiment - Implement Complex Social Rewards to Study AI Alignment #45

@bordumb

Description

Problem Description

Our latest A/B test, "Sugarscape - Full Cognitive Comparison," produced a critical finding: while there were statistically significant differences in agent survival, there was no significant difference in social behaviors (attacking, sharing, reproducing) across the different cognitive architectures.

The analysis of total_attacks and total_shares yielded p-values of 0.1696 and 0.0675 respectively (both above the conventional 0.05 threshold), indicating that, statistically, the agent strategies were socially indistinguishable.

Root Cause: The current implementation of SugarscapeRewardCalculator only provides rewards for harvesting sugar. There are no explicit incentives or penalties for social actions. As a result, the learning agents have no feedback signal to optimize their social strategies, and their behavior in this domain defaults to random exploration.

Proposed Solution

To properly test hypotheses related to AI alignment and emergent social dynamics, we must introduce a richer incentive structure that creates a social dilemma for the agents.

We need to update the SugarscapeRewardCalculator in simulations/sugarscape_sim/providers.py to provide explicit rewards for social actions.

Implementation Details

  1. Modify SugarscapeRewardCalculator: The calculate_final_reward method should be updated to check the action_type.action_id.

  2. Attack Reward: For a successful attack action, the reward should be a significant bonus, likely proportional to the stolen_energy value found in the outcome_details dictionary. This makes aggression a viable, high-risk/high-reward strategy.

  3. Share Reward: For a share action, provide a small, fixed positive reward. This incentivizes pro-social, cooperative behavior.

  4. Reproduce Reward: For a successful reproduce action, provide a large positive reward, reflecting its biological imperative and making it a desirable long-term goal.

  5. Reward Breakdown: The reward_breakdown dictionary returned by the function should be updated to include these new reward components for clear logging and analysis.
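A minimal sketch of what steps 1–5 could look like, shown as a standalone function for clarity. The names `action_id`, `outcome_details`, `stolen_energy`, and `reward_breakdown` come from this issue; the reward constants, the `success` flag, and the function signature are illustrative assumptions, not the actual `SugarscapeRewardCalculator` API:

```python
# Hypothetical sketch of the proposed calculate_final_reward logic.
# Constants are placeholder values to be tuned experimentally.
ATTACK_REWARD_MULTIPLIER = 2.0   # assumed: bonus proportional to stolen energy
SHARE_REWARD = 1.0               # assumed: small fixed pro-social bonus
REPRODUCE_REWARD = 10.0          # assumed: large bonus for successful reproduction

def calculate_final_reward(action_id, outcome_details, base_reward=0.0):
    """Return (total_reward, reward_breakdown) for one resolved action."""
    breakdown = {"base": base_reward}

    if action_id == "attack" and outcome_details.get("success"):
        # Aggression pays off in proportion to the energy stolen (step 2).
        stolen = outcome_details.get("stolen_energy", 0.0)
        breakdown["attack_bonus"] = ATTACK_REWARD_MULTIPLIER * stolen
    elif action_id == "share":
        # Small fixed incentive for cooperation (step 3).
        breakdown["share_bonus"] = SHARE_REWARD
    elif action_id == "reproduce" and outcome_details.get("success"):
        # Large long-term incentive (step 4).
        breakdown["reproduce_bonus"] = REPRODUCE_REWARD

    # The breakdown is returned alongside the total for logging (step 5).
    return sum(breakdown.values()), breakdown
```

Keeping every component in the breakdown dictionary (rather than summing inline) is what makes per-action reward analysis possible in later experiment runs.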

Acceptance Criteria

  • The calculate_final_reward method in simulations/sugarscape_sim/providers.py is updated with the new reward logic for attack, share, and reproduce.
  • A new experiment run using the updated reward calculator shows statistically significant differences in the total_attacks and total_shares metrics between the different agent groups.
  • The learning agents (especially Q-Learning and LLM-based agents) should demonstrate clear adaptation to the new incentive structure, developing either pro-social (sharing) or anti-social (attacking) strategies.
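To verify the second criterion, the per-group `total_attacks` and `total_shares` samples need a significance test. As an illustrative sketch (not code from this repo), a seeded two-group permutation test on the difference in means avoids distributional assumptions; the function name and arguments are hypothetical:

```python
import random

def permutation_test_pvalue(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns an estimated p-value for the null hypothesis that the two
    groups (e.g. total_attacks per agent, by cognitive architecture)
    are drawn from the same distribution.
    """
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        # Reassign group labels at random and recompute the statistic.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a
                   - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (extreme + 1) / (n_perm + 1)
```

With the new reward structure in place, the criterion is met when this p-value drops below 0.05 for both metrics across agent groups.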
