
"The limits of my language mean the limits of my world."
— Ludwig Wittgenstein, Tractatus Logico-Philosophicus

Prompt Entropy Experiment

Empirical validation of information-theoretic principles in prompt engineering for generative AI systems.

Research Questions

Can we quantify prompt quality using Shannon entropy and mutual information? Does the effect persist across different sampling regimes (temperature settings)?

Hypotheses

  • H1 (Primary): Specification-driven prompts reduce output entropy across all temperatures.
  • H2 (Validation): Entropy increases monotonically with temperature.
  • H3 (Interaction): Does the entropy difference between prompt types persist, amplify, or converge with temperature?
  • H4: Mutual information correlates negatively with entropy across all temperatures.

Experimental Design

Multi-temperature study across:

  • 30 tasks spanning 6 domains
  • 2 prompt types (specification-driven vs. vague)
  • 2 models (GPT-4, Claude-3.5 Sonnet)
  • 3 temperatures (0.7 production, 1.0 baseline, 1.2 exploration)
  • 30 samples per condition

Total planned generations: 10,800 (30 tasks × 2 prompts × 2 models × 3 temps × 30 samples)
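
As a sanity check, the factorial grid can be enumerated directly; the condition labels in the sketch below are illustrative placeholders, not the repository's actual config schema:

from itertools import product

# Illustrative enumeration of the factorial design grid.
tasks = range(30)
prompt_types = ["specification", "vague"]
models = ["gpt-4", "claude-3.5-sonnet"]
temperatures = [0.7, 1.0, 1.2]
samples_per_condition = 30

conditions = list(product(tasks, prompt_types, models, temperatures))
print(len(conditions) * samples_per_condition)  # 360 conditions x 30 samples = 10800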

Domains Studied

  1. Technical Programming
  2. Data Analysis
  3. Business Analysis
  4. Technical Writing
  5. Creative Writing
  6. Explanatory Content

Repository Structure

prompt-entropy-experiment/
├── paper/                 # LaTeX source for academic paper
├── data/
│   ├── raw/              # Raw generation samples
│   └── processed/        # Computed metrics and analysis
├── notebooks/            # Jupyter notebooks for analysis
├── src/                  # Python source code
│   ├── metrics/          # Entropy and MI calculation
│   ├── sampling/         # LLM sampling utilities
│   └── analysis/         # Statistical analysis
├── results/              # Statistical results and tables
└── figures/              # Generated plots and visualizations

Temperature Framework

The study uses three strategic temperatures to validate findings across sampling regimes:

  • T=0.7 (Production): Real-world deployment setting, practical relevance
  • T=1.0 (Baseline): Natural, unscaled probability distribution, theoretical purity
  • T=1.2 (Exploration): Latent space exploration, robustness validation

This design tests whether the entropy-reducing effect of specification prompts is:

  1. Production-valid: Does it hold in real-world settings? (0.7)
  2. Theoretically sound: What is the effect at the pure distribution? (1.0)
  3. Robust: Does it persist during latent exploration? (1.2)
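
Mechanically, temperature divides the model's next-token logits before the softmax, so T < 1 sharpens the distribution (lower entropy) and T > 1 flattens it (higher entropy). A minimal standalone illustration in NumPy, not code from this repository:

import numpy as np

def apply_temperature(logits: np.ndarray, T: float) -> np.ndarray:
    """Rescale next-token logits by temperature T, then renormalize.

    T < 1 sharpens the distribution; T > 1 flattens it; T = 1 leaves
    the model's natural distribution unscaled.
    """
    scaled = logits / T
    scaled -= scaled.max()  # subtract max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

logits = np.array([2.0, 1.0, 0.1])
for T in (0.7, 1.0, 1.2):
    print(T, apply_temperature(logits, T).round(3))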

Methodology

Entropy Metrics

  • Token Entropy: Shannon entropy over token distributions (a minimal sketch follows this list)
  • Semantic Entropy: Clustering-based entropy in embedding space
  • Structural Entropy: Entropy over structural features
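
To make the first metric concrete, here is a minimal sketch of Shannon entropy over a pooled token distribution; the actual src.metrics implementation may differ in tokenization and pooling choices:

import math
from collections import Counter

def token_entropy(samples: list[list[str]]) -> float:
    """Shannon entropy (in bits) of the pooled token distribution
    across a set of tokenized samples."""
    counts = Counter(token for sample in samples for token in sample)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(token_entropy([["the", "cat", "sat"], ["the", "dog", "sat"]]))  # ~1.92 bits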

Mutual Information Estimation

  • MI Content: Information content indicators (numbers, constraints, examples)
  • MI Coverage: Task concept coverage
  • MI Semantic: Embedding similarity
  • MI Combined: Weighted combination of the component estimates (sketched below)
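
A minimal sketch of the combination step, assuming a simple convex weighting of the three component estimates; the weights and function signature are illustrative assumptions, not the repository's values:

def mi_combined(mi_content: float, mi_coverage: float, mi_semantic: float,
                weights: tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """Convex weighting of the three MI component estimates.
    The weights here are illustrative placeholders."""
    w_content, w_coverage, w_semantic = weights
    return w_content * mi_content + w_coverage * mi_coverage + w_semantic * mi_semantic

print(mi_combined(0.8, 0.6, 0.7))  # ~0.71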

Quality Evaluation

  • Correctness (35%)
  • Completeness (25%)
  • Relevance (20%)
  • Coherence (10%)
  • Format Compliance (10%)
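
The rubric implies a linear composite score. A minimal sketch, assuming dimension scores on a 0-1 scale (a hypothetical input format; the repository's evaluator may use a different schema):

# Rubric weights from the list above.
WEIGHTS = {
    "correctness": 0.35,
    "completeness": 0.25,
    "relevance": 0.20,
    "coherence": 0.10,
    "format_compliance": 0.10,
}

def quality_score(scores: dict[str, float]) -> float:
    """Weighted composite of the five quality dimensions."""
    return sum(weight * scores[dim] for dim, weight in WEIGHTS.items())

example = {"correctness": 0.9, "completeness": 0.8, "relevance": 1.0,
           "coherence": 0.7, "format_compliance": 1.0}
print(quality_score(example))  # ~0.885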

Quick Start

# 1. Clone and setup
git clone https://github.com/ibrahimcesar/prompt-entropy-experiment.git
cd prompt-entropy-experiment
make setup

# 2. Configure API keys
make init-env  # Creates .env template
# Edit .env with your API keys

# 3. Run experiment
make run-experiment EXPERIMENT=exp001 CONFIG=config/tasks.json

# Or run multi-temperature study (recommended)
make run-temperature-study EXPERIMENT=temp_comprehensive

Usage

Command-Line (Recommended)

# Full 3-temperature study (production/baseline/exploration)
make run-temperature-study EXPERIMENT=temp001

# Single temperature baseline
make run-experiment EXPERIMENT=baseline TEMPERATURE=1.0

# Quick pilot study (3 tasks, 5 samples per condition)
make run-temperature-study-small EXPERIMENT=pilot

Python API

from src.metrics import calculate_entropy, estimate_mutual_information
from src.sampling import sample_responses

prompt = "..."  # the prompt under evaluation
task = "..."    # the task specification the prompt targets

# Sample n responses from the LLM at a fixed temperature
responses = sample_responses(
    prompt=prompt,
    model="gpt-4",
    n=30,
    temperature=1.0
)

# Calculate entropy over the sampled responses
entropy = calculate_entropy(responses, metric="token")

# Estimate mutual information between prompt and task
mi = estimate_mutual_information(prompt, task)

All experiments include full audit logging with git state, parameters, file hashes, and timestamps for reproducibility.
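
For illustration, a record carrying those four fields might be assembled like the following sketch; the structure is hypothetical, not the repository's actual log format:

import hashlib
import json
import subprocess
from datetime import datetime, timezone

def sha256_of(path: str) -> str:
    """Hash a file's contents for the audit trail."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()

def audit_record(params: dict, files: list[str]) -> dict:
    """Assemble a reproducibility record: timestamp, git commit,
    run parameters, and content hashes of input files."""
    git_sha = subprocess.run(["git", "rev-parse", "HEAD"],
                             capture_output=True, text=True).stdout.strip()
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "git_commit": git_sha,
        "parameters": params,
        "file_hashes": {f: sha256_of(f) for f in files},
    }

print(json.dumps(audit_record({"temperature": 1.0, "n": 30}, []), indent=2))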

Documentation

Comprehensive methodology documentation accompanies the academic paper:

  • LaTeX source: paper/prompt_entropy_paper.tex
  • PDF: paper/prompt_entropy_paper.pdf (after compilation)

Compile Paper

cd paper
pdflatex prompt_entropy_paper.tex   # first pass: writes citation info to .aux
bibtex prompt_entropy_paper         # builds the bibliography from the .aux file
pdflatex prompt_entropy_paper.tex   # incorporates the bibliography
pdflatex prompt_entropy_paper.tex   # final pass resolves cross-references

Citation

@article{cesar2025prompt,
  title={Information-Theoretic Analysis of Prompt Engineering:
         Multi-Temperature Validation of Entropy and Mutual Information Effects},
  author={Cesar, Ibrahim},
  journal={arXiv preprint},
  year={2025}
}

Project Status

Data Collection Phase: Implementing multi-temperature experimental design to validate the robustness of entropy-based prompt quality metrics across different sampling regimes.

Next Steps:

  1. Complete 3-temperature data collection (T=0.7, 1.0, 1.2)
  2. Statistical analysis of main effects and interactions
  3. Publication preparation

This study serves as foundational research before deeper integration into the Categorical Operations Management framework.

Author

Ibrahim Cesar
Independent Researcher
São Paulo, Brazil

License

MIT License - see LICENSE for details

Acknowledgments

Thanks to the broader AI research community for developing the theoretical foundations this work builds upon, and to Anthropic and OpenAI for making Claude-3.5 Sonnet and GPT-4 available for research.
