Data Efficiency Tools

Presentation demos for the IPA Rwanda country office showcasing modern data processing tools through practical benchmarks.

Overview

Three self-contained demonstrations comparing traditional vs. modern approaches using a synthetic 500,000-row Rwanda household survey dataset:

Parquet vs CSV - File format comparison (write/read speed, storage size, query performance)
Polars vs Pandas - DataFrame library comparison (5 common operations)
DuckDB vs PostgreSQL - Database comparison (4 analytical queries + Parquet scanning)

All demos use the same reproducible synthetic dataset representing a typical IPA Rwanda household survey with districts, treatment arms, survey rounds, and income data.

Quick Start

Prerequisites

Python 3.12+
UV package manager
PostgreSQL binaries (for demo3 only) - Download here

Installation

# Install/update all dependencies
uv sync

Running Demos

Open the Jupyter notebooks in VS Code or Jupyter Lab:

demo1_parquet_vs_csv.ipynb
demo2_polars_vs_pandas.ipynb
demo3_duckdb_vs_sql.ipynb

Demo Details

Demo 1: Parquet vs CSV

Compares Parquet and CSV file formats across:

Write performance
Read performance
File size efficiency
Query performance (filtered reads)

Demo 2: Polars vs Pandas

Benchmarks five common data operations:

Filtering rows
Grouping and aggregation
Joining datasets
Creating new columns
Sorting

Demo 3: DuckDB vs PostgreSQL

Compares embedded analytical database (DuckDB) vs. traditional database (PostgreSQL):

Aggregation queries
Filtering and grouping
Complex joins
Window functions
Direct Parquet file scanning (DuckDB only)

Dataset

All demos generate identical synthetic data with:

500,000 rows
10 Rwandan districts across 5 provinces (Kigali, Eastern, Western, Northern, Southern)
3 treatment arms: control, treatment_A, treatment_B
3 survey rounds: baseline, midline, endline
Monthly income (USD): exponential distribution reflecting rural household income
GPS coordinates: realistic bounds for Rwanda

Seed is fixed (SEED = 42) for reproducibility.

Output

Each demo notebook produces:

Inline benchmark results and visualizations
Interactive HTML charts saved to the output directory
Temporary data files (cleaned up automatically)

Dependencies

Core libraries:

pandas - Traditional DataFrame library
polars - Modern high-performance DataFrame library
pyarrow - Parquet file support
duckdb - Embedded analytical database
psycopg2-binary + testing.postgresql - PostgreSQL support
plotly - Interactive visualizations
numpy - Data generation

See pyproject.toml for complete dependency list.

Development

The notebooks can be edited directly in VS Code or Jupyter Lab. All code is contained in the .ipynb files.

Environment

Python: 3.12 (Windows)
Venv: .venv/ (auto-detected by VS Code)
Package manager: UV via pyproject.toml

Files

File	Description
`demo1_parquet_vs_csv.ipynb`	Parquet vs CSV benchmark
`demo2_polars_vs_pandas.ipynb`	Polars vs Pandas benchmark
`demo3_duckdb_vs_sql.ipynb`	DuckDB vs PostgreSQL benchmark
`pyproject.toml`	Dependencies and project config
`README.md`	This file

Notes

Demo 3 uses testing.postgresql to automatically spin up a temporary PostgreSQL server - no manual setup required
All benchmarks include interactive Plotly charts saved as HTML files
Results are reproducible across runs due to fixed random seed

License

For use by IPA Rwanda country office demonstrations.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data Efficiency Tools

Overview

Quick Start

Prerequisites

Installation

Running Demos

Demo Details

Demo 1: Parquet vs CSV

Demo 2: Polars vs Pandas

Demo 3: DuckDB vs PostgreSQL

Dataset

Output

Dependencies

Development

Environment

Files

Notes

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
output		output
.gitignore		.gitignore
.python-version		.python-version
README.md		README.md
demo1_parquet_vs_csv.ipynb		demo1_parquet_vs_csv.ipynb
demo2_polars_vs_pandas.ipynb		demo2_polars_vs_pandas.ipynb
demo3_duckdb_vs_sql.ipynb		demo3_duckdb_vs_sql.ipynb
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

Data Efficiency Tools

Overview

Quick Start

Prerequisites

Installation

Running Demos

Demo Details

Demo 1: Parquet vs CSV

Demo 2: Polars vs Pandas

Demo 3: DuckDB vs PostgreSQL

Dataset

Output

Dependencies

Development

Environment

Files

Notes

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages