Presentation demos for the IPA Rwanda country office showcasing modern data processing tools through practical benchmarks.
Three self-contained demonstrations comparing traditional vs. modern approaches using a synthetic 500,000-row Rwanda household survey dataset:
- Parquet vs CSV - File format comparison (write/read speed, storage size, query performance)
- Polars vs Pandas - DataFrame library comparison (5 common operations)
- DuckDB vs PostgreSQL - Database comparison (4 analytical queries + Parquet scanning)
All demos use the same reproducible synthetic dataset representing a typical IPA Rwanda household survey with districts, treatment arms, survey rounds, and income data.
- Python 3.12+
- UV package manager
- PostgreSQL binaries (for demo 3 only)
```shell
# Install/update all dependencies
uv sync
```

Open the Jupyter notebooks in VS Code or Jupyter Lab:

- `demo1_parquet_vs_csv.ipynb`
- `demo2_polars_vs_pandas.ipynb`
- `demo3_duckdb_vs_sql.ipynb`
Compares Parquet and CSV file formats across:
- Write performance
- Read performance
- File size efficiency
- Query performance (filtered reads)
Benchmarks five common data operations:
- Filtering rows
- Grouping and aggregation
- Joining datasets
- Creating new columns
- Sorting
Compares embedded analytical database (DuckDB) vs. traditional database (PostgreSQL):
- Aggregation queries
- Filtering and grouping
- Complex joins
- Window functions
- Direct Parquet file scanning (DuckDB only)
All demos generate identical synthetic data with:
- 500,000 rows
- 10 Rwandan districts across 5 provinces (Kigali, Eastern, Western, Northern, Southern)
- 3 treatment arms: control, treatment_A, treatment_B
- 3 survey rounds: baseline, midline, endline
- Monthly income (USD): exponential distribution reflecting rural household income
- GPS coordinates: realistic bounds for Rwanda
Seed is fixed (SEED = 42) for reproducibility.
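A sketch of what such a generator looks like; the exact column names, district list, and income scale here are assumptions, but the shape (seeded RNG, categorical draws, exponential income, bounded GPS) matches the description above:

```python
import numpy as np
import pandas as pd

SEED = 42
N = 500_000
rng = np.random.default_rng(SEED)

# Illustrative district picks spanning the five provinces
districts = [
    "Gasabo", "Kicukiro", "Nyarugenge",
    "Nyagatare", "Rwamagana", "Rubavu", "Rusizi",
    "Musanze", "Gicumbi", "Huye",
]
df = pd.DataFrame({
    "household_id": np.arange(N),
    "district": rng.choice(districts, N),
    "treatment_arm": rng.choice(["control", "treatment_A", "treatment_B"], N),
    "survey_round": rng.choice(["baseline", "midline", "endline"], N),
    "monthly_income_usd": rng.exponential(scale=120.0, size=N).round(2),
    "latitude": rng.uniform(-2.84, -1.05, N),   # approximate Rwanda bounds
    "longitude": rng.uniform(28.86, 30.90, N),
})
```

Because the seed is fixed, re-running the generator yields an identical frame, so benchmark results are comparable across demos and runs.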
Each demo notebook produces:
- Inline benchmark results and visualizations
- Interactive HTML charts saved to the output directory
- Temporary data files (cleaned up automatically)
Core libraries:
- `pandas` - Traditional DataFrame library
- `polars` - Modern high-performance DataFrame library
- `pyarrow` - Parquet file support
- `duckdb` - Embedded analytical database
- `psycopg2-binary` + `testing.postgresql` - PostgreSQL support
- `plotly` - Interactive visualizations
- `numpy` - Data generation
See pyproject.toml for complete dependency list.
The notebooks can be edited directly in VS Code or Jupyter Lab. All code is contained in the .ipynb files.
- Python: 3.12 (Windows)
- Venv: `.venv/` (auto-detected by VS Code)
- Package manager: UV via `pyproject.toml`
| File | Description |
|---|---|
| `demo1_parquet_vs_csv.ipynb` | Parquet vs CSV benchmark |
| `demo2_polars_vs_pandas.ipynb` | Polars vs Pandas benchmark |
| `demo3_duckdb_vs_sql.ipynb` | DuckDB vs PostgreSQL benchmark |
| `pyproject.toml` | Dependencies and project config |
| `README.md` | This file |
- Demo 3 uses `testing.postgresql` to automatically spin up a temporary PostgreSQL server; no manual setup is required
- All benchmarks include interactive Plotly charts saved as HTML files
- Results are reproducible across runs due to fixed random seed
For use in IPA Rwanda country office demonstrations.