
📊 Geospatial Performance Benchmark

This document tracks the performance and characteristics of different geospatial software stacks for the use cases defined in the project.

Use Case 1: Data Ingestion

Goal: To measure the one-time cost of preparing OSM data (.pbf) for analysis, either by loading it into a database or by creating a smaller subset.

| Operation Performed | Technology | Test Dataset | Key Metrics | Notes / Qualitative Observations |
| --- | --- | --- | --- | --- |
| Load entire .pbf into a database | PostGIS + osm2pgsql | italy-latest.osm.pbf | Import Time: ~1840s (~31 min)<br>Disk Space: 19 GB | Represents the high upfront cost of creating a fully indexed, queryable database. Ideal for many subsequent, fast queries. |
| Extract a regional .pbf from a national .pbf | osmium-tool | italy-latest.osm.pbf | Execution Time: ~21s | Represents the cost of creating a smaller, more manageable file for file-based workflows. A pre-processing step for tools like PyOsmium. |

(Note: for file-based tools like QuackOSM, the "ingestion" and "filtering" steps happen simultaneously; their performance is therefore measured in Use Case 2.)
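
For reference, a minimal sketch of the regional-extract step, driven from Python via `subprocess`. It assumes osmium-tool is installed and on the PATH; the bounding box (roughly Lombardy) is illustrative, not the exact extent used for the benchmark.

```python
import subprocess

# Hedged sketch: cut a regional extract out of the national PBF with
# osmium-tool. The bounding box is illustrative only.
subprocess.run(
    [
        "osmium", "extract",
        "--bbox", "8.5,44.7,11.4,46.6",  # LEFT,BOTTOM,RIGHT,TOP (WGS84)
        "--overwrite",
        "-o", "lombardy-latest.osm.pbf",
        "italy-latest.osm.pbf",
    ],
    check=True,
)
```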


Use Case 2: Data Filtering and Extraction

Goal: To measure the efficiency of extracting a specific subset of data (buildings in Milan) from a larger source, analyzing the impact of system caches.

| Operation Performed | Technology | Run Type | Test Dataset | Key Metrics | Notes / Qualitative Observations |
| --- | --- | --- | --- | --- | --- |
| Extract buildings from .pbf and save to .geoparquet | DuckDB + QuackOSM | Cold Start | italy-latest.osm.pbf | Execution Time: ~1.34s<br>Output: 6.97 MB | Performance is I/O-bound. This time reflects a true first-time run, limited by reading the large PBF from disk. |
| Extract buildings from .pbf and save to .geoparquet | DuckDB + QuackOSM | Hot Cache Avg. | italy-latest.osm.pbf | Execution Time: ~0.22s<br>Runs: 99 | Pure in-memory performance. The ~6x speedup is due to the operating system's file cache, proving the bottleneck is disk I/O, not CPU. |
| Extract buildings from database | PostGIS | Cold Start | planet_osm_polygon table | Execution Time: ~1.10s | Represents a first-time query, limited by reading data pages and indexes from disk into PostgreSQL's internal cache. |
| Extract buildings from database | PostGIS | Hot Cache Avg. | planet_osm_polygon table | Execution Time: ~0.40s<br>Runs: 99 | Pure database query performance. The ~2.75x speedup is due to the database's internal caching mechanism. |
| Read .pbf, build GeoDataFrame and filter by area | PyOsmium + GeoPandas | Cold Start | lombardy-latest.osm.pbf | Execution Time: ~358.9s<br>Output: 7.61 MB | Performance is heavily CPU-bound. This time reflects the high computational cost of two-pass parsing and geometry construction in Python. |
| Read .pbf, build GeoDataFrame and filter by area | PyOsmium + GeoPandas | Hot Cache Avg. | lombardy-latest.osm.pbf | Execution Time: ~360.1s<br>Runs: 49 | Nearly identical to the cold start, confirming the bottleneck is CPU processing, not disk I/O. The OS cache provides minimal benefit. |
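
A hedged sketch of the QuackOSM run measured above, assuming its `convert_pbf_to_parquet` entry point and its `tags_filter`/`geometry_filter` keyword arguments (check the installed version's API); the Milan boundary is fetched with osmnx purely for illustration.

```python
import osmnx
import quackosm

# Boundary polygon used to clip the extract (illustrative).
milan = osmnx.geocode_to_gdf("Milan, Italy").geometry.iloc[0]

# One step: parse the PBF with DuckDB, keep building features inside
# the boundary, and write the result straight to GeoParquet.
result_path = quackosm.convert_pbf_to_parquet(
    "italy-latest.osm.pbf",
    tags_filter={"building": True},
    geometry_filter=milan,
)
print(result_path)  # location of the resulting .geoparquet file
```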

Use Case 3: Single Table Analysis (Geometric Properties)

Goal: To evaluate performance on calculations that operate on geometries without requiring joins, across different data sizes.
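
As a concrete reference for query 3.1 below, a minimal DuckDB Spatial sketch. The EPSG:4326 storage CRS and EPSG:32632 (UTM 32N) metric CRS are assumptions; recent DuckDB versions may expose the GeoParquet column as GEOMETRY directly, in which case `ST_GeomFromWKB` is unnecessary.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial;")
con.execute("LOAD spatial;")

# Query 3.1: the ten largest building footprints by projected area.
rows = con.execute("""
    SELECT ST_Area(
               ST_Transform(ST_GeomFromWKB(geometry),
                            'EPSG:4326', 'EPSG:32632')
           ) AS area_m2
    FROM 'pinerolo_buildings.geoparquet'
    ORDER BY area_m2 DESC
    LIMIT 10
""").fetchall()
print(rows)
```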

| Operation Performed | Technology | Run Type | Test Dataset | Key Metrics | Notes / Qualitative Observations |
| --- | --- | --- | --- | --- | --- |
| 3.1. Top 10 Largest Areas (Pinerolo) | DuckDB Spatial | Cold Start | pinerolo_buildings.geoparquet | Execution Time: ~0.032s | Found 10 results. Minimal time difference from the hot run. |
| 3.1. Top 10 Largest Areas (Pinerolo) | DuckDB Spatial | Hot Cache Avg. | pinerolo_buildings.geoparquet | Execution Time: ~0.017s<br>Runs: 99 | The ~1.9x speedup on the small dataset suggests a file-metadata caching benefit. |
| 3.1. Top 10 Largest Areas (Pinerolo) | PostGIS | Cold Start | pinerolo_buildings table | Execution Time: ~0.044s | Found 10 results. Performance comparable to the other technologies. |
| 3.1. Top 10 Largest Areas (Pinerolo) | PostGIS | Hot Cache Avg. | pinerolo_buildings table | Execution Time: ~0.017s<br>Runs: 99 | Significant speedup from database caching on the small dataset. |
| 3.1. Top 10 Largest Areas (Pinerolo) | GeoPandas | Cold Start | pinerolo_buildings.geoparquet | Execution Time: ~0.071s | Found 10 results. Slower on cold start due to library overhead. |
| 3.1. Top 10 Largest Areas (Pinerolo) | GeoPandas | Hot Cache Avg. | pinerolo_buildings.geoparquet | Execution Time: ~0.053s<br>Runs: 99 | The slowest of the three on this task, but still very fast. |
| 3.2. Total Buffered Area (Pinerolo) | DuckDB Spatial | Cold Start | pinerolo_buildings.geoparquet | Execution Time: ~0.122s | Result ~7.05M m². With the CRS fix, the result is numerically consistent with PostGIS/GeoPandas. |
| 3.2. Total Buffered Area (Pinerolo) | DuckDB Spatial | Hot Cache Avg. | pinerolo_buildings.geoparquet | Execution Time: ~0.120s<br>Runs: 99 | Performance is consistently CPU-bound. |
| 3.2. Total Buffered Area (Pinerolo) | PostGIS | Cold Start | pinerolo_buildings table | Execution Time: ~0.144s | Result ~7.05M m². Performance is CPU-bound. |
| 3.2. Total Buffered Area (Pinerolo) | PostGIS | Hot Cache Avg. | pinerolo_buildings table | Execution Time: ~0.144s<br>Runs: 99 | The result is consistent with GeoPandas, suggesting it is the correct baseline. |
| 3.2. Total Buffered Area (Pinerolo) | GeoPandas | Cold Start | pinerolo_buildings.geoparquet | Execution Time: ~0.174s | Result ~7.05M m². Performance is CPU-bound. |
| 3.2. Total Buffered Area (Pinerolo) | GeoPandas | Hot Cache Avg. | pinerolo_buildings.geoparquet | Execution Time: ~0.176s<br>Runs: 99 | The result confirms the PostGIS calculation. |
| 3.3. Restaurants NOT near Bus Stops (Pinerolo) | DuckDB Spatial | Cold Start | pinerolo_restaurants.geoparquet and pinerolo_bus_stops.geoparquet | Execution Time: ~0.070s | Found 15 results. Numerically consistent with the other tools. |
| 3.3. Restaurants NOT near Bus Stops (Pinerolo) | DuckDB Spatial | Hot Cache Avg. | pinerolo_restaurants.geoparquet and pinerolo_bus_stops.geoparquet | Execution Time: ~0.062s<br>Runs: 99 | Performance is highly competitive. |
| 3.3. Restaurants NOT near Bus Stops (Pinerolo) | PostGIS | Cold Start | pinerolo_restaurants and pinerolo_bus_stops tables | Execution Time: ~0.003s | Found 15 results. Extremely fast due to database optimizations. |
| 3.3. Restaurants NOT near Bus Stops (Pinerolo) | PostGIS | Hot Cache Avg. | pinerolo_restaurants and pinerolo_bus_stops tables | Execution Time: ~0.0008s<br>Runs: 99 | The fastest technology for this small-scale join. |
| 3.3. Restaurants NOT near Bus Stops (Pinerolo) | GeoPandas | Cold Start | pinerolo_restaurants.geoparquet and pinerolo_bus_stops.geoparquet | Execution Time: ~0.0045s | Found 15 results. The result now matches PostGIS and DuckDB, using the more accurate buffer+sjoin method. |
| 3.3. Restaurants NOT near Bus Stops (Pinerolo) | GeoPandas | Hot Cache Avg. | pinerolo_restaurants.geoparquet and pinerolo_bus_stops.geoparquet | Execution Time: ~0.0054s<br>Runs: 99 | Achieves numerical consistency at minimal performance cost on small data. |
| 3.1. Top 10 Largest Areas (Milan) | DuckDB Spatial | Cold Start | milan_buildings.geoparquet | Execution Time: ~0.199s | Found 10 results. The operation is now CPU-bound on a larger dataset. |
| 3.1. Top 10 Largest Areas (Milan) | DuckDB Spatial | Hot Cache Avg. | milan_buildings.geoparquet | Execution Time: ~0.188s<br>Runs: 99 | Minimal speedup, as the ST_Area calculation is the dominant cost. |
| 3.1. Top 10 Largest Areas (Milan) | PostGIS | Cold Start | milan_buildings table | Execution Time: ~0.214s | Found 10 results. Performance is nearly identical to DuckDB. |
| 3.1. Top 10 Largest Areas (Milan) | PostGIS | Hot Cache Avg. | milan_buildings table | Execution Time: ~0.196s<br>Runs: 99 | Similar CPU-bound behavior. |
| 3.1. Top 10 Largest Areas (Milan) | GeoPandas | Cold Start | milan_buildings.geoparquet | Execution Time: ~0.835s | Found 10 results. Significantly slower than the database approaches on a medium dataset. |
| 3.1. Top 10 Largest Areas (Milan) | GeoPandas | Hot Cache Avg. | milan_buildings.geoparquet | Execution Time: ~0.869s<br>Runs: 99 | The performance gap widens on larger data. |
| 3.2. Total Buffered Area (Milan) | DuckDB Spatial | Cold Start | milan_buildings.geoparquet | Execution Time: ~2.10s | Result ~111.4M m². Numerically correct and consistent with the other tools. |
| 3.2. Total Buffered Area (Milan) | DuckDB Spatial | Hot Cache Avg. | milan_buildings.geoparquet | Execution Time: ~2.08s<br>Runs: 99 | Faster than GeoPandas, slightly slower than PostGIS. |
| 3.2. Total Buffered Area (Milan) | PostGIS | Cold Start | milan_buildings table | Execution Time: ~1.37s | Result ~111.5M m². The fastest technology for this heavy computation. |
| 3.2. Total Buffered Area (Milan) | PostGIS | Hot Cache Avg. | milan_buildings table | Execution Time: ~1.34s<br>Runs: 99 | Consistently CPU-bound behavior. |
| 3.2. Total Buffered Area (Milan) | GeoPandas | Cold Start | milan_buildings.geoparquet | Execution Time: ~2.99s | Result ~111.5M m². Slower than the databases, but confirms the PostGIS result. |
| 3.2. Total Buffered Area (Milan) | GeoPandas | Hot Cache Avg. | milan_buildings.geoparquet | Execution Time: ~2.93s<br>Runs: 99 | Consistently CPU-bound. |
| 3.3. Restaurants NOT near Bus Stops (Milan) | DuckDB Spatial | Cold Start | milan_restaurants.geoparquet and milan_bus_stops.geoparquet | Execution Time: ~0.087s | Found 2135 results. Numerically consistent. |
| 3.3. Restaurants NOT near Bus Stops (Milan) | DuckDB Spatial | Hot Cache Avg. | milan_restaurants.geoparquet and milan_bus_stops.geoparquet | Execution Time: ~0.077s<br>Runs: 99 | Significantly faster than PostGIS for this query type. |
| 3.3. Restaurants NOT near Bus Stops (Milan) | PostGIS | Cold Start | milan_restaurants and milan_bus_stops tables | Execution Time: ~10.53s | Found 2135 results. Very slow on this operation. |
| 3.3. Restaurants NOT near Bus Stops (Milan) | PostGIS | Hot Cache Avg. | milan_restaurants and milan_bus_stops tables | Execution Time: ~10.49s<br>Runs: 99 | Confirms poor performance on this type of spatial anti-join. |
| 3.3. Restaurants NOT near Bus Stops (Milan) | GeoPandas | Cold Start | milan_restaurants.geoparquet and milan_bus_stops.geoparquet | Execution Time: ~0.049s | Found 2136 results. Numerically consistent with PostGIS/DuckDB. |
| 3.3. Restaurants NOT near Bus Stops (Milan) | GeoPandas | Hot Cache Avg. | milan_restaurants.geoparquet and milan_bus_stops.geoparquet | Execution Time: ~0.048s<br>Runs: 99 | Still ~220x faster than PostGIS. The performance cost of the accurate method is small compared to the gain in reliability. |
| 3.1. Top 10 Largest Areas (Rome) | DuckDB Spatial | Cold Start | rome_buildings.geoparquet | Execution Time: ~0.52s | Found 10 results. Performance scales predictably. |
| 3.1. Top 10 Largest Areas (Rome) | DuckDB Spatial | Hot Cache Avg. | rome_buildings.geoparquet | Execution Time: ~0.51s<br>Runs: 99 | Consistently CPU-bound on the largest dataset. |
| 3.1. Top 10 Largest Areas (Rome) | PostGIS | Cold Start | rome_buildings table | Execution Time: ~0.40s | PostGIS becomes faster than DuckDB on the largest dataset for this query. |
| 3.1. Top 10 Largest Areas (Rome) | PostGIS | Hot Cache Avg. | rome_buildings table | Execution Time: ~0.38s<br>Runs: 99 | Demonstrates better scaling than DuckDB for this operation. |
| 3.1. Top 10 Largest Areas (Rome) | GeoPandas | Cold Start | rome_buildings.geoparquet | Execution Time: ~2.52s | The performance gap between GeoPandas and the databases is largest on this dataset. |
| 3.1. Top 10 Largest Areas (Rome) | GeoPandas | Hot Cache Avg. | rome_buildings.geoparquet | Execution Time: ~2.63s<br>Runs: 99 | Confirms that for pure computation on large in-memory data, the database engines are more optimized. |
| 3.2. Total Buffered Area (Rome) | DuckDB Spatial | Cold Start | rome_buildings.geoparquet | Execution Time: ~6.48s | Result ~298.9M m². Numerically correct and consistent. |
| 3.2. Total Buffered Area (Rome) | DuckDB Spatial | Hot Cache Avg. | rome_buildings.geoparquet | Execution Time: ~6.39s<br>Runs: 99 | Slower than PostGIS, but faster than GeoPandas on the largest dataset. |
| 3.2. Total Buffered Area (Rome) | PostGIS | Cold Start | rome_buildings table | Execution Time: ~2.80s | Result ~299.2M m². PostGIS is over 2x faster than DuckDB on this heavy computation. |
| 3.2. Total Buffered Area (Rome) | PostGIS | Hot Cache Avg. | rome_buildings table | Execution Time: ~2.78s<br>Runs: 99 | Confirms superior performance for complex, single-table aggregations on large data. |
| 3.2. Total Buffered Area (Rome) | GeoPandas | Cold Start | rome_buildings.geoparquet | Execution Time: ~9.38s | Result ~299.2M m². Numerically consistent with PostGIS, but the slowest technology. |
| 3.2. Total Buffered Area (Rome) | GeoPandas | Hot Cache Avg. | rome_buildings.geoparquet | Execution Time: ~9.14s<br>Runs: 99 | The performance gap widens significantly on the largest dataset. |
| 3.3. Restaurants NOT near Bus Stops (Rome) | DuckDB Spatial | Cold Start | rome_restaurants.geoparquet and rome_bus_stops.geoparquet | Execution Time: ~0.090s | Found 2298 results. Numerically consistent. |
| 3.3. Restaurants NOT near Bus Stops (Rome) | DuckDB Spatial | Hot Cache Avg. | rome_restaurants.geoparquet and rome_bus_stops.geoparquet | Execution Time: ~0.081s<br>Runs: 99 | Dramatically faster than PostGIS for this large join. |
| 3.3. Restaurants NOT near Bus Stops (Rome) | PostGIS | Cold Start | rome_restaurants and rome_bus_stops tables | Execution Time: ~40.1s | Found 2298 results. Extremely slow on the largest dataset. |
| 3.3. Restaurants NOT near Bus Stops (Rome) | PostGIS | Hot Cache Avg. | rome_restaurants and rome_bus_stops tables | Execution Time: ~40.3s<br>Runs: 99 | The join is significantly slower than DuckDB on the largest dataset. |
| 3.3. Restaurants NOT near Bus Stops (Rome) | GeoPandas | Cold Start | rome_restaurants.geoparquet and rome_bus_stops.geoparquet | Execution Time: ~0.133s | Found 2299 results. Numerically correct; the buffer+sjoin method is more computationally expensive. |
| 3.3. Restaurants NOT near Bus Stops (Rome) | GeoPandas | Hot Cache Avg. | rome_restaurants.geoparquet and rome_bus_stops.geoparquet | Execution Time: ~0.130s<br>Runs: 99 | Highlights the trade-off: slower than the old method, but now correct and still ~310x faster than PostGIS. |
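
To make the "buffer + sjoin" method referenced above concrete, a minimal GeoPandas sketch; the 200 m threshold and the EPSG:32632 metric CRS are assumptions, not the benchmark's exact parameters.

```python
import geopandas as gpd

# Query 3.3 as a spatial anti-join: restaurants NOT near any bus stop.
restaurants = gpd.read_parquet("milan_restaurants.geoparquet").to_crs(32632)
bus_stops = gpd.read_parquet("milan_bus_stops.geoparquet").to_crs(32632)

# Buffer every stop (assumed 200 m), then join restaurants to buffers.
buffers = gpd.GeoDataFrame(geometry=bus_stops.buffer(200), crs=32632)
near = gpd.sjoin(restaurants, buffers, predicate="intersects")

# Keep only the restaurants whose index never matched a buffer.
not_near = restaurants.loc[~restaurants.index.isin(near.index)]
print(len(not_near))
```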

Use Case 4: Complex Spatial Join

Goal: To test performance on computationally intensive join operations on large datasets.
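
For query 4.1 below, a minimal GeoPandas sketch of the point-in-polygon join and count; the file names match the datasets in the table, and `index_right` is the column standard `sjoin` output adds for the right-hand index.

```python
import geopandas as gpd

# Query 4.1: count restaurants falling within each neighborhood polygon.
neighborhoods = gpd.read_parquet("rome_neighborhoods.geoparquet")
restaurants = gpd.read_parquet("rome_restaurants.geoparquet")

joined = gpd.sjoin(restaurants, neighborhoods, predicate="within")
counts = joined.groupby("index_right").size()

# Attach the counts back to the polygons (0 where nothing matched).
neighborhoods["restaurant_count"] = counts.reindex(
    neighborhoods.index, fill_value=0
)
print(neighborhoods["restaurant_count"].sum())
```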

| Operation Performed | Technology | Run Type | Test Dataset | Key Metrics | Notes / Qualitative Observations |
| --- | --- | --- | --- | --- | --- |
| 4.1. Restaurants per Neighborhood (Pinerolo) | DuckDB Spatial | Cold Start | pinerolo_neighborhoods.geoparquet and pinerolo_restaurants.geoparquet | Execution Time: ~0.033s<br>Output: 0.13 MB | Found 13 results, numerically consistent with the other tools. |
| 4.1. Restaurants per Neighborhood (Pinerolo) | DuckDB Spatial | Hot Cache Avg. | pinerolo_neighborhoods.geoparquet and pinerolo_restaurants.geoparquet | Execution Time: ~0.004s<br>Runs: 99 | Extremely fast on hot runs. |
| 4.1. Restaurants per Neighborhood (Pinerolo) | PostGIS | Cold Start | pinerolo_neighborhoods and pinerolo_restaurants tables | Execution Time: ~0.041s | Found 13 results, numerically consistent with the others. |
| 4.1. Restaurants per Neighborhood (Pinerolo) | PostGIS | Hot Cache Avg. | pinerolo_neighborhoods and pinerolo_restaurants tables | Execution Time: ~0.015s<br>Runs: 99 | Good speedup from database caching. |
| 4.1. Restaurants per Neighborhood (Pinerolo) | GeoPandas | Cold Start | pinerolo_neighborhoods.geoparquet and pinerolo_restaurants.geoparquet | Execution Time: ~0.006s<br>Output: 0.12 MB | Found 13 results. The fastest technology for this task on a small dataset. |
| 4.1. Restaurants per Neighborhood (Pinerolo) | GeoPandas | Hot Cache Avg. | pinerolo_neighborhoods.geoparquet and pinerolo_restaurants.geoparquet | Execution Time: ~0.005s<br>Runs: 99 | Confirms its status as the top performer for this operation. |
| 4.2. Tree Count and Length Sum of Residential Roads Near Hospitals (Pinerolo) | DuckDB Spatial | Cold Start | pinerolo_hospitals.geoparquet, pinerolo_residential_streets.geoparquet and pinerolo_trees.geoparquet | Execution Time: ~0.178s | Result: 0 trees, ~5.0k meters. Numerically correct and consistent with PostGIS/GeoPandas. |
| 4.2. Tree Count and Length Sum of Residential Roads Near Hospitals (Pinerolo) | DuckDB Spatial | Hot Cache Avg. | pinerolo_hospitals.geoparquet, pinerolo_residential_streets.geoparquet and pinerolo_trees.geoparquet | Execution Time: ~0.118s<br>Runs: 99 | Slower than GeoPandas, but shows competitive performance. |
| 4.2. Tree Count and Length Sum of Residential Roads Near Hospitals (Pinerolo) | PostGIS | Cold Start | pinerolo_hospitals, pinerolo_residential_streets and pinerolo_trees tables | Execution Time: ~0.099s | Result: 0 trees, ~5.0k meters. This numerical result is the baseline. |
| 4.2. Tree Count and Length Sum of Residential Roads Near Hospitals (Pinerolo) | PostGIS | Hot Cache Avg. | pinerolo_hospitals, pinerolo_residential_streets and pinerolo_trees tables | Execution Time: ~0.079s<br>Runs: 99 | Faster than DuckDB on the small dataset, but well behind GeoPandas for this query type. |
| 4.2. Tree Count and Length Sum of Residential Roads Near Hospitals (Pinerolo) | GeoPandas | Cold Start | pinerolo_hospitals.geoparquet, pinerolo_residential_streets.geoparquet and pinerolo_trees.geoparquet | Execution Time: ~0.015s | Result: 0 trees, ~5.0k meters. Numerically consistent with PostGIS. |
| 4.2. Tree Count and Length Sum of Residential Roads Near Hospitals (Pinerolo) | GeoPandas | Hot Cache Avg. | pinerolo_hospitals.geoparquet, pinerolo_residential_streets.geoparquet and pinerolo_trees.geoparquet | Execution Time: ~0.008s<br>Runs: 99 | Significantly faster than the databases thanks to sjoin_nearest. |
| 4.3. Area Not Covered by Parks (Pinerolo) | DuckDB Spatial | Cold Start | pinerolo_parks.geoparquet and osmnx boundary | Execution Time: ~0.033s | Result 49.97 km². Numerically correct; with the CRS fix, the result is now perfectly aligned. |
| 4.3. Area Not Covered by Parks (Pinerolo) | DuckDB Spatial | Hot Cache Avg. | pinerolo_parks.geoparquet and osmnx boundary | Execution Time: ~0.025s<br>Runs: 99 | Performance is competitive and the task is CPU-bound. |
| 4.3. Area Not Covered by Parks (Pinerolo) | PostGIS | Cold Start | pinerolo_parks table and osmnx boundary | Execution Time: ~0.021s | Result 49.97 km². Numerically correct; PostGIS handles the data correctly. |
| 4.3. Area Not Covered by Parks (Pinerolo) | PostGIS | Hot Cache Avg. | pinerolo_parks table and osmnx boundary | Execution Time: ~0.006s<br>Runs: 99 | Very fast and numerically robust. ~3.5x speedup from caching. |
| 4.3. Area Not Covered by Parks (Pinerolo) | GeoPandas | Cold Start | pinerolo_parks.geoparquet and osmnx boundary | Execution Time: ~0.005s | Result 49.97 km². Numerically correct, confirming the PostGIS/GEOS baseline. |
| 4.3. Area Not Covered by Parks (Pinerolo) | GeoPandas | Hot Cache Avg. | pinerolo_parks.geoparquet and osmnx boundary | Execution Time: ~0.005s<br>Runs: 99 | The fastest of the three for this task and numerically robust. |
| 4.1. Restaurants per Neighborhood (Milan) | DuckDB Spatial | Cold Start | milan_neighborhoods.geoparquet and milan_restaurants.geoparquet | Execution Time: ~0.119s<br>Output: 0.29 MB | Found 34 results, consistent with the other tools. |
| 4.1. Restaurants per Neighborhood (Milan) | DuckDB Spatial | Hot Cache Avg. | milan_neighborhoods.geoparquet and milan_restaurants.geoparquet | Execution Time: ~0.097s<br>Runs: 99 | Faster than PostGIS on this medium-scale join, but well behind GeoPandas. |
| 4.1. Restaurants per Neighborhood (Milan) | PostGIS | Cold Start | milan_neighborhoods and milan_restaurants tables | Execution Time: ~0.220s | Found 34 results. Slower than DuckDB on the cold run. |
| 4.1. Restaurants per Neighborhood (Milan) | PostGIS | Hot Cache Avg. | milan_neighborhoods and milan_restaurants tables | Execution Time: ~0.201s<br>Runs: 99 | Slower than DuckDB, and significantly slower than GeoPandas. |
| 4.1. Restaurants per Neighborhood (Milan) | GeoPandas | Cold Start | milan_neighborhoods.geoparquet and milan_restaurants.geoparquet | Execution Time: ~0.010s<br>Output: 0.28 MB | Found 34 results. Extraordinarily fast, showing the power of in-memory operations. |
| 4.1. Restaurants per Neighborhood (Milan) | GeoPandas | Hot Cache Avg. | milan_neighborhoods.geoparquet and milan_restaurants.geoparquet | Execution Time: ~0.009s<br>Runs: 99 | Over 20x faster than PostGIS. |
| 4.2. Tree Count and Length Sum of Residential Roads Near Hospitals (Milan) | DuckDB Spatial | Cold Start | milan_hospitals.geoparquet, milan_residential_streets.geoparquet and milan_trees.geoparquet | Execution Time: ~0.193s | Result: 234 trees, ~25.0k meters. Numerically correct and consistent with PostGIS/GeoPandas. |
| 4.2. Tree Count and Length Sum of Residential Roads Near Hospitals (Milan) | DuckDB Spatial | Hot Cache Avg. | milan_hospitals.geoparquet, milan_residential_streets.geoparquet and milan_trees.geoparquet | Execution Time: ~0.155s<br>Runs: 99 | Significantly faster than PostGIS, but much slower than GeoPandas. |
| 4.2. Tree Count and Length Sum of Residential Roads Near Hospitals (Milan) | PostGIS | Cold Start | milan_hospitals, milan_residential_streets and milan_trees tables | Execution Time: ~34.16s | Result: 234 trees, ~25.0k meters. The numerical baseline. |
| 4.2. Tree Count and Length Sum of Residential Roads Near Hospitals (Milan) | PostGIS | Hot Cache Avg. | milan_hospitals, milan_residential_streets and milan_trees tables | Execution Time: ~33.11s<br>Runs: 99 | The slowest technology by a large margin for this query type. |
| 4.2. Tree Count and Length Sum of Residential Roads Near Hospitals (Milan) | GeoPandas | Cold Start | milan_hospitals.geoparquet, milan_residential_streets.geoparquet and milan_trees.geoparquet | Execution Time: ~0.079s | Result: 234 trees, ~25.1k meters. Numerically consistent with PostGIS. |
| 4.2. Tree Count and Length Sum of Residential Roads Near Hospitals (Milan) | GeoPandas | Hot Cache Avg. | milan_hospitals.geoparquet, milan_residential_streets.geoparquet and milan_trees.geoparquet | Execution Time: ~0.082s<br>Runs: 99 | Significantly faster than the databases; sjoin_nearest is a clear winner. |
| 4.3. Area Not Covered by Parks (Milan) | DuckDB Spatial | Cold Start | milan_parks.geoparquet and osmnx boundary | Execution Time: ~0.209s | Result 162.98 km². Numerically correct. A massive improvement in performance and correctness. |
| 4.3. Area Not Covered by Parks (Milan) | DuckDB Spatial | Hot Cache Avg. | milan_parks.geoparquet and osmnx boundary | Execution Time: ~0.198s<br>Runs: 99 | Performance is now highly competitive with PostGIS. The task is CPU-bound. |
| 4.3. Area Not Covered by Parks (Milan) | PostGIS | Cold Start | milan_parks table and osmnx boundary | Execution Time: ~0.142s | Result 162.98 km². Numerically correct. PostGIS is the fastest for this operation. |
| 4.3. Area Not Covered by Parks (Milan) | PostGIS | Hot Cache Avg. | milan_parks table and osmnx boundary | Execution Time: ~0.138s<br>Runs: 99 | Shows extreme efficiency and numerical robustness for complex geometry operations. |
| 4.3. Area Not Covered by Parks (Milan) | GeoPandas | Cold Start | milan_parks.geoparquet and osmnx boundary | Execution Time: ~0.430s | Result 162.98 km². Numerically correct. Slower than PostGIS, but confirms the robustness of the GEOS engine. |
| 4.3. Area Not Covered by Parks (Milan) | GeoPandas | Hot Cache Avg. | milan_parks.geoparquet and osmnx boundary | Execution Time: ~0.423s<br>Runs: 99 | Confirms robustness, but is slower than the databases for this task. |
| 4.1. Restaurants per Neighborhood (Rome) | DuckDB Spatial | Cold Start | rome_neighborhoods.geoparquet and rome_restaurants.geoparquet | Execution Time: ~4.97s<br>Output: 2.23 MB | Found 52 results, consistent with the other tools. |
| 4.1. Restaurants per Neighborhood (Rome) | DuckDB Spatial | Hot Cache Avg. | rome_neighborhoods.geoparquet and rome_restaurants.geoparquet | Execution Time: ~4.36s<br>Runs: 99 | Faster than PostGIS on the largest dataset. |
| 4.1. Restaurants per Neighborhood (Rome) | PostGIS | Cold Start | rome_neighborhoods and rome_restaurants tables | Execution Time: ~6.68s | Found 52 results. Slower than DuckDB on the largest dataset. |
| 4.1. Restaurants per Neighborhood (Rome) | PostGIS | Hot Cache Avg. | rome_neighborhoods and rome_restaurants tables | Execution Time: ~5.32s<br>Runs: 99 | Caching provides a noticeable benefit. |
| 4.1. Restaurants per Neighborhood (Rome) | GeoPandas | Cold Start | rome_neighborhoods.geoparquet and rome_restaurants.geoparquet | Execution Time: ~0.034s<br>Output: 2.23 MB | Found 52 results. The performance gap is enormous on the largest dataset. |
| 4.1. Restaurants per Neighborhood (Rome) | GeoPandas | Hot Cache Avg. | rome_neighborhoods.geoparquet and rome_restaurants.geoparquet | Execution Time: ~0.031s<br>Runs: 99 | GeoPandas is ~140x faster than DuckDB for this point-in-polygon join. |
| 4.2. Tree Count and Length Sum of Residential Roads Near Hospitals (Rome) | DuckDB Spatial | Cold Start | rome_hospitals.geoparquet, rome_residential_streets.geoparquet and rome_trees.geoparquet | Execution Time: ~0.288s | Result: 441 trees, ~103.7k meters. Numerically correct and consistent. |
| 4.2. Tree Count and Length Sum of Residential Roads Near Hospitals (Rome) | DuckDB Spatial | Hot Cache Avg. | rome_hospitals.geoparquet, rome_residential_streets.geoparquet and rome_trees.geoparquet | Execution Time: ~0.258s<br>Runs: 99 | A massive performance improvement: now faster than GeoPandas on the cold run and competitive on hot runs. |
| 4.2. Tree Count and Length Sum of Residential Roads Near Hospitals (Rome) | PostGIS | Cold Start | rome_hospitals, rome_residential_streets and rome_trees tables | Execution Time: ~185.74s | Result: 441 trees, ~103.1k meters. The numerical baseline. |
| 4.2. Tree Count and Length Sum of Residential Roads Near Hospitals (Rome) | PostGIS | Hot Cache Avg. | rome_hospitals, rome_residential_streets and rome_trees tables | Execution Time: ~186.80s<br>Runs: 99 | The performance gap with the other tools is enormous (~1000x). |
| 4.2. Tree Count and Length Sum of Residential Roads Near Hospitals (Rome) | GeoPandas | Cold Start | rome_hospitals.geoparquet, rome_residential_streets.geoparquet and rome_trees.geoparquet | Execution Time: ~0.189s | Result: 441 trees, ~103.3k meters. Numerically consistent with PostGIS. |
| 4.2. Tree Count and Length Sum of Residential Roads Near Hospitals (Rome) | GeoPandas | Hot Cache Avg. | rome_hospitals.geoparquet, rome_residential_streets.geoparquet and rome_trees.geoparquet | Execution Time: ~0.186s<br>Runs: 99 | GeoPandas is ~1000x faster than PostGIS. |
| 4.3. Area Not Covered by Parks (Rome) | DuckDB Spatial | Cold Start | rome_parks.geoparquet and osmnx boundary | Execution Time: ~0.346s | Result 1,235 km². Numerically correct. A massive improvement. |
| 4.3. Area Not Covered by Parks (Rome) | DuckDB Spatial | Hot Cache Avg. | rome_parks.geoparquet and osmnx boundary | Execution Time: ~0.335s<br>Runs: 99 | Performance is now competitive with the other tools. |
| 4.3. Area Not Covered by Parks (Rome) | PostGIS | Cold Start | rome_parks table and osmnx boundary | Execution Time: ~0.248s | Result 1,235 km². Numerically correct. PostGIS is the fastest and most robust. |
| 4.3. Area Not Covered by Parks (Rome) | PostGIS | Hot Cache Avg. | rome_parks table and osmnx boundary | Execution Time: ~0.236s<br>Runs: 99 | Demonstrates extreme efficiency on large, complex geometric operations. |
| 4.3. Area Not Covered by Parks (Rome) | GeoPandas | Cold Start | rome_parks.geoparquet and osmnx boundary | Execution Time: ~0.782s | Result 1,235 km². Numerically correct. Slower than PostGIS, but robust. |
| 4.3. Area Not Covered by Parks (Rome) | GeoPandas | Hot Cache Avg. | rome_parks.geoparquet and osmnx boundary | Execution Time: ~0.784s<br>Runs: 99 | The GEOS-based tools are both numerically correct and performant for this task. |
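
The sjoin_nearest approach credited above for query 4.2 can be sketched as follows; the 500 m and 50 m distance thresholds and the EPSG:32632 metric CRS are assumptions about the benchmark's parameters.

```python
import geopandas as gpd

hospitals = gpd.read_parquet("milan_hospitals.geoparquet").to_crs(32632)
streets = gpd.read_parquet(
    "milan_residential_streets.geoparquet"
).to_crs(32632)
trees = gpd.read_parquet("milan_trees.geoparquet").to_crs(32632)

# Residential streets whose nearest hospital lies within 500 m (assumed).
matches = gpd.sjoin_nearest(streets, hospitals, max_distance=500)
near_streets = streets.loc[matches.index.unique()]
total_length_m = near_streets.length.sum()

# Trees whose nearest qualifying street lies within 50 m (assumed).
tree_matches = gpd.sjoin_nearest(trees, near_streets, max_distance=50)
print(tree_matches.index.nunique(), round(total_length_m))
```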

Appendix: Benchmarks on NON-OSM Data Formats

This section will track the performance of the core technologies on standard geospatial data formats, without the complexity of OSM-specific parsing libraries.

Pure Vector Data (Shapefile)

Goal: To measure ingestion and filtering performance on a standard Shapefile dataset, analyzing the impact of system caches.

Use Case 1: Ingestion

| Operation Performed | Technology | Run Type | Test Dataset | Key Metrics | Notes / Qualitative Observations |
| --- | --- | --- | --- | --- | --- |
| Read Shapefile using ST_Read | DuckDB Spatial | Cold Start | Com01012025_WGS84.shp | Execution Time: ~0.44s | Found 7896 features. Very fast initial read from disk. |
| Read Shapefile using ST_Read | DuckDB Spatial | Hot Cache Avg. | Com01012025_WGS84.shp | Execution Time: ~0.33s<br>Runs: 99 | The OS file cache provides a noticeable, but smaller (~1.3x), speedup. |
| Load Shapefile into database | PostGIS + shp2pgsql | N/A | Com01012025_WGS84.shp | Execution Time: ~21s | Represents the one-time cost of manual ingestion via the CLI. Significantly slower due to database overhead. |
| Read Shapefile into GeoDataFrame | GeoPandas | Cold Start | Com01012025_WGS84.shp | Execution Time: ~0.83s | Found 7896 features. Performance reflects reading the file from disk. |
| Read Shapefile into GeoDataFrame | GeoPandas | Hot Cache Avg. | Com01012025_WGS84.shp | Execution Time: ~0.34s<br>Runs: 99 | The ~2.4x speedup is due to the OS file cache. |

Use Case 2: Filtering

| Operation Performed | Technology | Run Type | Test Dataset | Key Metrics | Notes / Qualitative Observations |
| --- | --- | --- | --- | --- | --- |
| Filter table by attribute | DuckDB Spatial | Cold Start | Com01012025_WGS84.shp | Execution Time: ~0.020s<br>Output: 12.57 MB | Filtered to 1180 features. Very fast in-memory query. |
| Filter table by attribute | DuckDB Spatial | Hot Cache Avg. | Com01012025_WGS84.shp | Execution Time: ~0.019s<br>Runs: 99 | CPU-bound, with minimal difference between cold and hot runs. |
| Filter table by attribute | PostGIS | Cold Start | comuni_istat table | Execution Time: ~0.22s | Found 1180 features. Represents a query where data is read from disk into the database cache. |
| Filter table by attribute | PostGIS | Hot Cache Avg. | comuni_istat table | Execution Time: ~0.075s<br>Runs: 99 | The ~2.9x speedup is due to PostgreSQL's internal caching mechanism. |
| Filter GeoDataFrame by attribute | GeoPandas | Cold Start | Com01012025_WGS84.shp | Execution Time: ~0.0010s<br>Output: 12.58 MB | Filtered to 1180 features. The in-memory operation is extremely fast. |
| Filter GeoDataFrame by attribute | GeoPandas | Hot Cache Avg. | Com01012025_WGS84.shp | Execution Time: ~0.0009s<br>Runs: 99 | Negligible difference from the cold start, confirming the operation is CPU-bound. |
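
The GeoPandas filter can be sketched in two lines; the COD_REG column and region code 1 (Piedmont) are assumptions about the ISTAT schema.

```python
import geopandas as gpd

# Load the national municipalities layer, then filter by attribute.
comuni = gpd.read_file("Com01012025_WGS84.shp")
piedmont = comuni[comuni["COD_REG"] == 1]  # assumed Piedmont region code
print(len(piedmont))  # expected: 1180 features
```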

Raster Data (GeoTIFF)

Goal: To measure ingestion and filtering performance on a large-scale raster dataset (GHS-POP), analyzing the impact of system caches and comparing file-based vs database approaches.

Use Case 1: Ingestion

| Operation Performed | Technology | Run Type | Test Dataset | Key Metrics | Notes / Qualitative Observations |
| --- | --- | --- | --- | --- | --- |
| Open GeoTIFF file | DuckDB Spatial | N/A | GHS_POP_ITALY_100m.tif | Execution Time: N/A | The technology does not support native raster operations. |
| Load raster into database | PostGIS + raster2pgsql | N/A | GHS_POP_ITALY_100m.tif | Execution Time: ~370s (~6.2 min) | One-time cost of manual ingestion via the CLI. Includes table creation, tiling, and indexing of the entire Italy raster. |
| Open GeoTIFF file | Python (rioxarray) | Cold Start | GHS_POP_ITALY_100m.tif | Execution Time: ~0.15s | Measures the time to read file metadata from disk. |
| Open GeoTIFF file | Python (rioxarray) | Hot Cache Avg. | GHS_POP_ITALY_100m.tif | Execution Time: ~0.03s<br>Runs: 99 | The ~5.7x speedup is due to OS file caching of the file headers. |
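
The rioxarray "open" step only touches file headers, which is why it is so fast; a minimal sketch:

```python
import rioxarray

# Lazily open the merged GHS-POP raster; only metadata is read here.
pop = rioxarray.open_rasterio("GHS_POP_ITALY_100m.tif")
print(pop.rio.crs, pop.shape)
```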

Use Case 2: Filtering

| Operation Performed | Technology | Run Type | Test Dataset | Key Metrics | Notes / Qualitative Observations |
| --- | --- | --- | --- | --- | --- |
| Clip raster to Milan | DuckDB Spatial | N/A | GHS_POP_ITALY_100m.tif | Execution Time: N/A | The technology does not support native raster operations. |
| Clip raster to Milan | PostGIS | Cold Start | ghs_population table | Execution Time: ~8.12s | Found 1,256,044 people. The query is computationally expensive. |
| Clip raster to Milan | PostGIS | Hot Cache Avg. | ghs_population table | Execution Time: ~7.40s<br>Runs: 99 | The minimal speedup confirms the operation is CPU-bound (raster processing) rather than I/O-bound. |
| Clip raster to Milan | Python (rioxarray) | Cold Start | GHS_POP_ITALY_100m.tif | Execution Time: ~7.57s<br>Output: 0.08 MB | Found 1,256,044 people. Performance is highly competitive with PostGIS. |
| Clip raster to Milan | Python (rioxarray) | Hot Cache Avg. | GHS_POP_ITALY_100m.tif | Execution Time: ~7.91s<br>Runs: 99 | Nearly identical to the cold start, definitively proving the operation is CPU-bound. |
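
A hedged sketch of the Milan clip, fetching the boundary with osmnx and opening the raster with `masked=True` so nodata cells drop out of the population sum:

```python
import osmnx
import rioxarray

# Open with masked=True so nodata cells become NaN and are ignored by sums.
pop = rioxarray.open_rasterio("GHS_POP_ITALY_100m.tif", masked=True)

# Boundary polygon, reprojected to the raster's CRS before clipping.
milan = osmnx.geocode_to_gdf("Milan, Italy").to_crs(pop.rio.crs)

clipped = pop.rio.clip(milan.geometry, milan.crs)
print(float(clipped.sum()))  # expected ≈ 1,256,044 people
```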

(Note: GHS_POP_ITALY_100m.tif was generated by merging, via the GDAL CLI, the four GHS-POP tiles downloaded from Copernicus that together cover all of Italy.)


Use Case 5: Vector-Raster Analysis

Goal: To evaluate the performance of different technologies for calculating raster statistics based on vector zones.

| Operation Performed | Technology | Run Type | Test Dataset | Key Metrics | Notes / Qualitative Observations |
| --- | --- | --- | --- | --- | --- |
| Calculate Population per Municipality (Piedmont) | PostGIS | Cold Start | comuni_istat_clean and ghs_population tables | Execution Time: ~2.50s<br>GPKG Loading Time: ~17.87s | Calculated a total population of 4,222,758 across 1180 municipalities. The result is numerically identical to the Python stack, although an additional script was needed to convert the vector data to a GPKG file and reload it into the database, which accounts for the extra loading time. |
| Calculate Population per Municipality (Piedmont) | PostGIS | Hot Cache Avg. | comuni_istat_clean and ghs_population tables | Execution Time: ~2.43s<br>Runs: 99 | The operation is CPU-bound. PostGIS is ~5.5x faster than the Python stack for this task. |
| Calculate Population per Municipality (Piedmont) | Python (rasterstats) | Cold Start | Com01012025_WGS84.shp and GHS_POP_ITALY_100m.tif | Execution Time: ~13.97s | Calculated a total population of 4,222,758 across 1180 municipalities. The result is numerically identical to PostGIS. |
| Calculate Population per Municipality (Piedmont) | Python (rasterstats) | Hot Cache Avg. | Com01012025_WGS84.shp and GHS_POP_ITALY_100m.tif | Execution Time: ~13.45s<br>Runs: 99 | The operation is also CPU-bound. The high-level libraries introduce a significant performance overhead compared to the database. |

(Note: DuckDB still does not support native raster operations.)
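
Finally, a hedged sketch of the Use Case 5 zonal statistics with rasterstats; the COD_REG == 1 Piedmont filter is an assumption about the ISTAT schema, and the vector layer is reprojected to the raster's CRS before the statistics are computed.

```python
import geopandas as gpd
import rasterio
from rasterstats import zonal_stats

# Restrict the national layer to Piedmont (assumed region code 1).
comuni = gpd.read_file("Com01012025_WGS84.shp")
piedmont = comuni[comuni["COD_REG"] == 1]

# Align the vector CRS with the raster before computing zonal sums.
with rasterio.open("GHS_POP_ITALY_100m.tif") as src:
    piedmont = piedmont.to_crs(src.crs)

stats = zonal_stats(piedmont, "GHS_POP_ITALY_100m.tif", stats=["sum"])
total = sum(s["sum"] or 0 for s in stats)
print(round(total))  # expected ≈ 4,222,758
```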