Address layer #10
Conversation
cc @missinglink; this might be interesting for you. My short-term goal is to make this a more efficient layer generation pipeline that Pelias and other geocoders could ingest. When running with a filtered PBF, I can get a GeoParquet out in < 3 hours, and it's really easy to index from there! I'll be doing some deeper verification next week, but I think I've captured most of the edge cases, like street relations.
hey @ianthetechie, looking good. I noticed you added a bunch of special cases for Czechia etc. 👍
Nice! I'm not sure how layercake works internally; in the past I had success with running over the PBF file once to generate bitmasks of the target N/W/R (node/way/relation) IDs and then doing a second pass over it to pull out the data. Generally each pass over the planet takes ~20 mins, which is something to consider if that's not already how it works.
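A minimal sketch of that two-pass idea with PyOsmium, using plain Python sets in place of bitmasks; the tag filter and file name are hypothetical, and this is not how layercake actually works:

```python
import osmium


class IdCollector(osmium.SimpleHandler):
    """Pass 1: record the IDs of the ways we care about (hypothetical filter)."""

    def __init__(self):
        super().__init__()
        self.way_ids = set()

    def way(self, w):
        if "addr:housenumber" in w.tags:
            self.way_ids.add(w.id)


class DataExtractor(osmium.SimpleHandler):
    """Pass 2: pull out the actual data, but only for the collected IDs."""

    def __init__(self, way_ids):
        super().__init__()
        self.way_ids = way_ids
        self.rows = []

    def way(self, w):
        if w.id in self.way_ids:
            self.rows.append({t.k: t.v for t in w.tags})


collector = IdCollector()
collector.apply_file("planet.osm.pbf")

extractor = DataExtractor(collector.way_ids)
extractor.apply_file("planet.osm.pbf")
```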
Yeah 😎
Not like that 😅 It currently operates over the entire (unfiltered) planet in the OSM US hosted iteration. In our internal (Stadia Maps) pipelines, I run an osmium tags-filter in advance. This theoretically shouldn't be needed, but it's clear that our current understanding of PyOsmium is flawed; I've opened an issue to that effect over in #20. ~20 mins is about the time I see for the osmium tags-filter pass as well, btw. Pre-filtering lets the pipeline proceed entirely in memory, and it spits out the Parquet file in < 3 hours of total wall clock time. From there, it's basically a matter of how fast your Elasticsearch cluster can index for geocoding.
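For context on the final step, writing rows accumulated in memory out with pyarrow could look roughly like the sketch below. This produces a plain Parquet file; the actual GeoParquetWriter presumably also attaches geometry and GeoParquet metadata, and the rows and schema here are made up for illustration.

```python
import pyarrow
import pyarrow.parquet

# Hypothetical rows accumulated in memory during the single pass over the PBF.
rows = [
    {"addr:housenumber": "221B", "addr:street": "Baker Street", "addr:unit": None},
    {"addr:housenumber": "10", "addr:street": "Downing Street", "addr:unit": None},
]

schema = pyarrow.schema([
    ("addr:housenumber", pyarrow.string()),
    ("addr:street", pyarrow.string()),
    ("addr:unit", pyarrow.string()),
])

# Build an Arrow table from the in-memory rows and write it as Parquet.
table = pyarrow.Table.from_pylist(rows, schema=schema)
pyarrow.parquet.write_table(table, "addresses.parquet")
```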
| ("addr:conscriptionnumber", pyarrow.string()), | ||
| ("addr:streetnumber", pyarrow.string()), | ||
| ("addr:provisionalnumber", pyarrow.string()), | ||
| ("addr:unit", pyarrow.string()), |
addr:unit=* was originally approved for any kind of unit in an address, but it turns out that some countries’ postal systems distinguish between various kinds of units. For example, in the U.S., the unit number is generally required to come with a unit designator, and generic substitutes like “#” aren’t acceptable. Many mappers and imports have simply dropped the designator on the floor, but there’s nontrivial undocumented use of each of the common designators as subkeys: addr:building=*, addr:room=*, etc.
At the very least, I would add addr:floor=*, which is documented and quite common. For the rest, you’ll probably want to decide whether you like having a separate column for each possible designator or would prefer something more unified like addr:unit:label=*.
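For illustration, the unified option might look something like the sketch below. addr:floor is the documented key mentioned above; addr:unit:label is just a placeholder name for the designator column, and the exact set of columns is the open question here.

```python
import pyarrow

# Sketch only: a unified designator column alongside the existing unit column.
COLUMNS = [
    ("addr:housenumber", pyarrow.string()),
    ("addr:unit", pyarrow.string()),
    ("addr:unit:label", pyarrow.string()),  # unit designator, e.g. "Apt", "Suite" (placeholder name)
    ("addr:floor", pyarrow.string()),       # documented and quite common
]
```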
class AddressesWriter(GeoParquetWriter):
    COLUMNS = [
        ("addr:housenumber", pyarrow.string()),
When a feature has many house numbers, some mappers and imports enumerate all of them, while others express a range. Is there any interest in either parsing value lists and ranges into something more structured or consolidating them into a human-readable value? Same for addr:unit=*.
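A rough sketch of the structured option, purely for discussion (the helper name and heuristics are hypothetical): split semicolon-separated lists and expand simple numeric ranges, passing anything ambiguous through unchanged.

```python
import re


def expand_housenumbers(value: str) -> list[str]:
    """Expand values like "2-6;11" into ["2", "3", "4", "5", "6", "11"].

    Non-numeric or ambiguous parts are passed through as-is.
    """
    numbers = []
    for part in (p.strip() for p in value.split(";")):
        m = re.fullmatch(r"(\d+)-(\d+)", part)
        if m:
            lo, hi = int(m.group(1)), int(m.group(2))
            if lo <= hi and hi - lo <= 1000:  # guard against implausibly large ranges
                numbers.extend(str(n) for n in range(lo, hi + 1))
                continue
        numbers.append(part)
    return numbers
```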
Opening as a draft since this builds on #7 and there are quite a few things open for discussion (marked clearly with TODO comments).