Address layer #10
Conversation
cc @missinglink; this might be interesting for you. My short-term goal is to make this a more efficient layer generation pipeline that Pelias and other geocoders could ingest. When running with a filtered PBF, I can get a GeoParquet out in < 3 hours, and it's really easy to index from there! I'll be doing some deeper verification next week, but I think I've captured most of the edge cases, like street relations.
hey @ianthetechie, looking good. I noticed you added a bunch of special cases for Czechia etc. 👍
Nice! I'm not sure how layercake works internally; in the past I had success with running over the PBF file once to generate bitmasks of the target N/W/R (node/way/relation) IDs and then doing a second pass over it to pull out the data. Generally each pass over the planet takes ~20 mins, which is something to consider if that's not already how it works.
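A minimal sketch of that two-pass idea with PyOsmium, using plain Python sets in place of bitmasks; the tag filter and file name are hypothetical, and this is not how layercake actually works:

```python
import osmium


class IdCollector(osmium.SimpleHandler):
    """Pass 1: record the IDs of the ways we care about (hypothetical filter)."""

    def __init__(self):
        super().__init__()
        self.way_ids = set()

    def way(self, w):
        if "addr:housenumber" in w.tags:
            self.way_ids.add(w.id)


class DataExtractor(osmium.SimpleHandler):
    """Pass 2: pull out the actual data, but only for the collected IDs."""

    def __init__(self, way_ids):
        super().__init__()
        self.way_ids = way_ids
        self.rows = []

    def way(self, w):
        if w.id in self.way_ids:
            self.rows.append({t.k: t.v for t in w.tags})


collector = IdCollector()
collector.apply_file("planet.osm.pbf")

extractor = DataExtractor(collector.way_ids)
extractor.apply_file("planet.osm.pbf")
```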
Yeah 😎
Not like that 😅 It currently operates over the entire (unfiltered) planet in the OSM US hosted iteration. In our internal (Stadia Maps) pipelines, I run an osmium tags-filter in advance. This theoretically shouldn't be needed, but it's clear that our current understanding of PyOsmium is flawed; I've opened an issue to that effect over in #20. ~20 mins is about the time I see for the osmium tags-filter pass as well, btw. Pre-filtering lets the pipeline proceed entirely in memory, and it spits out the Parquet file in < 3 hours of total wall clock time. From there, it's basically a matter of how fast your Elasticsearch cluster can index for geocoding.
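For context on the final step, writing rows accumulated in memory out with pyarrow could look roughly like the sketch below. This produces a plain Parquet file; the actual GeoParquetWriter presumably also attaches geometry and GeoParquet metadata, and the rows and schema here are made up for illustration.

```python
import pyarrow
import pyarrow.parquet

# Hypothetical rows accumulated in memory during the single pass over the PBF.
rows = [
    {"addr:housenumber": "221B", "addr:street": "Baker Street", "addr:unit": None},
    {"addr:housenumber": "10", "addr:street": "Downing Street", "addr:unit": None},
]

schema = pyarrow.schema([
    ("addr:housenumber", pyarrow.string()),
    ("addr:street", pyarrow.string()),
    ("addr:unit", pyarrow.string()),
])

# Build an Arrow table from the in-memory rows and write it as Parquet.
table = pyarrow.Table.from_pylist(rows, schema=schema)
pyarrow.parquet.write_table(table, "addresses.parquet")
```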
| ("addr:conscriptionnumber", pyarrow.string()), | ||
| ("addr:streetnumber", pyarrow.string()), | ||
| ("addr:provisionalnumber", pyarrow.string()), | ||
| ("addr:unit", pyarrow.string()), |
addr:unit=* was originally approved for any kind of unit in an address, but it turns out that some countries’ postal systems distinguish between various kinds of units. For example, in the U.S., the unit number is generally required to come with a unit designator, and generic substitutes like “#” aren’t acceptable. Many mappers and imports have simply dropped the designator on the floor, but there’s nontrivial undocumented use of each of the common designators as subkeys: addr:building=*, addr:room=*, etc.
At the very least, I would add addr:floor=*, which is documented and quite common. For the rest, you’ll probably want to decide whether you like having a separate column for each possible designator or would prefer something more unified like addr:unit:label=*.
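For illustration, the unified option might look something like the sketch below. addr:floor is the documented key mentioned above; addr:unit:label is just a placeholder name for the designator column, and the exact set of columns is the open question here.

```python
import pyarrow

# Sketch only: a unified designator column alongside the existing unit column.
COLUMNS = [
    ("addr:housenumber", pyarrow.string()),
    ("addr:unit", pyarrow.string()),
    ("addr:unit:label", pyarrow.string()),  # unit designator, e.g. "Apt", "Suite" (placeholder name)
    ("addr:floor", pyarrow.string()),       # documented and quite common
]
```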
class AddressesWriter(GeoParquetWriter):
    COLUMNS = [
        ("addr:housenumber", pyarrow.string()),
When a feature has many house numbers, some mappers and imports enumerate all of them, while others express a range. Is there any interest in either parsing value lists and ranges into something more structured or consolidating them into a human-readable value? Same for addr:unit=*.
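A rough sketch of the structured option, purely for discussion (the helper name and heuristics are hypothetical): split semicolon-separated lists and expand simple numeric ranges, passing anything ambiguous through unchanged.

```python
import re


def expand_housenumbers(value: str) -> list[str]:
    """Expand values like "2-6;11" into ["2", "3", "4", "5", "6", "11"].

    Non-numeric or ambiguous parts are passed through as-is.
    """
    numbers = []
    for part in (p.strip() for p in value.split(";")):
        m = re.fullmatch(r"(\d+)-(\d+)", part)
        if m:
            lo, hi = int(m.group(1)), int(m.group(2))
            if lo <= hi and hi - lo <= 1000:  # guard against implausibly large ranges
                numbers.extend(str(n) for n in range(lo, hi + 1))
                continue
        numbers.append(part)
    return numbers
```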
Opening as a draft since this builds on #7 and there are quite a few things open for discussion (marked clearly with TODO comments).