[BLOG] Geospatial Blog #156

Merged: alamb merged 9 commits into apache:production from jiayuasu:alamb/blog_template on Feb 13, 2026

@jiayuasu (Member) commented Feb 5, 2026

I took @alamb's template and created this PR. I hope this is OK.

The idea for this blog post was inspired by this issue, and the initial draft is in this Google doc.

Looking forward to having this blog post on Parquet website!

@alamb (Collaborator) left a comment

Thank you @jiayuasu -- I think this looks great!

Here is a preview of what it looks like:

[Image: preview of the rendered blog post]

If anyone else is interested, here is what the navigation looks like:

[Image: Screenshot 2026-02-05 at 7 01 32 AM]

cc @kylebarron in case you are interested in this content as well

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

@2010YOUY01 left a comment

Thank you, this is a great read! I left a few suggestions for you to consider.

jiayuasu force-pushed the alamb/blog_template branch from f257a21 to e20bf95 on February 6, 2026 07:46

@paleolimbot (Member) left a comment

Thank you @jiayuasu and @alamb for driving this and thanks all for the comments! I took a read through and this is excellent.

@steveloughran left a comment

It's been pointed out to me that the coverage matrix doesn't cover statistics/geometry bounding, without which predicate pushdown doesn't work: every row group with the column needs scanning.

https://github.com/apache/parquet-java/blob/7190ab6a571c3cffd31e267042c193e90c0301ad/parquet-column/src/main/java/org/apache/parquet/column/statistics/geospatial/GeospatialStatistics.java#L99

https://github.com/apache/arrow/blob/a82edf90ce66eb9a9a9e3bbac514e5d51f531c1f/cpp/src/parquet/geospatial/util_internal.h#L181
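
To make the pushdown point concrete, here is a minimal sketch of how per-row-group bounding boxes let a reader skip data. The names `BBox`, `RowGroupStats`, and `prune_row_groups` are hypothetical and are not the API of parquet-java, Arrow, or any other library. The sketch also assumes planar boxes with `xmin <= xmax`, so it ignores boxes that wrap around the antimeridian.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BBox:
    """Axis-aligned bounding box (hypothetical type, planar coordinates)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "BBox") -> bool:
        # Two boxes overlap only if they overlap on both axes.
        return (self.xmin <= other.xmax and other.xmin <= self.xmax
                and self.ymin <= other.ymax and other.ymin <= self.ymax)


@dataclass(frozen=True)
class RowGroupStats:
    """Hypothetical per-row-group metadata for one geospatial column."""
    index: int
    bbox: Optional[BBox]  # None models a row group written without statistics


def prune_row_groups(groups: List[RowGroupStats], query: BBox) -> List[int]:
    """Return the indexes of the row groups that still have to be scanned."""
    # A row group without a bounding box can never be skipped safely.
    return [g.index for g in groups
            if g.bbox is None or g.bbox.intersects(query)]


groups = [
    RowGroupStats(0, BBox(0, 0, 10, 10)),
    RowGroupStats(1, BBox(20, 20, 30, 30)),
    RowGroupStats(2, None),
]
print(prune_row_groups(groups, BBox(5, 5, 8, 8)))  # prints [0, 2]
```

Row group 1 is skipped because its box cannot contain a match, while row group 2 must be read because nothing is known about it. That second case is the "every row group needs scanning" situation when statistics are missing.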

Maybe a "what next?" paragraph:

Geospatial support in Parquet is still ongoing; as of February 2026, column statistics collection is incomplete, which means that scanning some types may require reading all the data. Furthermore, the query engines themselves need to adopt the new format extensions.

What you do get now is the ability to save geospatial data in Parquet files, with support in those query engines increasing over time.

@alamb (Collaborator) commented Feb 9, 2026

> It's been pointed out to me that the coverage matrix doesn't cover statistics/geometry bounding, without which predicate pushdown doesn't work: every rowgroup with the column needs scanning.

> Geospatial support in Parquet is still ongoing; as of February 2026 columns statistics collection is incomplete, which means that scanning some types may require reading all the data. Furthermore the query engines themselves need to adopt the new format extensions.

Maybe a more accurate summary is that the column statistics collection is not yet fully integrated into all engines.

FWIW the Rust Parquet implementation does handle such statistics (thanks to @kylebarron and @paleolimbot as I recall) -- https://docs.rs/parquet/latest/parquet/format/struct.GeospatialStatistics.html, and I think SedonaDB has already integrated it into their query engine as well.

Perhaps we can add a line to the https://parquet.apache.org/docs/file-format/implementationstatus/ page for these (doing so seems to have the effect of encouraging additional ecosystem adoption).

@csringhofer commented

Reflecting on the discussion about incomplete statistics support:

I checked a few implementations, and while writing statistics for geometries seems to be there in general, I haven't found a single implementation of geography with any edge interpolation algorithm. The Rust implementation seems to handle the stats for points (where edge interpolation is not needed) and allows users to inject their own implementation.
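
To illustrate why edge interpolation matters for geography bounding boxes, here is a minimal sketch under simplifying assumptions: a perfect sphere, non-antipodal endpoints, and hypothetical helper names (`to_unit_vector`, `great_circle_midpoint_lat`) that do not come from any Parquet library. It shows that the great-circle edge between two points at the same latitude bulges toward the pole, so a box built only from the vertices would be too small.

```python
import math


def to_unit_vector(lat_deg: float, lon_deg: float) -> tuple:
    """Convert latitude/longitude in degrees to a 3D unit vector on the sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def great_circle_midpoint_lat(lat1: float, lon1: float,
                              lat2: float, lon2: float) -> float:
    """Latitude (degrees) of the midpoint of the great-circle arc between two points.

    Assumes the points are not antipodal (the vector sum would be zero).
    """
    a = to_unit_vector(lat1, lon1)
    b = to_unit_vector(lat2, lon2)
    # The normalized sum of two unit vectors points at the arc's midpoint.
    s = tuple(x + y for x, y in zip(a, b))
    norm = math.sqrt(sum(c * c for c in s))
    return math.degrees(math.asin(s[2] / norm))


# Both endpoints sit at 50 degrees north, 90 degrees of longitude apart.
vertex_max_lat = 50.0
mid_lat = great_circle_midpoint_lat(50.0, 0.0, 50.0, 90.0)
print(f"vertex-only ymax: {vertex_max_lat:.1f}, edge midpoint latitude: {mid_lat:.1f}")
```

With these inputs the midpoint lies a little above 59 degrees north, roughly nine degrees beyond what a vertex-only bounding box would record. For point columns there are no edges, which is why their statistics do not need interpolation.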

> Maybe a more accurate summary is that the column statistics collection is not yet fully integrated into all engines.

I agree in case of geometry, but I think that it would make things clearer to mention that for geography this is incomplete, at least in common open source libraries. The blog post mentions "Spatial statistics" as a core feature and generally mentions geometry and geography side by side, so the reader may assume that statistics support is widely available for both logical types. This also affects the choice of type: if bounding boxes are not yet available for geography and per-file skipping is critical, then users should try to build their workload on geometry.

I don't know the status of statistics implementation of geography, but I haven't seen PRs about this, so my assumption is that it may take a significant time to have at least spherical interpolation available widely in Parquet libraries (or extension libraries). I would be happy to be proven wrong :)

Btw the blog was a great read!

@steveloughran commented

@csringhofer I think @alamb's suggestion about updating the implementation status page might be the right tactic, where

  • geometry and geography are somehow separated
  • stats collection is a feature to note
  • stats usage is a feature to note: engines and anything that caches aggregate file stats, e.g. Iceberg v4 manifests

(this'd be so much easier if the flat-earthers were right, though then GPS wouldn't work so measuring locations would be a PITA)

@alamb (Collaborator) commented Feb 10, 2026

Thank you @csringhofer and @steveloughran -- I tried to capture the suggestions on how to improve the status page in a ticket:

> I agree in case of geometry, but I think that it would make things clearer to mention that for geography this is incomplete, at least in common open source libraries.

Personally, I think this would make the page more confusing, because:

  1. I think the point of this blog is to give a high-level introduction to the feature from the Parquet perspective.
  2. The implementation status will likely change over time, so such a statement will become outdated.

I think a separate blog post describing the current state of implementation as of a certain date would be quite valuable for others evaluating potential solutions for their projects.

@alamb (Collaborator) commented Feb 10, 2026

Unless there are any objections, I'll plan to update the date and merge this PR (and publish the blog) tomorrow.

@paleolimbot (Member) commented

> I don't know the status of statistics implementation of geography, but I haven't seen PRs about this, so my assumption is that it may take a significant time to have at least spherical interpolation available widely in Parquet libraries (or extension libraries).

I think it's accurate to say that writing statistics for non-point Geography columns has not been implemented yet; however, I don't think that is inconsistent with the message we are collectively trying to put out with this post (an overview of spatial types in Parquet and celebration of the significant progress we were able to make over the last year).

@alamb (Collaborator) commented Feb 10, 2026

There seems to be an issue with the Parquet site's style sheets / jQuery: #159

I'll try and find some time to look at this over the next day or two

@csringhofer commented

> I tried to capture the suggestions on how to improve the status page in a ticket:

Thanks, having a more detailed status page would be a great help for people trying to get an overview or looking for a reference implementation.

> however, I don't think that is inconsistent with the message we are collectively trying to put out with this post (an overview of spatial types in Parquet and celebration of the significant progress we were able to make over the last year).

> the implementation status will likely change over time, so such a statement will become outdated

I see the point - probably there are many things where the implementations could be improved besides geography statistics, and it is not in the scope of the article to go into these.

@julienledem (Member) left a comment

Thank you for all your contributions, this is going to be a great post!

@alamb (Collaborator) commented Feb 11, 2026

The https://parquet.apache.org/ site is still kind of broken (at least for me), see

I think we should fix the site before we publish this post and draw more people's attention there.

I have a proposed fix here, and I would appreciate it if someone could help review it.

alamb mentioned this pull request on Feb 12, 2026
3. **Engine interoperability**
Because the spatial meaning is encoded as a Parquet logical type, engines do not need out of band conventions to interpret the column. A reader that understands Parquet geospatial types can immediately treat the column as a spatial object.
4. **Coordinate Reference System (CRS) information**
CRS information is stored at the file metadata (i.e., type definition) using authoritative identifiers or structured definitions such as EPSG codes or PROJJSON strings.
Collaborator commented:

Nit: "CRS information is stored in the file metadata"

Collaborator commented:

Since I was updating the publish date anyway, I also fixed this typo in fff1995. Hopefully that is OK, @jiayuasu -- if it was a mistake, I will make a PR to revert it.

@alamb (Collaborator) commented Feb 13, 2026

Thanks to @vinooganesh and @emkornfield I think we are good to publish this blog now 🎉

I updated the date to Feb 13 and will merge this PR once the CI passes

alamb merged commit e550179 into apache:production on Feb 13, 2026 (1 check passed)

@alamb (Collaborator) commented Feb 13, 2026

Development

Successfully merging this pull request may close these issues.

Blog on Geospatial

10 participants