Conversation

@alippai (Contributor) commented Dec 7, 2025

Rationale for this change

Implement #42018

What changes are included in this PR?

Conversion in numpy->arrow direction with multiple string types

Are these changes tested?

Two basic conversion tests added

Are there any user-facing changes?

Yes, adds support to numpy.StringDType as source

I'm not sure what the AI policy is for apache/arrow; this PR was created using OpenAI Codex.

cc @jorisvandenbossche as he opened the original issue

@github-actions (bot) commented Dec 7, 2025

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@alippai changed the title from "Fix StringDType helper declaration and initialize UTF8" to "GH-42018: Add numpy.StringDType support" on Dec 7, 2025
@github-actions (bot) commented Dec 7, 2025

⚠️ GitHub issue #42018 has been automatically assigned in GitHub to PR creator.

@alippai (Contributor, Author) commented Dec 8, 2025

I have no idea if this is the right way to implement numpy compat in arrow. Also, I'll add a better test case for the na_value param.

@pitrou (Member) commented Dec 8, 2025

> I'm not sure what the AI policy is for apache/arrow; this PR was created using OpenAI Codex.

Did you review the code to ensure it was/looked correct?

@alippai (Contributor, Author) commented Dec 8, 2025

@pitrou yes! I don't work with C++ code professionally, so I lack the knowledge to know whether, e.g., the NumPy 2.0+ feature usage is actually good or ridiculous.

This was my first attempt and experiment with using AI for OSS development. The issue turned out to be more complex than I initially expected. The PR is still small enough, so I'd appreciate feedback on whether this is a good direction, and I totally understand if you say this is simply not reviewable and would waste the participants' time.

@pitrou (Member) commented Dec 8, 2025

I might try to take a look but I will first have to get acquainted with the StringDType.

@rok (Member) commented Dec 8, 2025

@jorisvandenbossche might have insight.

@WillAyd (Contributor) commented Dec 8, 2025

cc @ngoldbaum

inline npy_string_allocator* ArrowNpyString_acquire_allocator(
const PyArray_StringDTypeObject* descr) {
using Func = npy_string_allocator* (*)(const PyArray_StringDTypeObject*);
auto func = reinterpret_cast<Func>(PyArray_API[316]);
Contributor

Are these functions exposed or are they only accessible through the PyArray_API?

Contributor Author

This is a reimplementation of the public API: the NumPy API version targeted by pyarrow was below 2.0, so these functions were "hidden". This way the code compiles with both old and new NumPy. I didn't find a better example in pyarrow of how to onboard new API. The alternative is dropping NumPy 1.26 support (which might be allowed if pyarrow follows SPEC 0).
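As an aside for readers unfamiliar with the mechanism: NumPy's C API ships as a table of untyped pointers, and a caller recovers an entry by casting a numbered slot to its documented signature. Below is a minimal stand-alone sketch of that pattern; the table, slot index, and function names are invented for illustration, and only the casting style mirrors the snippet above:

```cpp
#include <cstdint>

// Stand-in for NumPy's PyArray_API: a table of untyped pointers where
// each numbered slot is recovered by casting it to the function
// signature documented for that slot.
using Slot0Func = std::int64_t (*)(std::int64_t);

static std::int64_t double_it(std::int64_t x) { return 2 * x; }

// A fake API table; real code would use the PyArray_API array filled in
// by import_array().
static void* fake_api_table[] = {
    reinterpret_cast<void*>(&double_it),  // slot 0
};

inline std::int64_t call_slot0(std::int64_t x) {
  // Cast the untyped slot back to its known signature, then call it.
  auto func = reinterpret_cast<Slot0Func>(fake_api_table[0]);
  return func(x);
}
```

The fragility of this approach is exactly what the reviewers discuss below: the slot number and signature are a contract with the NumPy headers, and nothing checks them at compile time.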

Contributor

The idea is unfortunately right, I think, since you can't put this in a different compilation unit with NPY_TARGET_VERSION=NPY_2_0_API_VERSION and you can't switch NPY_TARGET_VERSION within the same compilation unit.

Maybe for these specific functions that was actually unnecessary (and if you/@ngoldbaum like, we could make them always defined in 2.5; it would just segfault if you ever used it, but we/I missed that it is impossible for there to be anything to use it on).
That way, it might only be available if built with NumPy 2.5, but that is probably OK in practice, since that is what official builds will use (except maybe for some old Python versions).

That said, NPY_ABI_VERSION >= 0x02000000 isn't good here. If anything, it should be NPY_FEATURE_VERSION (which you can enforce via NPY_TARGET_VERSION to indicate the minimal version of NumPy you support at runtime).

While I think it's correct to hack it in this vein, I would suggest guarding it in a way that uses the custom version only when necessary and makes it obvious how to clean it up in the future.
(Even just copy-pasting the C definitions with an #ifndef ...!?)

If NumPy offers the required defines (which it will, at least on newer versions or with NPY_TARGET_VERSION=NPY_2_0_API_VERSION), it is better to stop using the reimplementation; there is no guarantee it will remain correct forever.
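The copy-paste-with-guard style suggested here can be sketched as follows; NPY_VSTRING and its value 2056 are taken from this PR's diff, while the helper function is invented for illustration:

```cpp
// Supply a local fallback only when the real headers do not already
// define the constant, so the copied definition disappears automatically
// once the headers you build against provide it.
#ifndef NPY_VSTRING
#define NPY_VSTRING 2056  // value copied from this PR's diff
#endif

// Illustrative helper using the (possibly fallback) constant.
inline bool is_vstring_type_num(int type_num) {
  return type_num == NPY_VSTRING;
}
```

The point of the #ifndef is the cleanup path: once pyarrow's minimum NumPy provides the definition, the guarded copy is dead code and can simply be deleted.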

@alippai (Contributor, Author) commented Dec 12, 2025

Removed most of the guards. As I understand it, the feature and target versions could be lower, but please correct me if I'm wrong.

@github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label on Dec 8, 2025
@ngoldbaum commented Dec 8, 2025

I can take a look at this but I'll keep in mind an AI wrote it...

Status NumPyConverter::AppendStringDTypeValues(Builder* builder) {
auto* descr = reinterpret_cast<PyArray_StringDTypeObject*>(dtype_);

npy_string_allocator* allocator = ArrowNpyString_acquire_allocator(descr);


FYI for other reviewers: this locks a mutex internally in NumPy.

@alippai (Contributor, Author) commented Dec 8, 2025

@ngoldbaum thanks for the review. I'll address the comments as soon as I figure out what to do with the NumPy versions. I also appreciate the open-minded perspective on the AI; I did my best to only submit something that works and doesn't have unnecessary code.

@alippai alippai requested a review from ngoldbaum December 12, 2025 05:46
@alippai alippai requested a review from seberg December 12, 2025 05:46
@alippai alippai requested a review from WillAyd December 12, 2025 06:25
@alippai (Contributor, Author) commented Jan 8, 2026

@jorisvandenbossche @raulcd do you think this can go into the next release or is it not realistic?

@raulcd (Member) commented Jan 8, 2026

> do you think this can go into the next release or is it not realistic?

The feature freeze is currently scheduled for Monday the 12th. I don't have enough knowledge around this to provide a meaningful review before then, as I would have to investigate StringDType. Maybe @WillAyd / @pitrou / @ngoldbaum have time before that.

Comment on lines +40 to +42
#ifndef NPY_VSTRING
# define NPY_VSTRING 2056
#endif
Member

Do we actually need this? Can IsStringDType just return false if this constant is not defined (this is presumably when compiling with NumPy < 2)?

}

bool IsStringDType(PyArray_Descr* descr) {
return descr != nullptr && descr->type_num == NPY_VSTRING;
Member

The nullptr check seems superfluous, why would one call this function with a null pointer?

inline npy_string_allocator* ArrowNpyString_acquire_allocator(
const PyArray_StringDTypeObject* descr) {
using Func = npy_string_allocator* (*)(const PyArray_StringDTypeObject*);
return reinterpret_cast<Func>(PyArray_API[316])(descr);
Member

Why not call NpyString_acquire_allocator directly? AFAICT it is a macro that does roughly what your code is doing:

#if NPY_FEATURE_VERSION >= NPY_2_0_API_VERSION
#define NpyString_acquire_allocator \
        (*(npy_string_allocator * (*)(const PyArray_StringDTypeObject *)) \
    PyArray_API[316])
#endif


namespace {

#if NPY_ABI_VERSION >= 0x02000000
Member

I'm not sure why we're guarding with NPY_ABI_VERSION rather than either NPY_VERSION or NPY_FEATURE_VERSION. Can a NumPy maintainer explain how these 3 macros differ? @ngoldbaum @seberg

(also, would be nice if the NumPy docs were a bit more talkative about this)

Contributor

The ABI version is the compile-time header version of NumPy. NPY_FEATURE_VERSION is the runtime version you are compiling for.
I.e., by default, API that is only available in newer versions is disabled, so you can't compile with 1.x runtime support but use newer API.

In this particular case (effectively using future API), #if NPY_ABI_VERSION >= 0x02000000 may tell you that a definition is already included in the header, in this case I guess PyArray_StringDTypeObject.
There is no reason to hide such a definition, so we don't (the things to hide are mostly the API table entries).

Maybe there should be a section on "how to use future API depending on the NumPy runtime version" (although for things where we really expect it, we may want to add it to the npy_2_compat.h header ourselves instead).

FWIW, I still think it makes most sense to wholesale copy-paste the NumPy header definitions. Then add some form of guard (even if just #if NPY_FEATURE_VERSION < NUMPY_2_0_VERSION, or whatever it was), so that you stop using the copy once it is exposed by the NumPy headers.

Member

We don't seem to define NPY_FEATURE_VERSION anywhere, so does that mean the 2.0 macros are not available in the NumPy headers? If we define NPY_FEATURE_VERSION to 2.0, does that mean that the produced extension will not be compatible with NumPy 1.x?

It would be nice if these interactions were made clearer somewhere.

Contributor

> If we define NPY_FEATURE_VERSION to 2.0, does that mean that the produced extension will not be compatible with NumPy 1.x?

Yeah, except what you define would be NPY_TARGET_VERSION. Don't ask me why I used a different name; it was probably silly, but it is what we have now.
I think NPY_TARGET_VERSION is described in the docs, but this way of using future API conditionally is not (the only place is that the npy_2_compat.h header in NumPy itself uses similar conditionals).
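A sketch of that opt-in (not compiled here, since it needs the NumPy headers): NPY_TARGET_VERSION and NPY_2_0_API_VERSION are real NumPy macros, while the include line is just illustrative.

```cpp
// Define the target before any NumPy include to opt in to the 2.0 API
// surface at build time. With this, 2.0-only API table entries (such as
// NpyString_acquire_allocator) become visible in the headers, and the
// built extension will refuse to import on a NumPy 1.x runtime.
#define NPY_TARGET_VERSION NPY_2_0_API_VERSION
#include <numpy/arrayobject.h>
```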

Member

(also, can we assume that NPY_FEATURE_VERSION is always <= NPY_ABI_VERSION? and what about NPY_VERSION, is it something different?)

Comment on lines +377 to +378
return Status::NotImplemented(
"NumPy StringDType requires building PyArrow with NumPy >= 2.0");
Member

Is this at all possible to happen? I got the impression that one cannot use NumPy 2 if PyArrow was compiled for NumPy < 2, am I mistaken @ngoldbaum @seberg ?

Contributor

That is correct, the NumPy C-API import will just barf at you and refuse to run.


@pytest.mark.numpy
def test_array_from_numpy_string_dtype():
dtypes_mod = getattr(np, "dtypes", None)
Member

Can use something like dtypes = pytest.importorskip("numpy.dtypes")

Member

Or, even better, you can use a pytest fixture:

@pytest.fixture
def string_dtype():
    dtypes = pytest.importorskip("numpy.dtypes")
    dtype_class = getattr(dtypes, "StringDType", None)
    if dtype_class is None:
        pytest.skip("NumPy StringDType not available (NumPy > 2 needed)")
    return dtype_class

and then simply:

@pytest.mark.numpy
def test_array_from_numpy_string_dtype(string_dtype):
    arr = np.array(["some", "strings"], dtype=string_dtype())
    # etc.

assert arrow_arr.type == pa.utf8()
assert arrow_arr.to_pylist() == ["some", "strings"]

arrow_arr = pa.array(arr, type=pa.string())
Member

Note that pa.string() and pa.utf8() are the same thing, it's a bit confusing to use both spellings here.

assert arrow_arr.to_pylist() == ["some", "strings"]

arrow_arr = pa.array(arr, type=pa.string())
assert arrow_arr.type == pa.string()
Member

Can you also call arrow_arr.validate(full=True)? (also in other places)


arrow_arr = pa.array(arr, type=pa.string())
assert arrow_arr.type == pa.string()
assert arrow_arr.to_pylist() == ["some", "strings"]
Member

Can also be written assert arrow_arr.to_pylist() == arr.tolist()

mask = np.array([False, False, True], dtype=bool)
arrow_arr = pa.array(arr, mask=mask)
assert arrow_arr.type == pa.utf8()
assert arrow_arr.null_count == 2
Member

Just validate arrow_arr and it will check null_count consistency.

7 participants