
BUG: pd.read_parquet() loads missing string values as None instead of NaN  #55721

@kennysong

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

# Create CSV with missing values
with open('missing_values.csv', 'w') as f:
    f.write('''
int_col_a,int_col_b,float_col,string_col,all_missing_col
1,1,1.1,'one',
2,,,,
3,3,3.3,'three',''')

# Load CSV
df = pd.read_csv('missing_values.csv')

# Notice that all missing values are NaN
df
#    int_col_a  int_col_b  float_col string_col  all_missing_col
# 0          1        1.0        1.1      'one'              NaN
# 1          2        NaN        NaN        NaN              NaN
# 2          3        3.0        3.3    'three'              NaN

# Write to Parquet, then read it back
df.to_parquet('missing_values.parquet', engine='pyarrow')
df2 = pd.read_parquet('missing_values.parquet')

# Notice that there's a None in string_col
df2
#    int_col_a  int_col_b  float_col string_col  all_missing_col
# 0          1        1.0        1.1      'one'              NaN
# 1          2        NaN        NaN       None              NaN
# 2          3        3.0        3.3    'three'              NaN

Issue Description

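pd.read_parquet() loads missing values in string (object-dtype) columns as None instead of NaN, so the same data carries a different missing-value marker than it did after pd.read_csv(). This can cause problems downstream, for example in ML models that expect NaN as the missing-value marker.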

Missing values of non-string columns are correctly loaded as NaN.
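To make the difference concrete without needing a Parquet file, here is a minimal sketch that builds two object-dtype Series by hand. The names `from_csv` and `from_parquet` are hypothetical stand-ins for `string_col` after each load path; they are not produced by the readers themselves.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for string_col after each load path.
from_csv = pd.Series(["one", np.nan, "three"], dtype=object)
from_parquet = pd.Series(["one", None, "three"], dtype=object)

# isna() treats both markers as missing, so pandas-level checks agree.
csv_mask = from_csv.isna().tolist()
parquet_mask = from_parquet.isna().tolist()
print(csv_mask, parquet_mask)

# The stored Python objects differ, which is what leaks into code
# that inspects raw values instead of calling isna().
print(type(from_csv.iloc[1]).__name__)      # float
print(type(from_parquet.iloc[1]).__name__)  # NoneType
```

Code that relies on `isna()` or `dropna()` should behave the same on both; code that compares raw values or passes them to a library expecting floats may not.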

Expected Behavior

Missing values in string columns should also be NaN.
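Until the behavior changes, one possible workaround is to normalize the result after loading. The sketch below is only an illustration: `none_to_nan` is a hypothetical helper, not part of pandas, and the small DataFrame is built by hand to mimic what read_parquet() returned in the example above.

```python
import numpy as np
import pandas as pd


def none_to_nan(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which missing values in object columns are np.nan."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object:
            # Keep non-missing values; replace None (or NaN) with np.nan.
            out[col] = out[col].where(out[col].notna(), np.nan)
    return out


# Hand-built stand-in for the frame returned by read_parquet().
df2 = pd.DataFrame(
    {
        "int_col_a": [1, 2, 3],
        "string_col": pd.Series(["one", None, "three"], dtype=object),
    }
)

fixed = none_to_nan(df2)
print(fixed)
```

The helper leaves the input untouched and only rewrites object-dtype columns, so numeric columns keep their dtypes.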

Installed Versions

INSTALLED VERSIONS

commit : a60ad39
python : 3.9.6.final.0
python-bits : 64
OS : Darwin
OS-release : 22.6.0
Version : Darwin Kernel Version 22.6.0: Wed Jul 5 22:22:05 PDT 2023; root:xnu-8796.141.3~6/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.2
numpy : 1.26.1
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 58.0.4
pip : 21.2.4
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.10.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None


    Labels

    Arrow (pyarrow functionality); IO Parquet (parquet, feather); Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate); Strings (String extension data type and string data)
