
ENH: Add convenience API to summarize null counts grouped by dtype (e.g. df.dtype_nulls.summary()) #62833

@Princu1999

Description

Feature Type

  • [x] Adding new functionality to pandas

  • [ ] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

Add a small convenience API to provide a quick, per-dtype view of missing values in a DataFrame. The utility should list columns grouped by dtype with null counts and optional null percentages, and return both a one-row-per-dtype summary and a per-dtype detail table (columns + null counts).

This is a diagnostic convenience (similar in spirit to df.info(show_counts=True) but grouped by dtype and returning programmatic output).
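
For comparison, a hedged sketch of the closest status-quo one-liner (the toy frame is made up for illustration): it yields only per-dtype null totals, with no per-column detail, column counts, or percentages.

import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0], "c": ["x", "y", None]})

# Group each column's null count by its (string-cast) dtype and sum.
print(df.isna().sum().groupby(df.dtypes.astype(str)).sum())
# float64    1
# object     1
# dtype: int64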

Feature Description

Add a DataFrame accessor that provides a compact, programmatic summary of missing values grouped by column dtype.

@pd.api.extensions.register_dataframe_accessor("dtype_nulls")
class DtypeNullsAccessor:
    def __init__(self, df):
        self._df = df

    def summary(self, include_pct: bool = True, sort_desc: bool = True):
        """
        Return (summary_df, detail_dict).

        Parameters
        ----------
        include_pct : bool, default True
            Include null_pct columns (percentage of nulls relative to len(df)).
        sort_desc : bool, default True
            Sort per-dtype detail tables by null_count descending when True.

        Returns
        -------
        summary_df : pd.DataFrame
            One row per dtype with columns:
              - dtype : str (dtype string, e.g. 'float64', 'object')
              - n_columns : int
              - cols_with_nulls : int
              - total_nulls : int
              - avg_null_pct : float (if include_pct)
        detail_dict : dict[str, pd.DataFrame]
            Mapping dtype string -> DataFrame listing the columns of that dtype,
            with columns ['column', 'null_count'] plus 'null_pct' if include_pct.
        """

Implementation sketch / pseudocode:

# Body of summary(), continuing from the docstring above:
df = self._df
nrows = len(df)

# One row per column: name, dtype string, and null count.
per_col = pd.DataFrame({
    "column": df.columns,
    "dtype": df.dtypes.astype(str),
    "null_count": df.isna().sum().values,
})
if include_pct:
    # Guard against division by zero on an empty frame.
    per_col["null_pct"] = per_col["null_count"] / (nrows if nrows else 1) * 100

# Per-dtype detail tables; drop the redundant dtype column so each table
# matches the documented ['column', 'null_count', 'null_pct'?] layout.
detail = {
    dtype: g.drop(columns="dtype")
            .sort_values("null_count", ascending=not sort_desc)
            .reset_index(drop=True)
    for dtype, g in per_col.groupby("dtype")
}

# One-row-per-dtype summary.
agg = per_col.groupby("dtype").agg(
    n_columns=("column", "count"),
    cols_with_nulls=("null_count", lambda s: (s > 0).sum()),
    total_nulls=("null_count", "sum"),
).reset_index()

if include_pct:
    agg["avg_null_pct"] = per_col.groupby("dtype")["null_pct"].mean().values

return agg, detail
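
As a side note (an assumption on my part, not claimed above), the string-cast dtype used as the group key also distinguishes extension and datetime dtypes, so the sketch should cover those without special cases:

import pandas as pd

df = pd.DataFrame({
    "i": pd.array([1, None, 3], dtype="Int64"),           # nullable integer
    "c": pd.Series(["x", None, "x"], dtype="category"),   # categorical
    "t": pd.to_datetime(["2021-01-01", None, "2021-01-03"]),
})

# These strings become the group keys used by the sketch above
# (datetime resolution may vary by pandas version).
print(df.dtypes.astype(str).tolist())  # e.g. ['Int64', 'category', 'datetime64[ns]']
print(df.isna().sum().tolist())        # [1, 1, 1]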

Expected behaviour / examples:

df = pd.DataFrame({
    "a": [1, None, 3],
    "b": [None, None, 2.0],
    "c": ["x", "y", None],
    "d": [True, False, True],
})
summary, detail = df.dtype_nulls.summary()

summary: rows for 'float64', 'object', 'bool' with counts and percentages

detail['float64'] lists columns 'b' and 'a' with null_count and null_pct
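
For illustration only, the sketch above would produce roughly the following (row order follows groupby's default alphabetical sort; float formatting and spacing are approximate):

>>> summary
     dtype  n_columns  cols_with_nulls  total_nulls  avg_null_pct
0     bool          1                0            0      0.000000
1  float64          2                2            3     50.000000
2   object          1                1            1     33.333333

>>> detail["float64"]
  column  null_count   null_pct
0      b           2  66.666667
1      a           1  33.333333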

Alternative Solutions

One-liner / ad-hoc: users can already compute a rough equivalent with a short snippet; iterating the resulting GroupBy yields per-dtype tables of column names and null counts:

(pd.DataFrame({'dtype': df.dtypes.astype(str), 'nulls': df.isna().sum()})
   .reset_index()
   .groupby('dtype')[['index', 'nulls']])
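
Spelled out (a hedged sketch, not part of the proposal), the ad-hoc version needs a few more lines to reach the summary/detail shape described above:

import pandas as pd

df = pd.DataFrame({"a": [1, None, 3], "b": [None, None, 2.0], "c": ["x", "y", None]})

per_col = (pd.DataFrame({"dtype": df.dtypes.astype(str), "nulls": df.isna().sum()})
             .rename_axis("column")
             .reset_index())

# Per-dtype summary: column counts, affected columns, total nulls.
summary = per_col.groupby("dtype").agg(
    n_columns=("column", "count"),
    cols_with_nulls=("nulls", lambda s: (s > 0).sum()),
    total_nulls=("nulls", "sum"),
).reset_index()

# Per-dtype detail tables, sorted by null count.
detail = {
    dtype: g.sort_values("nulls", ascending=False).reset_index(drop=True)
    for dtype, g in per_col.groupby("dtype")
}

print(summary)
print(detail["float64"])

This works, but every caller re-derives the same table shapes by hand, which is the convenience gap the proposed accessor closes.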

Additional Context

Related design rationale:

This feature is a convenience diagnostic that complements df.info() and profiling packages; it returns programmatic data structures (DataFrames and dict) so downstream tooling and tests can consume results.
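
For example (illustrative usage of the proposed API on the df defined earlier; the 10% threshold and variable names are made up), a downstream check or test could consume the returned structures directly:

# Hypothetical downstream check built on the proposed accessor.
summary, detail = df.dtype_nulls.summary(include_pct=True)

max_null_pct = 10.0
offenders = summary.loc[summary["avg_null_pct"] > max_null_pct, "dtype"].tolist()
assert not offenders, f"dtypes exceeding {max_null_pct}% average nulls: {offenders}"

# Per-dtype detail frames are plain DataFrames, e.g. the float64 column with
# the most nulls:
worst_float_col = detail["float64"].iloc[0]["column"]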
