cellarr-frame

A high-level Python package for managing DataFrames using TileDB as a backing store. This package provides two distinct storage strategies for your data.

DenseCellArrayFrame: For standard DataFrames. Uses TileDB's native multi-attribute 1D array storage. This is highly efficient for DataFrames with columns of mixed types (e.g., numbers, strings, dates).
SparseCellArrayFrame: For sparse DataFrames. Uses a 2D sparse cellarr-array to store data in a "coordinate" (COO) format. This is ideal for very large DataFrames where most values are NaN or 0 (e.g., gene-cell matrices).
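
A quick way to judge which strategy fits is to measure how much of a DataFrame is empty. The sketch below uses only pandas and NumPy; `empty_fraction` is a hypothetical helper for illustration and is not part of cellarr-frame.

```python
import numpy as np
import pandas as pd


def empty_fraction(df: pd.DataFrame) -> float:
    """Hypothetical helper: fraction of cells that are NaN or 0 (numeric frames)."""
    if df.size == 0:
        return 0.0
    empty = df.isna() | (df == 0)
    return float(empty.to_numpy().mean())


dense_like = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0], "B": [5.0, 6.0, 7.0, 8.0]})
sparse_like = pd.DataFrame({"g1": [0.0, np.nan, 0.0, 3.0], "g2": [np.nan, 0.0, 0.0, 0.0]})

print(empty_fraction(dense_like))   # 0.0
print(empty_fraction(sparse_like))  # 0.875
```

A value close to 1 suggests the sparse layout is worth trying; where the cut-off lies depends on your data and is not something this package prescribes.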

Installation

To get started, install the package from PyPI:

pip install cellarr-frame

Factory Function: `create_cellarr_frame`

The easiest way to get started is with the create_cellarr_frame factory. It automatically builds the correct TileDB array schema based on an initial DataFrame or specified dim_dtypes.

import numpy as np
import pandas as pd
from cellarr_frame import create_cellarr_frame

# Example 1: Create a DENSE frame by providing an initial DataFrame
df = pd.DataFrame({'A': np.arange(5), 'B': [f'val_{i}' for i in range(5)]})
create_cellarr_frame("my_dense_frame.tdb", sparse=False, df=df)

# Example 2: Create an EMPTY SPARSE frame with integer-based dimensions
create_cellarr_frame("my_sparse_frame_int.tdb", sparse=True, dim_dtypes=[np.uint64, np.uint64])

# Example 3: Create an EMPTY SPARSE frame with string-based dimensions
create_cellarr_frame("my_sparse_frame_str.tdb", sparse=True, dim_dtypes=[str, str])
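
When the labels come from an existing DataFrame, the `dim_dtypes` argument can be derived from its index and columns. The helper below is a hypothetical sketch (`guess_dim_dtypes` is not part of cellarr-frame). It assumes non-negative integer labels should map to `np.uint64` and everything else to `str`, matching the two combinations shown above; whether mixed combinations are accepted by the library is not confirmed here.

```python
import numpy as np
import pandas as pd


def guess_dim_dtypes(df: pd.DataFrame) -> list:
    """Hypothetical helper: choose a dimension dtype for row and column labels."""

    def pick(labels: pd.Index):
        # Non-negative integer labels map to np.uint64; anything else falls back to str.
        is_int = pd.api.types.is_integer_dtype(labels.dtype)
        if is_int and (len(labels) == 0 or labels.min() >= 0):
            return np.uint64
        return str

    return [pick(df.index), pick(df.columns)]


int_df = pd.DataFrame({0: [1.0, np.nan], 1: [np.nan, 2.0]})
str_df = pd.DataFrame({"col_A": [1.0]}, index=["row_A"])

print(guess_dim_dtypes(int_df) == [np.uint64, np.uint64])  # True
print(guess_dim_dtypes(str_df) == [str, str])              # True
```

The result could then be passed as `dim_dtypes=` to `create_cellarr_frame` together with `sparse=True`.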

1. `DenseCellArrayFrame` (Native DataFrames)

This is the standard choice for typical dense DataFrames.

Writing and Appending

This class is designed for efficient appends. The create_cellarr_frame function (or write_dataframe) writes the first chunk, and append_dataframe adds new rows to the end.

import pandas as pd
import numpy as np
from cellarr_frame import create_cellarr_frame, DenseCellArrayFrame

# 1. Create and write the first DataFrame
df1 = pd.DataFrame({
    'A': np.arange(5, dtype=np.int32),
    'B': np.random.rand(5),
    'C': ['foo' + str(i) for i in range(5)]
})
create_cellarr_frame("dense.tdb", sparse=False, df=df1)

# 2. Open the frame and append a second DataFrame
cdf = DenseCellArrayFrame("dense.tdb")
print(f"Shape before append: {cdf.shape}")

df2 = pd.DataFrame({
    'A': np.arange(5, 10, dtype=np.int32),
    'B': np.random.rand(5),
    'C': ['bar' + str(i) for i in range(5)]
})
cdf.append_dataframe(df2)

print(f"Shape after append: {cdf.shape}")

# Shape before append: (5, 3)
# Shape after append: (10, 3)
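
For comparison, the append behaviour described above (new rows added to the end, with the row index continuing from where it left off) resembles `pd.concat` with `ignore_index=True`. This is a pandas-only sketch of the expected result, not a description of how the package writes to TileDB.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": np.arange(5, dtype=np.int32), "C": [f"foo{i}" for i in range(5)]})
df2 = pd.DataFrame({"A": np.arange(5, 10, dtype=np.int32), "C": [f"bar{i}" for i in range(5)]})

# ignore_index=True renumbers the rows 0..9, mirroring "appended to the end"
expected = pd.concat([df1, df2], ignore_index=True)
print(expected.shape)  # (10, 2)
```

A frame built this way can serve as a reference when checking what `read_dataframe()` returns after an append; stored dtypes may differ from the in-memory ones, so a strict dtype comparison might need relaxing.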

Reading and Querying

You can read the full DataFrame or query it using standard Python slicing.

# 1. Read the full DataFrame
full_df = cdf.read_dataframe()
print(full_df)

#     A         B      C
# 0   0  0.123456   foo0
# 1   1  0.234567   foo1
# ...
# 8   8  0.456789   bar3
# 9   9  0.567890   bar4

# 2. Querying with __getitem__

# Get specific rows (end-exclusive slice, like pandas .iloc)
row_subset = cdf[5:8]
#    A         B      C
# 5  5  0.345678   bar0
# 6  6  0.456789   bar1
# 7  7  0.567890   bar2

# Get a single column
col_A = cdf['A']
#    A
# 0  0
# 1  1
# ...

# Get multiple columns
cols_AC = cdf[['A', 'C']]
#    A      C
# 0  0   foo0
# 1  1   foo1
# ...

# Get specific rows and columns
subset = cdf[1:3, ['A', 'C']]
#    A      C
# 1  1   foo1
# 2  2   foo2

Properties

print(f"Shape: {cdf.shape}")       # (10, 3)
print(f"Columns: {cdf.columns}")   # Index(['A', 'B', 'C'], dtype='object')
print(f"Index: {cdf.index}")       # RangeIndex(start=0, stop=10, step=1)

2. `SparseCellArrayFrame` (Sparse DataFrames)

This is the best choice for data that is mostly empty (NaN). It only stores the values that exist, saving significant space.

Writing and Appending

Writing to a sparse frame involves stack()-ing the DataFrame to find all non-NaN values and writing them to the 2D array.

import pandas as pd
import numpy as np
from cellarr_frame import create_cellarr_frame, SparseCellArrayFrame

# 1. Create a sparse DataFrame (most values are NaN)
df1 = pd.DataFrame({
    0: [1.0, np.nan],  # Index 0, 1
    1: [np.nan, 2.0]
})

# Create the array and write the data
# We specify integer dtypes for the dimensions (row/col labels)
create_cellarr_frame("sparse.tdb", sparse=True, df=df1, dim_dtypes=[np.uint64, np.uint64])

# 2. Open the frame and append new data
sdf = SparseCellArrayFrame("sparse.tdb")
print(f"Shape before append: {sdf.shape}")

# This new DataFrame will be appended starting at the next available row index
df2 = pd.DataFrame({
    1: [3.0, np.nan],  # Relative index 0, 1
    2: [np.nan, 4.0]
})
sdf.append_dataframe(df2) # Automatically appends at rows 2 and 3

print(f"Shape after append: {sdf.shape}")

# Shape before append: (2, 2)
# Shape after append: (4, 3)
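
To make the coordinate format concrete, the following pandas-only sketch flattens the two DataFrames above into (row, col, value) triples. `to_coo` is a hypothetical helper, and using `len(df1)` as the row offset is an assumption; the package may determine the next available row differently.

```python
import numpy as np
import pandas as pd


def to_coo(df: pd.DataFrame, row_offset: int = 0) -> pd.DataFrame:
    """Hypothetical helper: flatten a DataFrame into (row, col, value) triples."""
    # dropna() keeps the result the same across pandas versions of stack()
    stacked = df.stack().dropna()
    rows = stacked.index.get_level_values(0).to_numpy() + row_offset
    cols = stacked.index.get_level_values(1).to_numpy()
    return pd.DataFrame({"row": rows, "col": cols, "value": stacked.to_numpy()})


df1 = pd.DataFrame({0: [1.0, np.nan], 1: [np.nan, 2.0]})
df2 = pd.DataFrame({1: [3.0, np.nan], 2: [np.nan, 4.0]})

# Shift df2's relative rows 0 and 1 so they land at rows 2 and 3
coo = pd.concat([to_coo(df1), to_coo(df2, row_offset=len(df1))], ignore_index=True)
print(coo)
# four triples: (0, 0, 1.0), (1, 1, 2.0), (2, 1, 3.0), (3, 2, 4.0)

print((coo["row"].nunique(), coo["col"].nunique()))  # (4, 3)
```

Only four cells are kept out of a 4 x 3 grid, and the count of unique row and column labels matches the shape reported after the append.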

Reading and Querying

Reading reconstructs the DataFrame from the sparse coordinates.

# 1. Read the full DataFrame
full_df = sdf.read_dataframe()
print(full_df)

#      0    1    2
# 0  1.0  NaN  NaN
# 1  NaN  2.0  NaN
# 2  NaN  3.0  NaN
# 3  NaN  NaN  4.0

# 2. Querying with __getitem__

# Get specific rows
row_subset = sdf[1:3]
#      0    1    2
# 1  NaN  2.0  NaN
# 2  NaN  3.0  NaN

# Get specific columns (by label)
col_subset = sdf[[0, 2]]
#      0    2
# 0  1.0  NaN
# 1  NaN  NaN
# 2  NaN  NaN
# 3  NaN  4.0

# Get specific rows and columns
subset = sdf[0:2, [1]]
#      1
# 0  NaN
# 1  2.0
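
Conceptually, rebuilding a DataFrame from sparse coordinates is a pivot: each stored value is placed at its (row, col) position and every other cell becomes NaN. The sketch below shows this with plain pandas on hand-written triples; it does not claim to reproduce the package's internal read path.

```python
import pandas as pd

coo = pd.DataFrame({
    "row": [0, 1, 2, 3],
    "col": [0, 1, 1, 2],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# pivot() places each value at its (row, col) position and fills the gaps with NaN
wide = coo.pivot(index="row", columns="col", values="value")
wide.index.name = None
wide.columns.name = None

print(wide.shape)              # (4, 3)
print(wide.iloc[0:2][[1]])     # comparable to sdf[0:2, [1]]
```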

String Dimensions

SparseCellArrayFrame also fully supports string-based row and column labels.

# Create with string dimensions
create_cellarr_frame("sparse_str.tdb", sparse=True, dim_dtypes=[str, str])
sdf_str = SparseCellArrayFrame("sparse_str.tdb")

# Write DataFrame with string index/columns
df_str1 = pd.DataFrame({'col_A': [1.0, np.nan]}, index=['row_A', 'row_B'])
sdf_str.write_dataframe(df_str1)

# Appending with string dimensions just adds the new coordinates
df_str2 = pd.DataFrame({'col_B': [3.0]}, index=['row_C'])
sdf_str.append_dataframe(df_str2)

print(sdf_str.read_dataframe())
#        col_A  col_B
# row_A    1.0    NaN
# row_C    NaN    3.0

Note

row_B is missing because every value in that row is NaN, so no coordinates are stored for it.
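
The same effect can be reproduced with plain pandas: once only non-NaN cells are kept, a row that holds nothing but NaN has no coordinates left and cannot be rebuilt. This is an illustrative sketch, not the package's code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col_A": [1.0, np.nan]}, index=["row_A", "row_B"])

# Keep only the cells that hold a value, as a coordinate format would
stored = df.stack().dropna()
print(list(stored.index))   # [('row_A', 'col_A')]

# Rebuilding from the stored cells cannot bring row_B back
rebuilt = stored.unstack()
print(list(rebuilt.index))  # ['row_A']
```

If empty rows need to survive a round trip, the row labels have to be kept somewhere else, for example in a separate list.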

Properties

Properties on sparse frames query the array to find the unique dimension labels.

print(f"Shape: {sdf_str.shape}")       # (2, 2)
print(f"Columns: {sdf_str.columns}")   # Index(['col_A', 'col_B'], dtype='object')
print(f"Index: {sdf_str.index}")       # Index(['row_A', 'row_C'], dtype='object')

Note

This project has been set up using BiocSetup and PyScaffold.
