Simple filesystem-based structured storage for data with metadata
Shelfie helps you organize your data files in a structured, hierarchical way while automatically managing metadata. Think of it as a filing system that creates organized directories based on your data's characteristics and keeps track of important information about each dataset.
- Organized: Automatically creates directory structures based on your data's fields
- Metadata-aware: Stores attributes alongside your data files
- Flexible: Works with any data that can be saved as CSV, JSON, or pickle
- Simple: Intuitive API for creating and reading structured datasets
- Discoverable: Easy to browse and understand your data organization in the filesystem
Shelfie is meant to sit between a full database and writing a one-off filesystem-storage wrapper for every project.
Shelfie translates database-style relationships into filesystem organization:
Database Thinking          →  Filesystem Result
─────────────────────────────────────────────────
Tables: [experiments]      →  Directory Level 1
Tables: [models]           →  Directory Level 2
Tables: [dates]            →  Directory Level 3
Columns: epochs, lr        →  metadata.json
Data: results.csv          →  Attached files
Root Directory
├── .shelfie.pkl                # Shelf configuration
├── experiment_1/               # Field 1 value
│   ├── random_forest/          # Field 2 value
│   │   └── 2025-06-12/         # Field 3 value (auto-generated date)
│   │       ├── metadata.json   # Stored attributes
│   │       ├── results.csv     # Your data files
│   │       └── model.pkl       # More data files
│   ├── gradient_boost/
│   │   └── 2025-06-12/
│   │       ├── metadata.json
│   │       └── results.csv
│   └── neural_network/
│       └── 2025-06-12/
│           ├── metadata.json
│           └── predictions.csv
└── experiment_2/
    └── ...
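For orientation, here is a minimal sketch of a Shelf definition that could produce a layout like the one above; the field and attribute names are illustrative, not prescriptive:

from shelfie import Shelf, DateField

# Fields define the directory levels; attributes go into metadata.json.
shelf = Shelf(
    root="./root_directory",
    fields=["experiment", "model", DateField("date")],   # experiment_1/random_forest/2025-06-12/
    attributes=["epochs", "lr"]                           # stored in metadata.json
)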
Shelfie = Filesystem-Based Relational Design
- Fields → Directory hierarchy (what you'd normalize into separate tables)
- Attributes → Stored metadata (what you'd store as columns in those tables)
- Data → Files attached to each record (the actual data your database would reference)
- File Paths → Automatically tracked as {filename}_path__ entries in metadata
Traditional Database:
SELECT r.accuracy, e.name, m.type, r.epochs
FROM results r
JOIN experiments e ON r.experiment_id = e.id
JOIN models m ON r.model_id = m.id
WHERE e.date = '2025-06-12'

Shelfie Equivalent:
data = load_from_shelf("./experiments")
results_df = data['results'] # Already has experiment, model, date columns!
filtered = results_df[results_df['date'] == '2025-06-12']

Installation:

pip install shelfie

Quick Start:

import pandas as pd
from shelfie import Shelf, DateField
# Create a shelf for ML experiments
ml_shelf = Shelf(
root="./experiments",
fields=["experiment", "model", DateField("date")], # Directory structure
attributes=["epochs", "learning_rate"] # Required metadata
)
# Create a new experiment record
experiment = ml_shelf.create(
experiment="baseline",
model="mlp",
epochs=100,
learning_rate=0.001 # Typical learning rate for neural networks
)
# Attach your results
results_df = pd.DataFrame({
"accuracy": [0.85, 0.87, 0.89],
"loss": [0.45, 0.32, 0.28],
"epoch": [1, 2, 3]
})
experiment.attach(results_df, "results.csv")

This creates:
experiments/
└── baseline/
    └── mlp/
        └── 2025-06-12/
            ├── metadata.json   # {"epochs": 100, "learning_rate": 0.001, "results_path__": "/path/to/results.csv"}
            └── results.csv     # Your data
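To read a single record back without loading the whole shelf, you can open its metadata file directly. A minimal sketch using only the standard library and pandas; the date directory below follows the example above and will reflect whenever the record was actually created:

import json
import pandas as pd

record_dir = "experiments/baseline/mlp/2025-06-12"   # path from the layout above

with open(f"{record_dir}/metadata.json") as f:
    meta = json.load(f)

print(meta["epochs"], meta["learning_rate"])      # the stored attributes
results = pd.read_csv(meta["results_path__"])     # follow the tracked file path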
Think of this as three related database tables:
- experiments table → project field
- models table → model_type field
- runs table → date field
- Attributes: dataset, hyperparams, notes
from shelfie import Shelf, DateField, TimestampField
import pandas as pd
# Set up experiment tracking (defines your "table" relationships)
experiments = Shelf(
root="./ml_experiments",
fields=["project", "model_type", DateField("date")], # Your table hierarchy
attributes=["dataset", "hyperparams", "notes"] # Your table columns
)
# Log different experiments
mlp_experiment = experiments.create(
project="customer_churn",
model_type="mlp",
dataset="v2_cleaned",
hyperparams={"hidden_layers": [128, 64, 32], "dropout": 0.3, "activation": "relu"},
notes="Multi-layer perceptron with dropout regularization"
)
# Attach multiple files
mlp_experiment.attach(train_results, "training_metrics.csv")
mlp_experiment.attach(test_results, "test_results.csv")
mlp_experiment.attach(feature_importance, "feature_importance.csv")
# Try a different model
cnn_experiment = experiments.create(
project="customer_churn",
model_type="cnn",
dataset="v2_cleaned",
hyperparams={"filters": [32, 64, 128], "kernel_size": 3, "learning_rate": 0.0001},
notes="Convolutional neural network approach"
)

Database equivalent:
- regions table → region field
- time_periods table → year, quarter fields
- Attributes: analyst, report_type, data_source
# Organize sales data by geography and time (multi-table relationship)
sales_shelf = Shelf(
root="./sales_data",
fields=["region", "year", "quarter"], # Geographic + temporal tables
attributes=["analyst", "report_type", "data_source"] # Report metadata columns
)
# Store Q1 data for North America
na_q1 = sales_shelf.create(
region="north_america",
year="2025",
quarter="Q1",
analyst="john_doe",
report_type="quarterly_summary",
data_source="salesforce"
)
sales_data = pd.DataFrame({
"product": ["A", "B", "C"],
"revenue": [150000, 200000, 180000],
"units_sold": [1500, 2000, 1800]
})
na_q1.attach(sales_data, "quarterly_sales.csv")

Database tables: survey_types → demographics → timestamps
# Organize survey responses by type and demographics
surveys = Shelf(
root="./survey_data",
fields=["survey_type", "demographic", TimestampField("timestamp")], # Survey taxonomy
attributes=["sample_size", "methodology", "response_rate"] # Survey metadata
)
# Store customer satisfaction survey
survey = surveys.create(
survey_type="customer_satisfaction",
demographic="millennials",
sample_size=1000,
methodology="online_panel",
response_rate=0.23
)
responses = pd.DataFrame({
"question_id": [1, 2, 3, 4, 5],
"avg_score": [4.2, 3.8, 4.1, 3.9, 4.0],
"response_count": [920, 915, 898, 901, 911]
})
survey.attach(responses, "responses.csv")

The Magic: Automatic JOIN Operations
Unlike databases where you need explicit JOINs, Shelfie automatically combines your "table" relationships:
from shelfie import load_from_shelf
# Load all data from experiments shelf
data = load_from_shelf("./ml_experiments")
# Returns a dictionary of DataFrames - like running multiple JOINed queries:
# {
# 'metadata': All experiment metadata with project+model+date info,
# 'training_metrics': Training data with experiment context automatically joined,
# 'test_results': Test data with experiment context automatically joined,
# ...
# }
# Analyze all your experiments - no JOINs needed!
print(data['metadata']) # Overview of all experiments
print(data['training_metrics']) # All training metrics with full context
# Note: File paths are stored as {filename}_path__ columns (e.g., 'training_metrics_path__')

What you get automatically:
- Denormalized DataFrames: Each CSV gets experiment+model+date columns added
- Full Context: Every row knows its complete "relational" context
- No JOIN complexity: Relationships are already materialized
- Pandas-ready: Immediate analysis without SQL knowledge
Each DataFrame automatically includes:
- Original data columns: Your actual data
- Attribute columns: Metadata from your "table columns" (hyperparams, notes, etc.)
- Field columns: Directory structure as relational context (project, model_type, date)
- File path columns: File references as {filename}_path__ columns
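As a quick illustration, once a shelf is loaded you can slice and aggregate across experiments with plain pandas. A sketch assuming the DataFrames carry the field and attribute columns described above; the 'accuracy' column is hypothetical and stands in for whatever metrics you attached:

from shelfie import load_from_shelf

data = load_from_shelf("./ml_experiments")

# Compare runs across the directory hierarchy without any explicit JOINs.
# 'project', 'model_type', and 'date' come from the fields.
metadata = data['metadata']
print(metadata.groupby("model_type").size())             # runs per model type

metrics = data['training_metrics']
print(metrics.groupby("model_type")["accuracy"].max())   # best run per model type (hypothetical column)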
from shelfie import Field, DateField, TimestampField
# Field with a default value
shelf = Shelf(
root="./data",
fields=[
"experiment",
Field("environment", default="production"), # Always "production" unless specified
DateField("date"), # Auto-generates today's date
TimestampField("timestamp") # Auto-generates current timestamp
],
attributes=["version"]
)
# Only need to specify non-default fields
record = shelf.create(
experiment="test_1",
version="1.0"
)
# Creates: ./data/test_1/production/2025-06-12/2025-06-12_14-30-45/
# Metadata includes: {filename}_path__ entries for any attached files
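A default only fills in when you omit the field, so you can still set it explicitly per record. A small illustrative sketch; the "staging" value is hypothetical:

# Overrides the "production" default for this record only
staging_record = shelf.create(
    experiment="test_2",
    environment="staging",
    version="1.1"
)
# Creates: ./data/test_2/staging/<today's date>/<current timestamp>/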
# Attach different file types
record.attach(results_df, "results.csv")     # CSV
record.attach(model_config, "config.json") # JSON
record.attach(trained_model, "model.pkl") # Pickle
record.attach(report_text, "summary.txt")    # Text
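The objects above (results_df, model_config, trained_model, report_text) are placeholders. The sketch below shows the kinds of values they might hold, assuming, as the comments above suggest, that attach infers the format from the file extension:

# Illustrative placeholder values for the attach calls above
model_config = {"hidden_layers": [128, 64], "dropout": 0.3}   # dict, saved as JSON
report_text = "Baseline run finished; see results.csv."       # str, saved as text
# results_df would be a pandas DataFrame; trained_model any picklable object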
# Load a shelf that was created elsewhere
existing_shelf = Shelf.load_from_root("./experiments")
# Continue adding to it
new_experiment = existing_shelf.create(
experiment="advanced",
model="transformer",
epochs=50,
learning_rate=0.0001 # Lower learning rate for transformer models
)
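Records created on a loaded shelf behave exactly like those on a fresh one; for example, attaching results to the run above lands under the existing root. A sketch where eval_results is a placeholder DataFrame:

# Placeholder results for the transformer run created above
eval_results = pd.DataFrame({"epoch": [1, 2], "accuracy": [0.88, 0.91]})

new_experiment.attach(eval_results, "results.csv")
# Saved under ./experiments/advanced/transformer/<today's date>/results.csv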
Without Shelfie:

my_project/
├── experiment1_mlp_results.csv
├── experiment1_mlp_model.pkl
├── experiment2_cnn_results.csv
├── experiment2_cnn_model.pkl
├── baseline_test_data.csv
├── advanced_test_data.csv
└── notes.txt                    # Which file belongs to what?
With Shelfie:

my_project/
├── baseline/
│   ├── mlp/
│   │   └── 2025-06-12/
│   │       ├── metadata.json   # {"epochs": 100, "lr": 0.001, "results_path__": "/path/results.csv"}
│   │       ├── results.csv
│   │       └── model.pkl
│   └── cnn/
│       └── 2025-06-12/
│           ├── metadata.json   # {"epochs": 200, "lr": 0.0001, "results_path__": "/path/results.csv"}
│           ├── results.csv
│           └── model.pkl
└── advanced/
    └── transformer/
        └── 2025-06-12/
            ├── metadata.json   # {"epochs": 50, "lr": 0.0001, "results_path__": "/path/results.csv"}
            ├── results.csv
            └── model.pkl
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License.
Happy organizing!