Skip to content
/ shelfie Public

Easy to use python package for organized hierarchical filestorage

License

Notifications You must be signed in to change notification settings

maxrdu/shelfie

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

21 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸ“š Shelfie

Simple filesystem-based structured storage for data with metadata

Shelfie helps you organize your data files in a structured, hierarchical way while automatically managing metadata. Think of it as a filing system that creates organized directories based on your data's characteristics and keeps track of important information about each dataset.

🎯 Why Shelfie?

  • Organized: Automatically creates directory structures based on your data's fields
  • Metadata-aware: Stores attributes alongside your data files
  • Flexible: Works with any data that can be saved as CSV, JSON, or pickle
  • Simple: Intuitive API for creating and reading structured datasets
  • Discoverable: Easy to browse and understand your data organization in the filesystem

Shelfie is meant to be an in between a full database and having to create a wrapper for a filesystem based storage for each project.

πŸ—οΈ How It Works

Conceptual Model: Database Relations β†’ Directory Structure

Shelfie translates database-style relationships into filesystem organization:

Database Thinking          β†’    Filesystem Result
─────────────────────────────────────────────────────
Tables: [experiments]      β†’    Directory Level 1
Tables: [models]           β†’    Directory Level 2  
Tables: [dates]            β†’    Directory Level 3
Columns: epochs, lr        β†’    metadata.json
Data: results.csv          β†’    Attached files

Visual Concept

Root Directory
β”œβ”€β”€ .shelfie.pkl                    # Shelf configuration
β”œβ”€β”€ experiment_1/                   # Field 1 value
β”‚   β”œβ”€β”€ random_forest/              # Field 2 value  
β”‚   β”‚   β”œβ”€β”€ 2025-06-12/             # Field 3 value (auto-generated date)
β”‚   β”‚   β”‚   β”œβ”€β”€ metadata.json       # Stored attributes
β”‚   β”‚   β”‚   β”œβ”€β”€ results.csv         # Your data files
β”‚   β”‚   β”‚   └── model.pkl          # More data files
β”‚   β”‚   └── gradient_boost/
β”‚   β”‚       └── 2025-06-12/
β”‚   β”‚           β”œβ”€β”€ metadata.json
β”‚   β”‚           └── results.csv
β”‚   └── neural_network/
β”‚       └── 2025-06-12/
β”‚           β”œβ”€β”€ metadata.json
β”‚           └── predictions.csv
└── experiment_2/
    └── ...

The Pattern

Shelfie = Filesystem-Based Relational Design

  1. Fields β†’ Directory hierarchy (what you'd normalize into separate tables)
  2. Attributes β†’ Stored metadata (what you'd store as columns in those tables)
  3. Data β†’ Files attached to each record (the actual data your database would reference)
  4. File Paths β†’ Automatically tracked as filename_path__ in metadata

Traditional Database:

SELECT r.accuracy, e.name, m.type, r.epochs 
FROM results r
JOIN experiments e ON r.experiment_id = e.id  
JOIN models m ON r.model_id = m.id
WHERE e.date = '2025-06-12'

Shelfie Equivalent:

data = load_from_shelf("./experiments")
results_df = data['results']  # Already has experiment, model, date columns!
filtered = results_df[results_df['date'] == '2025-06-12']

πŸš€ Quick Start

Installation

pip install shelfie

Basic Example

import pandas as pd
from shelfie import Shelf, DateField

# Create a shelf for ML experiments
ml_shelf = Shelf(
    root="./experiments",
    fields=["experiment", "model", DateField("date")],  # Directory structure
    attributes=["epochs", "learning_rate"]              # Required metadata
)

# Create a new experiment record
experiment = ml_shelf.create(
    experiment="baseline",
    model="mlp", 
    epochs=100,
    learning_rate=0.001  # Typical learning rate for neural networks
)

# Attach your results
results_df = pd.DataFrame({
    "accuracy": [0.85, 0.87, 0.89], 
    "loss": [0.45, 0.32, 0.28],
    "epoch": [1, 2, 3]
})

experiment.attach(results_df, "results.csv")

This creates:

experiments/
└── baseline/
    └── mlp/
        └── 2025-06-12/
            β”œβ”€β”€ metadata.json  # {"epochs": 100, "learning_rate": 0.001, "results_path__": "/path/to/results.csv"}
            └── results.csv    # Your data

πŸ“– Detailed Examples

1. ML Experiment Tracking

Think of this as three related database tables:

  • experiments table β†’ project field
  • models table β†’ model_type field
  • runs table β†’ date field
  • Attributes: dataset, hyperparams, notes
from shelfie import Shelf, DateField, TimestampField
import pandas as pd

# Set up experiment tracking (defines your "table" relationships)
experiments = Shelf(
    root="./ml_experiments",
    fields=["project", "model_type", DateField("date")],  # Your table hierarchy
    attributes=["dataset", "hyperparams", "notes"]        # Your table columns
)

# Log different experiments
mlp_experiment = experiments.create(
    project="customer_churn",
    model_type="mlp",
    dataset="v2_cleaned",
    hyperparams={"hidden_layers": [128, 64, 32], "dropout": 0.3, "activation": "relu"},
    notes="Multi-layer perceptron with dropout regularization"
)

# Attach multiple files
mlp_experiment.attach(train_results, "training_metrics.csv")
mlp_experiment.attach(test_results, "test_results.csv")
mlp_experiment.attach(feature_importance, "feature_importance.csv")

# Try a different model
cnn_experiment = experiments.create(
    project="customer_churn",
    model_type="cnn",
    dataset="v2_cleaned", 
    hyperparams={"filters": [32, 64, 128], "kernel_size": 3, "learning_rate": 0.0001},
    notes="Convolutional neural network approach"
)

2. Sales Data by Region and Time

Database equivalent:

  • regions table β†’ region field
  • time_periods table β†’ year, quarter fields
  • Attributes: analyst, report_type, data_source
# Organize sales data by geography and time (multi-table relationship)
sales_shelf = Shelf(
    root="./sales_data",
    fields=["region", "year", "quarter"],                    # Geographic + temporal tables
    attributes=["analyst", "report_type", "data_source"]     # Report metadata columns
)

# Store Q1 data for North America
na_q1 = sales_shelf.create(
    region="north_america",
    year="2025", 
    quarter="Q1",
    analyst="john_doe",
    report_type="quarterly_summary",
    data_source="salesforce"
)

sales_data = pd.DataFrame({
    "product": ["A", "B", "C"],
    "revenue": [150000, 200000, 180000],
    "units_sold": [1500, 2000, 1800]
})

na_q1.attach(sales_data, "quarterly_sales.csv")

3. Survey Data Organization

Database tables: survey_types β†’ demographics β†’ timestamps

# Organize survey responses by type and demographics
surveys = Shelf(
    root="./survey_data",
    fields=["survey_type", "demographic", TimestampField("timestamp")],  # Survey taxonomy
    attributes=["sample_size", "methodology", "response_rate"]            # Survey metadata
)

# Store customer satisfaction survey
survey = surveys.create(
    survey_type="customer_satisfaction",
    demographic="millennials",
    sample_size=1000,
    methodology="online_panel", 
    response_rate=0.23
)

responses = pd.DataFrame({
    "question_id": [1, 2, 3, 4, 5],
    "avg_score": [4.2, 3.8, 4.1, 3.9, 4.0],
    "response_count": [920, 915, 898, 901, 911]
})

survey.attach(responses, "responses.csv")

πŸ“Š Reading Your Data Back

The Magic: Automatic JOIN Operations

Unlike databases where you need explicit JOINs, Shelfie automatically combines your "table" relationships:

from shelfie import load_from_shelf

# Load all data from experiments shelf
data = load_from_shelf("./ml_experiments")

# Returns a dictionary of DataFrames - like running multiple JOINed queries:
# {
#   'metadata': All experiment metadata with project+model+date info,
#   'training_metrics': Training data with experiment context automatically joined,
#   'test_results': Test data with experiment context automatically joined,
#   ...
# }

# Analyze all your experiments - no JOINs needed!
print(data['metadata'])  # Overview of all experiments
print(data['training_metrics'])  # All training metrics with full context

# Note: File paths are stored as filename_path__ columns (e.g., 'training_metrics_path__')

What you get automatically:

  • Denormalized DataFrames: Each CSV gets experiment+model+date columns added
  • Full Context: Every row knows its complete "relational" context
  • No JOIN complexity: Relationships are already materialized
  • Pandas-ready: Immediate analysis without SQL knowledge

Each DataFrame automatically includes:

  • Original data columns: Your actual data
  • Attribute columns: Metadata from your "table columns" (hyperparams, notes, etc.)
  • Field columns: Directory structure as relational context (project, model_type, date)
  • File path columns: References as filename_path__ columns

πŸ› οΈ Advanced Features

Custom Fields with Defaults

from shelfie import Field, DateField, TimestampField

# Field with a default value
shelf = Shelf(
    root="./data",
    fields=[
        "experiment",
        Field("environment", default="production"),  # Always "production" unless specified
        DateField("date"),                          # Auto-generates today's date
        TimestampField("timestamp")                 # Auto-generates current timestamp
    ],
    attributes=["version"]
)

# Only need to specify non-default fields
record = shelf.create(
    experiment="test_1",
    version="1.0"
)
# Creates: ./data/test_1/production/2025-06-12/2025-06-12_14-30-45/
# Metadata includes: version_path__ for any attached files

Multiple File Types

# Attach different file types
record.attach(results_df, "results.csv")           # CSV
record.attach(model_config, "config.json")         # JSON  
record.attach(trained_model, "model.pkl")          # Pickle
record.attach(report_text, "summary.txt")          # Text

Loading Existing Shelves

# Load a shelf that was created elsewhere
existing_shelf = Shelf.load_from_root("./experiments")

# Continue adding to it
new_experiment = existing_shelf.create(
    experiment="advanced",
    model="transformer",
    epochs=50,
    learning_rate=0.0001  # Lower learning rate for transformer models
)

πŸ—‚οΈ Directory Structure Examples

Before Shelfie

my_project/
β”œβ”€β”€ experiment1_mlp_results.csv
β”œβ”€β”€ experiment1_mlp_model.pkl  
β”œβ”€β”€ experiment2_cnn_results.csv
β”œβ”€β”€ experiment2_cnn_model.pkl
β”œβ”€β”€ baseline_test_data.csv
β”œβ”€β”€ advanced_test_data.csv
└── notes.txt  # Which file belongs to what?

After Shelfie

my_project/
β”œβ”€β”€ baseline/
β”‚   β”œβ”€β”€ mlp/
β”‚   β”‚   └── 2025-06-12/
β”‚   β”‚       β”œβ”€β”€ metadata.json      # {"epochs": 100, "lr": 0.001, "results_path__": "/path/results.csv"}
β”‚   β”‚       β”œβ”€β”€ results.csv
β”‚   β”‚       └── model.pkl
β”‚   └── cnn/
β”‚       └── 2025-06-12/
β”‚           β”œβ”€β”€ metadata.json      # {"epochs": 200, "lr": 0.0001, "results_path__": "/path/results.csv"}
β”‚           β”œβ”€β”€ results.csv
β”‚           └── model.pkl
└── advanced/
    └── transformer/
        └── 2025-06-12/
            β”œβ”€β”€ metadata.json      # {"epochs": 50, "lr": 0.0001, "results_path__": "/path/results.csv"}
            β”œβ”€β”€ results.csv
            └── model.pkl

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

πŸ“„ License

This project is licensed under the MIT License.


Happy organizing! πŸ“šβœ¨

About

Easy to use python package for organized hierarchical filestorage

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages