A Python toolkit for processing and managing wind farm SCADA data using DuckDB.
Wind Data Fabric provides utilities to automatically discover and consolidate wind farm data files (CSV, Parquet, JSON, Excel) into efficient DuckDB databases. The toolkit handles various data formats and automatically creates optimized tables with indexes and documentation.
- Multi-format Support: Automatically loads CSV, Parquet, JSON, and Excel files
- Dynamic Discovery: Scans folders and loads all compatible data files
- Smart Table Naming: Generates clean table names from file names
- Auto-indexing: Creates indexes on common columns (timestamp, channel_id, id, date, datetime)
- Path Escaping: Handles file paths with special characters safely
- Performance Tracking: Reports processing time and database statistics
- Cross-platform: Works on Windows, Linux, and macOS
Wind-Data-Fabric/
├── data/
│ ├── raw/ # Raw data files organized by folder
│ │ ├── Kelmarsh/ # Example: wind farm data folder
│ │ │ ├── scada_data.parquet
│ │ │ ├── channels_metadata.csv
│ │ │ └── system_modes.json
│ │ └── your_folder/ # Your data files here
│ └── processed/ # Generated DuckDB databases
│     ├── your_folder.duckdb
│     ├── your_folder_schema.yaml
│     └── your_folder_log.txt
├── scripts/
│ ├── create_duckdb.py # Main database creation script
│ └── utils/
│ └── file_readers.py # File reading utilities
├── create_duckdb.bat # Windows wrapper script
├── create_duckdb.sh # Linux/Mac wrapper script
└── README.md
- Python 3.8+
- DuckDB Python package
- Clone the repository:
  ```bash
  git clone <repository-url>
  cd Wind-Data-Fabric
  ```

- Install dependencies:

  ```bash
  pip install duckdb
  ```

- For Excel file support (optional): DuckDB will auto-install the excel extension when needed.

Place your data files in a folder under data/raw/, then create a DuckDB database:
Windows:
```bat
create_duckdb.bat your_folder_name
```

Linux/Mac:

```bash
chmod +x create_duckdb.sh   # First time only
./create_duckdb.sh your_folder_name
```

Or run the Python script directly:

```bash
# From project root
python scripts/create_duckdb.py --folder your_folder_name

# With custom base path
python scripts/create_duckdb.py --folder your_folder_name --base-path /path/to/project
```

The script automatically detects and loads:
- Parquet files (.parquet): High-performance columnar format
- CSV files (.csv): Comma-separated values with auto-detection
- JSON files (.json): Structured JSON data
- Excel files (.xlsx, .xls): Spreadsheet data
- Prepare your data:
  ```text
  data/raw/MyWindFarm/
  ├── scada_data.parquet
  ├── SCADA_Channels_Metadata.csv
  ├── system_modes_mapping.json
  └── turbine_specs.xlsx
  ```
- Run the script:
  ```bash
  ./create_duckdb.sh MyWindFarm
  ```

- Output:
  ```text
  ============================================================
  DATABASE SUMMARY: MyWindFarm
  ============================================================
  scada_data               1,234,567 rows   15 columns
  scada_channels_metadata         45 rows    5 columns
  system_modes_mapping            12 rows    3 columns
  turbine_specs                    8 rows   20 columns

  Raw files size:      2797.77 MB
  Database size:        552.76 MB
  Total elapsed time:     6.78 seconds
  ============================================================
  ```
- Access the database:
  ```python
  import duckdb

  con = duckdb.connect('data/processed/MyWindFarm.duckdb')
  result = con.execute('SELECT * FROM scada_data LIMIT 10').fetchall()
  con.close()
  ```

The project includes reusable file reading utilities in scripts/utils/file_readers.py:
- read_parquet_to_table(con, parquet_path, table_name): Load Parquet files
- read_csv_to_table(con, csv_path, table_name, auto_detect=True, **kwargs): Load CSV files
- read_json_to_table(con, json_path, table_name, auto_detect=True): Load JSON files
- read_excel_to_table(con, excel_path, table_name, sheet=None): Load Excel files
```python
import duckdb
from scripts.utils.file_readers import read_csv_to_table, read_parquet_to_table

con = duckdb.connect('mydata.duckdb')

# Load a CSV file
result = read_csv_to_table(con, 'data.csv', 'my_table')
print(f"Loaded {result['rows']} rows")

# Load a Parquet file
result = read_parquet_to_table(con, 'data.parquet', 'another_table')
print(f"Columns: {result['columns']}")

con.close()
```

File names are converted to valid SQL table names:
- SCADA_Channels_Metadata.csv → scada_channels_metadata
- system-modes-mapping.json → system_modes_mapping
- Special characters and spaces are replaced with underscores
- All names are lowercase
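The naming rule above can be sketched in a few lines of Python; this is an illustration of the rule as described, not the toolkit's actual implementation:

```python
import re

def to_table_name(filename: str) -> str:
    """Strip the extension, replace runs of non-alphanumeric characters
    with underscores, trim stray underscores, and lowercase the result."""
    stem = filename.rsplit(".", 1)[0]
    return re.sub(r"[^0-9a-zA-Z]+", "_", stem).strip("_").lower()

print(to_table_name("SCADA_Channels_Metadata.csv"))  # scada_channels_metadata
print(to_table_name("system-modes-mapping.json"))    # system_modes_mapping
```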
Indexes are automatically created on commonly used columns:
- timestamp: For time-series queries
- channel_id: For channel-based filtering
- id: For primary key lookups
- date, datetime: For date-based queries
- Failed file loads don't stop the entire process
- Detailed error messages for troubleshooting
- Validation checks for empty tables
- Graceful handling of missing optional files
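The fault-tolerant loop described above can be sketched as follows; load_all and its return shape are illustrative, not the toolkit's actual API:

```python
import logging

logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s", level=logging.INFO)
log = logging.getLogger("wind_data_fabric")

def load_all(paths, loader):
    """Load each file with `loader` (any callable that raises on failure).
    A failing file is logged as a WARNING and skipped, so one bad file
    never aborts the whole run."""
    loaded, failed = [], []
    for path in paths:
        try:
            loader(path)
            loaded.append(path)
        except Exception as exc:
            log.warning("Failed to load %s: %s", path, exc)
            failed.append(path)
    return loaded, failed
```

With this shape, the caller gets both lists back and can report failures at the end instead of stopping at the first one.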
```bash
python scripts/create_duckdb.py --help
```

```text
Options:
  --folder FOLDER     Folder name with data files (required)
  --base-path PATH    Base project path (default: auto-detect)
```

The script uses Python's logging module with timestamps:

- INFO: Progress updates and successful operations
- WARNING: Missing files, empty tables, failed indexes
- ERROR: Critical failures
Issue: "No data files found"
- Solution: Ensure files are in data/raw/<folder_name>/
- Check file extensions (.csv, .parquet, .json, .xlsx, .xls)
Issue: "Module not found: utils.file_readers"
- Solution: Run from project root, not from scripts folder
- Or use the provided batch/bash scripts
Issue: Excel files not loading
- Solution: DuckDB will auto-install the excel extension
- Ensure you have an internet connection for first-time setup
Issue: Path with spaces causes errors
- Solution: The file readers handle this automatically with proper escaping
- Use Parquet: Fastest loading and best compression
- Index Strategy: Only index columns you'll query frequently
- Batch Processing: Process multiple wind farms in sequence
- Memory: Large files are streamed efficiently by DuckDB
Contributions are welcome! Please ensure:
- Code follows existing style
- File readers handle edge cases (special chars, empty files)
- Error messages are clear and actionable
MIT License
Copyright (c) 2025 Jorge A. Thomas-Meléndez
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Jorge A. Thomas-Meléndez
- Email: jorgethomasm@ieee.org
Built with:
- DuckDB: High-performance analytical database
- Python standard library: pathlib, logging, argparse