A data-driven study for Material–Structure–Property (MSP) mapping in organic photovoltaics. The repository includes a 25k-sample physics-informed dataset, a compact feature set that links material parameters with microstructure descriptors, and a random-forest surrogate model for predicting short-circuit current.
Install the required dependencies:

```bash
pip install -r requirements.txt
```

Project structure:

```
msp-opv/
├── data/
│   ├── data.parquet                  # Main dataset (25k samples)
│   ├── feature_map.yaml              # Feature alias mapping
│   ├── sfs_linear.csv                # Sequential feature selection results (linear)
│   └── sfs_rf.csv                    # Sequential feature selection results (RF)
├── scripts/                          # Python scripts (executable)
│   ├── linear.py                     # Linear model (Lasso) with SFS
│   ├── rf.py                         # Random Forest with SFS
│   ├── model_with_confidence.py      # RF model with confidence estimation
│   ├── pdp.py                        # Partial dependence plots
│   └── sfs_and_correlation_plot.py   # SFS comparison and correlation plots
├── notebooks/                        # Jupyter notebooks (same functionality as scripts)
│   ├── linear.ipynb
│   ├── rf.ipynb
│   ├── model_with_confidence.ipynb
│   ├── pdp.ipynb
│   └── sfs_and_correlation_plot.ipynb
├── figures/                          # Generated plots and visualizations
├── requirements.txt                  # Python dependencies
└── README.md
```
Note: The notebooks in `notebooks/` mirror the scripts in `scripts/` and are included for convenience and interactive exploration.
### scripts/linear.py

- Implements a Lasso regression model with sequential feature selection (SFS)
- Performs cross-validated forward feature selection
- Generates:
  - `sfs_linear.csv`: Feature selection results
  - `sfs_linear_curve.pdf`: CV R² vs. number of features
  - `sfs_linear_gains.pdf`: Marginal gains by feature
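The forward-SFS loop can be sketched with scikit-learn's `SequentialFeatureSelector`. This is an illustrative reconstruction on synthetic data, not the exact configuration of `linear.py`; the alpha, CV folds, and feature count are assumptions:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import Lasso

# Synthetic stand-in for the real feature matrix:
# only columns 0 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = X[:, 0] + 0.5 * X[:, 3] + 0.1 * rng.normal(size=200)

# Cross-validated forward selection with a Lasso base model
# (alpha and n_features_to_select are illustrative choices).
sfs = SequentialFeatureSelector(
    Lasso(alpha=0.01),
    n_features_to_select=2,
    direction="forward",
    scoring="r2",
    cv=5,
)
sfs.fit(X, y)
selected = sorted(np.flatnonzero(sfs.get_support()))
```

Rerunning the fit for each feature count and recording the CV R² at every step yields the selection curve that `sfs_linear_curve.pdf` plots.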
### scripts/rf.py

- Implements a Random Forest regressor with sequential feature selection
- Uses cross-validation to evaluate feature importance
- Generates:
  - `sfs_rf.csv`: Feature selection results
  - `sfs_rf_curve.pdf`: CV R² vs. number of features
  - `sfs_rf_gains.pdf`: Marginal gains by feature
### scripts/model_with_confidence.py

- Trains a Random Forest model with confidence estimation
- Implements two confidence metrics:
  - Ensemble: based on the standard deviation of predictions across trees
  - CI: based on 5–95% intervals of the per-tree predictions
- Generates:
- KDE plots of confidence vs. absolute error
- Prediction plots with confidence bands
- Progressive training curves (R² vs. training fraction)
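Both confidence metrics can be derived from the individual trees of a fitted forest. A minimal sketch on synthetic data, assuming `model_with_confidence.py` follows the usual per-tree-prediction approach (hyperparameters are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem standing in for the real dataset.
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 4))
y = X[:, 0] ** 2 + 0.05 * rng.normal(size=300)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

X_query = X[:5]
# One prediction per tree: shape (n_trees, n_queries).
per_tree = np.stack([tree.predict(X_query) for tree in rf.estimators_])

# "Ensemble" metric: spread of the trees around the mean prediction.
ensemble_std = per_tree.std(axis=0)

# "CI" metric: 5-95% interval of the per-tree predictions.
lo, hi = np.percentile(per_tree, [5, 95], axis=0)
ci_width = hi - lo
```

Plotting either metric against the absolute prediction error (as KDEs) shows whether low-confidence predictions really do carry larger errors.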
### scripts/pdp.py

- Computes and visualizes partial dependence plots (PDPs)
- Analyzes feature interactions using 2D PDPs
- Focuses on key features: `Ld` (exciton diffusion length), `min(c1, c2)` (carrier mobilities), `STAT_e` (interfacial area), and `CT_f_e_conn`
- Generates contour plots showing partial dependence of short-circuit current (`J`) on feature pairs
### scripts/sfs_and_correlation_plot.py

- Compares SFS results between linear and RF models
- Generates correlation heatmaps of input features
- Creates side-by-side visualizations of:
- SFS performance curves
- Feature importance rankings
- Feature correlation matrices
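The correlation heatmap reduces to a `DataFrame.corr()` call rendered as an image. A minimal matplotlib sketch with made-up alias columns (seaborn's `heatmap` would give an equivalent, annotated result):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend for script use
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 4)),
                  columns=["c1", "c2", "c3", "d1"])

corr = df.corr()  # Pearson correlation matrix of the input features

fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr)), corr.columns)
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im, ax=ax, label="correlation")
fig.savefig("correlation_heatmap.pdf")  # repo saves such plots to figures/
```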
Run the scripts from the `scripts/` directory; they expect data files in `../data/` relative to the script location:

```bash
# From project root
cd scripts
python linear.py
python rf.py
python model_with_confidence.py
python pdp.py
python sfs_and_correlation_plot.py
```

The scripts expect:
- `data/data.parquet`: Main dataset with features and target variable `J` (short-circuit current)
- `data/feature_map.yaml`: Mapping between feature aliases (`c1`, `c2`, `c3`, `d1`–`d21`) and full feature names
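The alias mapping loads with PyYAML. The alias names below come from the repository, but the full feature names on the right are invented placeholders for illustration:

```python
import yaml

# Inline stand-in for data/feature_map.yaml; the right-hand names
# are hypothetical -- the real file defines the actual mapping.
feature_map_yaml = """
c1: electron_mobility
c2: hole_mobility
c3: recombination_rate
"""

feature_map = yaml.safe_load(feature_map_yaml)
full_names = [feature_map[alias] for alias in ("c1", "c2", "c3")]
```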
The analysis focuses on:
- Material parameters: Carrier mobilities (c1, c2), recombination (c3)
- Microstructure descriptors: Statistical descriptors (d1-d21)
- Derived features:
  - `Ld`: Exciton diffusion length (nm)
  - `min(c1, c2)`: Minimum of carrier mobilities
  - `STAT_e`: Normalized interfacial area
  - `CT_f_e_conn`: Charge transfer connectivity
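A derived feature like `min(c1, c2)` is just a row-wise reduction over the two mobility columns; a pandas sketch with made-up values and a hypothetical output column name:

```python
import pandas as pd

# Toy mobility columns; real values come from data/data.parquet.
df = pd.DataFrame({"c1": [1e-4, 5e-4, 2e-4],
                   "c2": [2e-4, 1e-4, 2e-4]})

# Row-wise minimum of the two carrier mobilities
# ("min_c1_c2" is an illustrative column name).
df["min_c1_c2"] = df[["c1", "c2"]].min(axis=1)
```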
All figures are saved to the figures/ directory in PDF and/or PNG format. CSV results are saved to data/.
- `numpy`: Numerical operations
- `pandas`: Data manipulation
- `matplotlib`: Plotting
- `scikit-learn`: Machine learning models
- `PyYAML`: YAML file parsing
- `tqdm`: Progress bars
- `pyarrow`: Parquet file support
- `seaborn`: Statistical visualizations
- `scipy`: Scientific computing
See requirements.txt for the complete list.
See LICENSE file for details.