SciGlass+ is an enhanced SciGlass database constructed with large language models (LLMs) and manual cross-validation. All script files (.py, .ipynb) can be found in this GitHub repository, while all data files (.json, .csv, .xlsx), due to size limitations, can be downloaded from Figshare.

This README is organized into three main sections:
- Database Construction Tutorial – Step-by-step guide on how the glass database was automatically built using LLMs.
- Application Demo – Example notebooks to demonstrate how to load the database, filter glasses based on specific compositions and properties, and use machine learning models to predict glass properties.
- API Reference – Python functions to access and manipulate the SciGlass+ database.
## Database Construction Tutorial

We used LLMs to construct the glass database. The tutorial consists of six steps.
### Step 1: Literature search on Web of Science

- Log in to Web of Science.
- Select Advanced Search.
- Enter keywords in the Query Preview field (based on SciGlass, including glass systems, compositions, and properties of interest); see `keywords.txt`.
- Set the Date Range (custom: 2019-01-01 to 2025-01-01) and click Search.
- Refine results: select Article under "Document Types" and English under "Languages".
- Export metadata (Excel format) including title, authors, journal, year, DOI, and abstract. Merge multiple exports into `savedrecs-total.xls`.
### Step 2: Filter abstracts with an NLP classifier

- Copy the DOI and Abstract columns to a new sheet in `savedrecs-total.xls`. Label articles of interest as `1` and irrelevant ones as `0`. This helps filter out unrelated articles using NLP models (e.g., BERT).
- Prepare datasets using `./src/mkAbsClsDataset-manual.ipynb` to generate training data (`sample.json`) and inference data (`abstr.json`).
- Train and predict on a server with GPU acceleration:
  - Train the model: `python ./src/abstract_classification.py`
  - Predict classifications: `python ./src/pred.py` → `pred.json`
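The repository's abstract classifier is BERT-based. As a lightweight, self-contained illustration of the same binary relevant/irrelevant labelling idea, here is a sketch using TF-IDF features with logistic regression as a stand-in for BERT; the toy abstracts, labels, and variable names below are hypothetical and not taken from `sample.json`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the labelled abstracts
# (1 = glass-related and of interest, 0 = irrelevant).
abstracts = [
    "Glass transition temperature of sodium borosilicate glasses",
    "Refractive index of SiO2-Al2O3 glasses measured at 589 nm",
    "Crystal growth of perovskite thin films for solar cells",
    "Band gap engineering in III-V semiconductor heterostructures",
]
labels = [1, 1, 0, 0]

# TF-IDF features + logistic regression as a lightweight stand-in
# for the fine-tuned BERT classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(abstracts, labels)

# Inference on an unseen abstract (the role of pred.py -> pred.json).
pred = clf.predict(["Viscosity and Tg of phosphate glasses"])
print(pred[0])  # 1: classified as glass-related
```

The real pipeline works the same way at a higher level: train on manually labelled abstracts, then score the full export to keep only relevant DOIs.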
### Step 3: Download articles

- Use `./src/SelectDoi-manual.ipynb` to get the DOIs of the articles to download.
- Download the articles (this requires a subscription to the relevant publishers or journals):
  - Elsevier XML: `python ./src/elsevier_xml_download.py`
  - Springer/MDPI HTML: `python ./src/html_download.py`
  - Wiley HTML: `python ./src/html_download_selenium.py` (requires Chrome)
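For the Elsevier route, full-text XML is served by Elsevier's article retrieval API, which takes a DOI-based URL and the API key in the `X-ELS-APIKey` header. The sketch below only shows how such a request is built; the DOI is a placeholder, and `./src/elsevier_xml_download.py` may differ in its details.

```python
import os
import urllib.request

def elsevier_xml_request(doi: str, api_key: str) -> urllib.request.Request:
    """Build a full-text XML request against Elsevier's article retrieval API.

    Illustrative sketch only; see ./src/elsevier_xml_download.py for the
    script actually used to build the database.
    """
    url = f"https://api.elsevier.com/content/article/doi/{doi}"
    headers = {"X-ELS-APIKey": api_key, "Accept": "text/xml"}
    return urllib.request.Request(url, headers=headers)

# Placeholder DOI and key; a real run needs a valid API key and a
# publisher subscription.
req = elsevier_xml_request("10.1016/j.example.2024.000001", "YOUR_API_KEY")
print(req.full_url)

if os.environ.get("ELSEVIER_API_KEY"):  # only fetch when a key is configured
    real = elsevier_xml_request("10.1016/j.example.2024.000001",
                                os.environ["ELSEVIER_API_KEY"])
    with urllib.request.urlopen(real) as resp:
        xml_text = resp.read().decode("utf-8")
```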
### Step 4: Extract and filter paragraphs and tables

- Extract paragraph and table data:
  - Elsevier XML: `python ./src/xml_parse_struct_table.py` → `data_strcut_els_20241010.json`
  - Wiley/MDPI HTML: `python ./src/html_paser_struct_table_mdpi_wiley.py` → `wiley_data_strcut_20241010.json` / `mdpi_data_strcut_20241010.json`
  - Springer HTML: `python ./src/html_download_table_springer.py` + `python ./src/html_parse_struct_table_springer.py` → `springer_data_strcut_20241010.json`
- Paragraphs are filtered by section titles; tables are filtered using BERT-trained classifiers.
- Use `./src/mkTableClsDatSet.ipynb` to label sample tables and generate training/validation datasets → `sample_tables.json`, `val_tables.json`.
- Train the table classifier on the server: `python ./src/tables_cls.py`, then validate with `python ./src/table_val_pred.py` → `pred_val_tables.json`.
- Apply the classifiers to filter paragraphs/tables: `./src/ParseParaTable-html-xml-sel-tables.ipynb` → `data_xml.json`, `data_html.json`.
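The HTML parsing scripts above pull `<table>` markup out of article pages into structured JSON. A minimal, self-contained sketch of that idea using only the standard library (this is not the repository's actual parser, and the sample table is invented):

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect every <table> in an HTML page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []       # one list of rows per <table>
        self._row = None       # current row, or None outside <tr>
        self._cell = []        # text fragments of the current cell
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell, self._cell = True, []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            self._row.append("".join(self._cell).strip())
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

# Invented sample table in the style of a glass-property table.
html = """<table>
<tr><th>SiO2 (mol%)</th><th>Tg (K)</th></tr>
<tr><td>75</td><td>820</td></tr>
</table>"""
parser = TableExtractor()
parser.feed(html)
print(parser.tables)
# [[['SiO2 (mol%)', 'Tg (K)'], ['75', '820']]]
```

Real article HTML needs extra handling (nested tags, rowspans, footnotes), which is why the repository keeps per-publisher parsers.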
### Step 5: Extract data with GPT

- Extract data from the filtered paragraphs/tables: `python ./src/gpt_extract_method_glass.py` → `result_20241010.json`, `result_html_20241013.json`.
- Convert the JSON output to Excel and merge the batches: `./src/jsonR2Excel&metaD-manual.ipynb` → `glass-rawdata-2019-2025.xlsx`.
- Manual cross-validation is then used to improve data quality.
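The GPT extraction step sends each filtered paragraph to an LLM and parses a JSON reply. `gpt_extract_method_glass.py` defines the real prompt and output schema; the sketch below only illustrates the prompt-building and reply-parsing pattern, with a hypothetical prompt wording and a simulated model reply (no API call is made).

```python
import json

def build_extraction_prompt(paragraph: str) -> str:
    """Hypothetical prompt asking an LLM for composition/property JSON."""
    return (
        "Extract the glass composition (mol%) and measured properties from "
        "the paragraph below. Answer with JSON only, using the keys "
        '"composition" and "properties".\n\n' + paragraph
    )

def parse_llm_json(reply: str) -> dict:
    """Strip an optional Markdown code fence and parse the JSON payload."""
    text = reply.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(text)

# Simulated model reply for an invented 75SiO2-25Na2O paragraph.
reply = '```json\n{"composition": {"SiO2": 75, "Na2O": 25}, "properties": {"Tg": 820}}\n```'
data = parse_llm_json(reply)
print(data["properties"]["Tg"])  # 820
```

Keeping the reply machine-parseable (JSON only, fixed keys) is what makes the batch conversion to Excel in the next step straightforward.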
### Step 6: Manual cross-validation

- Download PDFs of the articles to verify the extracted data: `./src/checkPDFlist.ipynb` → DOI lists `els_pdf.txt`, `other_pdf.txt`; then `python ./src/elsevier_pdf_download_custom.py` / `python ./src/pdf_download.py` → download the PDF files into the `pdf_list` folder (requires your API key and a subscription to the relevant publishers or journals).
- Cross-check each row in `glass-rawdata-2019-2025.xlsx` against the corresponding PDF.
- After 12 volunteers worked 4 hours/day for 10 months (total cost ~40,000 RMB), the verified database `SciGlass_Plus_properties.xlsx` was produced.
## Application Demo

- Create a new conda environment with Python 3.11.4: `conda create -n complexity python=3.11.4`
- Activate the environment: `conda activate complexity`
- Install the required dependencies: `pip install -r requirements.txt`

The demo notebooks cover:

- Loading the database and filtering glasses by compositions (`./composition.ipynb`) and properties (`./property.ipynb`).
- Using machine learning models to predict glass properties (`./predictor.ipynb`).
## API Reference

Install the package: `pip install SciGlassPlus`

```python
from SciGlassPlus.load import SGP, available_columns

# List the available column names.
columnNames = available_columns()
```

`available_columns()` returns a dictionary with the keys `"Elements"`, `"Compounds"`, `"Properties"`, and `"Metadata"`; each key maps to a list of available items.

```python
# Load the full database.
df_all = SGP()

# Load a filtered view.
elements_cfg = {"drop": ["Ca"], "keep": ["Si"]}
compounds_cfg = {"drop": ["CaO", "TiO2"], "keep": ["SiO2", "Al2O3"]}
properties_cfg = {"drop": ["T1", "T2"], "keep": ["Tg"]}
metadata_cfg = {"drop": ["Institutions"], "keep": ["Doi"]}
df_filtered = SGP(
    elements_cfg=elements_cfg,
    compounds_cfg=compounds_cfg,
    properties_cfg=properties_cfg,
    metadata_cfg=metadata_cfg
)
```

- `"drop"`: the listed columns are removed; for elements and compounds, entries with non-zero values in those columns are also filtered out.
- `"keep"`: only entries with defined values in the listed columns are kept.
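The `drop`/`keep` semantics can be illustrated with a small pandas sketch; `filter_entries`, its arguments, and the toy DataFrame below are hypothetical and not the actual `SGP` implementation.

```python
import pandas as pd

def filter_entries(df, drop=(), keep=(), zero_means_absent=False):
    """Apply SGP-style "drop"/"keep" rules to a DataFrame (sketch only).

    drop: remove those columns; with zero_means_absent (the elements and
          compounds case), also remove rows where the column is non-zero.
    keep: keep only rows where those columns have defined (non-NaN) values.
    """
    out = df
    for col in drop:
        if zero_means_absent:
            out = out[out[col].fillna(0) == 0]
        out = out.drop(columns=col)
    for col in keep:
        out = out[out[col].notna()]
    return out

# Toy composition/property table: row 1 contains CaO and has no Tg value.
df = pd.DataFrame({
    "SiO2": [75, 60, 70],
    "CaO":  [0, 10, 0],
    "Tg":   [820, None, 790],
})
filtered = filter_entries(df, drop=["CaO"], keep=["SiO2", "Tg"],
                          zero_means_absent=True)
print(list(filtered.index))  # [0, 2]: the CaO-containing row is removed
```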
## Contact

For questions about the database or contributions, please contact: SciGlassPlus@163.com
## FAIR Assessment

Based on the definitions of the FAIR principles, SciGlass+ is assessed against each FAIR component below.
Regarding Findability, each entry in SciGlass+ is assigned a persistent, globally unique identifier (SGP-#sample, e.g. SGP-26) (F1). All entries are described with rich metadata, including authorship, abstracts, keywords, affiliations, contact information, funding sources, citation metrics, publishing details, and descriptive narratives on background, processing methodologies, and underlying mechanisms (F2). The metadata explicitly references the identifiers of the corresponding entries through DOIs (F3), ensuring clear traceability. Furthermore, the dataset and its corresponding code are hosted in a GitHub repository, allowing users to locate specific entries efficiently (F4).
In terms of Accessibility, all data and scripts are openly downloadable via standard HTTP(S) protocols, which are open, free, and universally implementable (A1, A1.1). Where necessary, the protocol supports authentication and authorization procedures (e.g., via HTTPS and token-based access), thereby fulfilling A1.2. Users can access and retrieve both metadata and data without restrictions, and metadata will remain accessible even if the underlying dataset is updated or temporarily unavailable. This is ensured by the assignment of persistent identifiers (DOIs) via Figshare or Zenodo, which guarantees long-term accessibility of metadata independently of the dataset files (A2).
For Interoperability, we have proposed the ULG to provide a formal, structured schema for representing composition, processing, property, and metadata (I1). Controlled vocabularies are applied consistently across the dataset, including IUPAC-compliant chemical symbols as well as standardized terms for glass composition (I2). Qualified references to external sources are maintained through persistent identifiers (DOIs) linking to relevant literature, ensuring that the dataset is properly contextualized and connected to existing knowledge bases (I3).
Concerning Reusability, all metadata are richly described with relevant attributes, including authorship, abstracts, keywords, affiliations, contact information, funding sources, citation metrics, publishing details, and descriptive narratives on background, processing methodologies, and underlying mechanisms (R1). Data are released under an open CC-BY 4.0 license, ensuring clear terms for reuse (R1.1). Detailed provenance information is included, describing sources, extraction methods, and validation steps (R1.2). Units, nomenclature, and field definitions are standardized to support consistent reuse across different research applications (R1.3).
The current dataset and accompanying scripts provide substantial FAIR-compliant capabilities, enabling reproducible, transparent, and programmatic access for computational materials science and machine learning applications.