
BR-HPC-Log


Introduction

This repository supports the BR-HPC-Log Project, which aims to promote experimentation and development of machine learning-based approaches and algorithms for job failure prediction, job type prediction, job runtime prediction, and resource usage prediction, among others, in the context of job scheduling in High-Performance Computing (HPC) clusters.

Motivation

Energy companies rely extensively on High-Performance Computing (HPC) clusters to execute Petroleum Reservoir Simulation jobs, aiming to optimize the extraction of valuable hydrocarbons. However, ensuring efficient utilization of these supercomputers is paramount from power, financial, and operational perspectives. Integrating machine learning (ML) algorithms into job scheduling can, for instance, improve resource utilization and decrease queue waiting times.

BR-HPC-Log is an anonymized dataset of petroleum reservoir simulation jobs sourced from supercomputer SLURM logs, encompassing the daily HPC activities of more than 300 engineers at Petrobras over four years, covering around 7.8 million records. This dataset can be used to develop ML-based job execution time predictors, among other possibilities.

Strategy

This open project is built around a central resource: the BR-HPC-Log Dataset, which will evolve and be supplemented with additional instances from time to time.

Our strategy is to make this resource publicly available so that we can develop this project with the worldwide community in a collaborative manner.

Ambition

Through this project, Petrobras intends to develop (fix, improve, supplement) and/or foster:

  • Job scheduling and resource allocation strategies by predicting job outcomes and optimizing resource usage;

  • Identification of inefficiencies and anomalies in job execution, ensuring better resource management and quicker issue resolution;

  • Automated decision-making processes by clustering job patterns and forecasting future resource demands, facilitating proactive system management.

Contributions

We expect to receive diverse contributions from individuals, research institutions, startups, companies, and oil operator partners.

Before you can contribute to this project, we require you to read and agree to the project's Code of Conduct and Contributing guidelines.

It is also essential to follow and participate in the discussions. Click the Discussions link in the top menu.

License

All the code (Python files in the technical validation directory) and dataset (compressed JSON-minified files in the dataset directory) of this project are licensed under the Apache 2.0 License.

BR-HPC-Log Dataset

Dataset Creation

Data Acquisition

The data was acquired through a query submitted to a private server hosted by Petrobras, as illustrated in the figure below.

(Figure: the data acquisition process)

The server stores the records of all jobs submitted via SLURM, and the query consists of filtering and sorting criteria.

During the filtering, only petroleum reservoir simulation jobs whose @submit values were greater than or equal to "2021-01-01T00:00:00.000Z" and less than or equal to "2024-12-31T23:59:59.999Z" were selected.

Next, the filtered data was sorted in ascending order based on the @submit field.

Finally, the sorted data was stored in a file with one job record per line, where each record is a map of key-value pairs, each pair holding a data field and its respective value.
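The filtering and sorting steps above can be sketched in Python. This is an illustrative in-memory version with made-up records; the actual query runs against Petrobras's private server:

```python
# Hypothetical job records; the real data lives on a private server.
records = [
    {"@submit": "2023-05-10T08:30:00.000Z", "jobid": 2},
    {"@submit": "2020-12-31T23:59:59.000Z", "jobid": 1},
    {"@submit": "2021-01-01T00:00:00.000Z", "jobid": 3},
]

LOWER = "2021-01-01T00:00:00.000Z"
UPPER = "2024-12-31T23:59:59.999Z"

# Keep only jobs submitted within the four-year window, then sort
# ascending by @submit. Fixed-format ISO-8601 timestamps compare
# correctly as plain strings.
selected = sorted(
    (r for r in records if LOWER <= r["@submit"] <= UPPER),
    key=lambda r: r["@submit"],
)
```

Here, the job outside the window (submitted in 2020) is dropped, and the remaining records come out in submission order.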

Data Anonymization

The original values of some data fields had to be anonymized to comply with the data protection policies followed by Petrobras, as illustrated in the figure below.

  • The script field contained unstructured sensitive data, the most relevant of which was the name and version of the proprietary software used by the petroleum reservoir simulations, when available. Therefore, a pre-processing step using regular expressions extracted this information before the anonymization process ran.

  • The total_cpus field, originally numeric, was transformed into a new categorical field named total_cpus_class, where each distinct CPU count value was mapped to a class. In addition, correlated fields to total_cpus were not included in the current version of the dataset.
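A minimal sketch of both transformations. The regular expression and the class-assignment rule below are illustrative assumptions; the exact patterns and mapping used by Petrobras are not disclosed:

```python
import re

# Hypothetical pattern for "<simulator> <version>" (e.g., "nor 2022.10")
# inside a raw batch script. The real pre-processing rules are undisclosed.
SIM_PATTERN = re.compile(r"(?P<name>[a-z]+)\s+(?P<version>\d{4}\.\d{2})")

def extract_simulator(script: str):
    """Return 'name version' if the script mentions a simulator, else None."""
    match = SIM_PATTERN.search(script)
    return f"{match['name']} {match['version']}" if match else None

def build_cpu_classes(cpu_counts):
    """Map each distinct CPU count to a class index (order of first
    appearance; the actual class assignment is undisclosed)."""
    classes = {}
    for count in cpu_counts:
        classes.setdefault(count, len(classes))
    return classes

# Hypothetical usage with made-up CPU counts.
cpu_classes = build_cpu_classes([64, 128, 64, 256])
```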

(Figure: the data anonymization process)

Technical Validation

Ensuring integrity, accuracy, and reliability is fundamental to obtaining valid and reproducible results in any data-driven analysis or machine learning modeling. In this sense, the following validation procedures were applied to the dataset (technical validation directory):

  • Consistency and integrity: Regarding completeness, most fields have few missing values, and those with large missing counts are not typically pertinent for analyses (statistics_of_all_fields.py, top_x_occurrences_of_nominal_fields.py). Regarding consistency, the relationships between different fields were examined, for instance, whether a @start date value was later than an @end date value (consistency_of_fields.py). Since SLURM managed all jobs and stored their execution history, no contradictions or conflicting values were found in the dataset;

  • Format and structure: Each line (or data record) of the dataset files was validated against a predefined schema based on the JSON Schema Draft-07 to ensure that the data followed the correct format (i.e., integer, floating-point, date, string) and the expected structure (validation_of_predefined_schema.py). Missing values were allowed for the data fields with missing occurrences. Moreover, no additional data fields besides the expected ones (Data Fields) could be present. Therefore, the data values are guaranteed to be expressed consistently and in a standard manner;

  • Relevance and currentness: The dataset includes key data fields concerning cluster usage and the scheduling of jobs, such as queue_wait and work_dir. Moreover, recent studies (Publications) successfully employed its features to predict the execution time of petroleum reservoir simulation jobs, demonstrating its relevance for similar objectives. Concerning currentness, no significant shifts in the underlying domain of petroleum reservoir simulation jobs have made the dataset outdated. Additionally, it was updated through December 2024, providing up-to-date information and reflecting the most recent job execution behavior patterns;

  • Granularity: The dataset provides time granularity of up to seconds between data records, allowing, for instance, varied time series analysis. Moreover, most of the nominal and numerical fields are detailed enough for different goals, such as summarizing data to understand its main characteristics (Descriptive Analytics), identifying causes or reasons behind certain events or outcomes (Diagnostic Analytics), and using historical data to make predictions about future events or trends (Predictive Analytics);

  • Balance: The data balancing analysis was performed for elapsed and queue_wait, the numerical fields with the highest standard deviation values. Two visual methods were used for each field: Kernel Density Estimate (KDE) and Quantile-Quantile (Q-Q) (data_distribution_of_numeric_fields.py). The analysis indicated that both fields require the use of balancing procedures to improve their predictive efficiency;

  • Provenance: The Dataset Creation section includes details about the origin of the data (although public access to the origin server is not possible), describes how it was collected, such as the data period, and presents the transformations made to the raw data. Moreover, known issues are presented, such as missing data, which helps potential users evaluate its reliability for specific analyses. Finally, this repository is accompanied by a metadata documentation summary, providing an overall understanding of the data's context, limitations, and usage opportunities;

  • Accessibility and transparency: The dataset is hosted in an open and accessible platform (GitHub) and is published in a commonly used compressed format (.xz). Its documentation provides an open data license (Apache License 2.0) and includes a detailed description of the data and metadata, usage guidelines, and limitations. These aspects ensure the dataset is publicly accessible and well-documented, helping potential users to understand and operate it straightforwardly.
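A hand-rolled sketch of the kind of per-record check described above. The repository's validation_of_predefined_schema.py relies on a full JSON Schema Draft-07 definition; the field subset, expected types, and error messages here are illustrative only:

```python
# Expected types for a small, illustrative subset of the data fields.
EXPECTED = {
    "jobid": int,
    "elapsed": int,
    "state": str,
    "@start": str,
    "@end": str,
}

def validate_record(record: dict) -> list:
    """Return a list of validation errors for one job record."""
    errors = []
    for field, expected_type in EXPECTED.items():
        value = record.get(field)
        # Missing values are allowed for fields with missing occurrences.
        if value is not None and not isinstance(value, expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    # Consistency check: a job cannot end before it starts
    # (cf. consistency_of_fields.py). Fixed-format ISO-8601 strings
    # compare chronologically.
    if record.get("@start") and record.get("@end"):
        if record["@start"] > record["@end"]:
            errors.append("@start is later than @end")
    return errors
```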

Dataset Summary

The dataset holds 7,787,294 records, of which 640,252 (8.22%) are from 2021, 1,882,584 (24.18%) are from 2022, 2,531,182 (32.5%) are from 2023, and 2,733,276 (35.1%) are from 2024. Each data record contains up to 39 data fields, whose type and description are summarized in the Data Fields subsection.
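The per-year shares can be reproduced from the record counts:

```python
# Records per year as reported above.
counts = {2021: 640_252, 2022: 1_882_584, 2023: 2_531_182, 2024: 2_733_276}
total = sum(counts.values())  # 7,787,294 records in total
# Percentage share of each year, rounded to two decimal places.
shares = {year: round(100 * n / total, 2) for year, n in counts.items()}
```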

Supported Tasks

Machine learning tasks supported by the BR-HPC-Log dataset include (but are not limited to):

  • Classification - Predicting a category or class label based on job-related data, for instance:

    • Job failure prediction: Given the job features like resource usage, job duration, and job type, you could classify jobs as "successful" or "failed";
    • Job type prediction: Predicting the type of job (e.g., "batch" or "interactive") based on resource consumption or user characteristics.
  • Regression - Predicting a continuous variable based on job-related data, for instance:

    • Predicting job runtime: Based on historical logs, you could build a regression model to predict how long a job will take based on features like the number of nodes, memory allocated, and job type;
    • Predicting resource usage (e.g., memory or CPU): Predicting how much memory or CPU a job will consume, which could be useful for scheduling purposes.
  • Anomaly Detection - Identifying unusual patterns or anomalies (rare observations) in the dataset, for instance:

    • Resource usage anomalies: Detecting jobs that consume an unusually high amount of resources (CPU, memory, etc.) compared to others;
    • Unusual execution times: Identifying jobs that take abnormally long or short durations compared to the expected range.
  • Clustering - Grouping jobs into similar clusters based on their characteristics, for instance:

    • Job grouping: Group jobs based on usage patterns (e.g., high memory, long runtime), which could help optimize resource allocation;
    • Cluster users by job behavior: Identifying user behavior patterns (e.g., frequent short vs. occasional long jobs) to understand resource usage trends.
  • Time Series Forecasting - Predicting future events or trends based on temporal data, for instance:

    • Resource demand forecasting: Predicting the future resource demand (CPU, memory) based on past job submissions and resource usage patterns;
    • Job arrival prediction: Predicting when jobs are likely to be submitted based on historical submission times.
  • Feature Engineering / Dimensionality Reduction - Reducing the number of features while retaining relevant information, for instance:

    • PCA (Principal Component Analysis) on resource usage: Reducing the complexity of job resource usage features to improve performance and interpretability of models;
    • Textual analysis of job descriptions: If the logs contain job descriptions or command lines (strings), techniques like TF-IDF or embeddings can be applied to extract key features and reduce the dimensionality.
  • Job Scheduling Optimization - Optimizing job scheduling decisions to improve efficiency or minimize resource wastage, for instance:

    • Resource allocation optimization: Building a system to predict the best job scheduling strategy, given historical SLURM log data on job resource usage and completion times;
    • Priority-based scheduling: Using the job types and histories to prioritize high-demand jobs (e.g., jobs requiring high memory).
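As a minimal illustration of the regression task, a naive baseline (not part of the repository, and far simpler than the published predictors) could estimate a job's elapsed time as the median elapsed of historical jobs sharing its total_cpus_class:

```python
from statistics import median

def fit(history):
    """Build a median-elapsed model per total_cpus_class plus a
    global-median fallback for unseen classes."""
    by_class, all_elapsed = {}, []
    for job in history:
        by_class.setdefault(job["total_cpus_class"], []).append(job["elapsed"])
        all_elapsed.append(job["elapsed"])
    per_class = {cls: median(vals) for cls, vals in by_class.items()}
    return per_class, median(all_elapsed)

def predict(model, job):
    """Predict elapsed seconds for one job."""
    per_class, fallback = model
    return per_class.get(job["total_cpus_class"], fallback)

# Hypothetical training history with made-up values.
history = [
    {"total_cpus_class": 0, "elapsed": 100},
    {"total_cpus_class": 0, "elapsed": 300},
    {"total_cpus_class": 1, "elapsed": 5000},
]
model = fit(history)
```

Such a baseline is useful mainly as a floor against which real models (gradient boosting, neural networks, etc.) can be compared.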

Languages

English

Dataset Structure

Data Instances

A sample (last record) from the dataset is provided below:

{
    'jobid': 7619236,
    'username': 'osi8',
    'user_id': 6906,
    'groupname': 'bigemina',
    'group_id': 3817,
    '@start': '2024-12-31T23:59:56',
    '@end': '2025-01-01T04:19:19',
    'elapsed': 15563,
    'partition': 'ijo,ash',
    'alloc_node': 'mes3do4',
    'nodes': 'mes1o08c30',
    'total_cpus_class': 0,
    'total_nodes': 1,
    'derived_ec': '0:0',
    'exit_code': '0:0',
    'state': 'COMPLETED',
    'pack_job_id': 0,
    'pack_job_offset': 0,
    'het_job_id': 0,
    'het_job_offset': 0,
    '@submit': '2024-12-31T23:59:56',
    '@eligible': '2024-12-31T23:59:56',
    'queue_wait': 0,
    'work_dir': '/bxs_mungy_ug/ram/yogees/probers/xc/xc02/xx_drossy/la_mime/nef_gae/gae_drossy_staved/amp/ice/d0/d1/sibyllism103r_ice_0/sibyllism103r_020',
    'std_err': '/bxs_mungy_ug/ram/yogees/probers/xc/xc02/xx_drossy/la_mime/nef_gae/gae_drossy_staved/amp/ice/d0/d1/sibyllism103r_ice_0/sibyllism103r_020/unb.rel',
    'std_in': '/axe/chug',
    'std_out': '/bxs_mungy_ug/ram/yogees/probers/xc/xc02/xx_drossy/la_mime/nef_gae/gae_drossy_staved/amp/ice/d0/d1/sibyllism103r_ice_0/sibyllism103r_020/unb.qid',
    'cluster': 'unswatheable',
    'qos': 'erg_enacture',
    'time_limit': 43200,
    'job_name': 'sibyllism103r_020',
    'account': 'auk-gae-est',
    'script': 'nor 2022.10',
    'parent_accounts': '/obes/auk-gae-est/auk-gae-est'
}

Data Fields

  • Nominal:

    • account: Account associated with the job;
    • alloc_node: Nodes allowed to execute the job;
    • cluster: Name of the cluster in which the job was executed;
    • derived_ec: Highest exit code returned by the job;
    • excluded_nodes: Nodes forbidden to execute the job;
    • exit_code: Exit code returned by the job;
    • groupname: Group name associated with the job;
    • job_name: Name of the job;
    • nodes: Nodes that executed the job;
    • orig_dependency: Dependencies that had to be satisfied before the job execution;
    • parent_accounts: Hierarchy (parents) associated with the job's account;
    • partition: Partitions (queues) to which the job was submitted;
    • qos: Scheduling priority defined for the job execution;
    • reservation_name: Name associated with the nodes reserved for the job;
    • script: Batch script submitted for the job. May contain the name and version of the simulator;
    • state: State of the job in extended form;
    • std_err: Path to which the standard error (stderr) stream was written;
    • std_in: Path from which the standard input (stdin) stream was read;
    • std_out: Path to which the standard output (stdout) stream was written;
    • total_cpus_class: Class associated with the number of CPU cores allocated for the job;
    • username: User who submitted the job;
    • work_dir: Working directory (path) of the batch script executed by the job.
  • Ordinal:

    • @eligible: Timestamp at which the job became eligible to run;
    • @end: Timestamp at which the job finished;
    • @start: Timestamp at which the job started;
    • @submit: Timestamp at which the job was submitted;
    • array_job_id: Identifier for the common-parameters jobs collection (job array);
    • array_task_id: Identifier for the tasks of the job array;
    • group_id: Unique identifier for the group associated with the job;
    • het_job_id: Leader ID of the co-scheduled jobs;
    • het_job_offset: Unique sequence number for each of the co-scheduled jobs, zero origin;
    • jobid: Identifier for the job;
    • pack_job_id: Common identification for the co-scheduled jobs;
    • pack_job_offset: Unique sequence number for each of the co-scheduled jobs, zero origin;
    • user_id: Unique identifier of the user who submitted the job.
  • Discrete:

    • elapsed: Job's wall time in seconds;
    • queue_wait: Time in seconds the job spent in the queue before starting;
    • time_limit: Time limit in seconds for the job execution;
    • total_nodes: Number of nodes allocated for the job.

How to Extract the Compressed Dataset Files?

Four compressed dataset files are available: br-hpc-log_2021.xz (14.7 MB), br-hpc-log_2022.xz (54.0 MB), br-hpc-log_2023.xz (71.5 MB), and br-hpc-log_2024.xz (77.8 MB), corresponding to each year.

The files were compressed using XZ Utils, a general-purpose data-compression library based on the LZMA compression algorithm. After decompression, the files are 645 MB, 1.95 GB, 2.66 GB, and 2.88 GB in size, respectively, totaling 8.13 GB.

Extracting on Debian/Ubuntu Linux:

  sudo apt install xz-utils
  xz -dk COMPRESSED_DATASET_FILE

Replace COMPRESSED_DATASET_FILE with one of the following files: br-hpc-log_2021.xz, br-hpc-log_2022.xz, br-hpc-log_2023.xz, or br-hpc-log_2024.xz.
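Alternatively, the files can be decompressed with Python's standard lzma module, with no external tools. This is a sketch; the helper name is ours, not part of the repository:

```python
import lzma
import shutil

def decompress_xz(compressed_path: str) -> str:
    """Decompress an .xz file (e.g., br-hpc-log_2021.xz) next to the
    original, returning the decompressed file's path."""
    decompressed_path = compressed_path[: -len(".xz")]
    # Stream the decompressed bytes to disk without loading the
    # multi-gigabyte result into memory.
    with lzma.open(compressed_path, "rb") as src, \
            open(decompressed_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return decompressed_path
```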

How to Read the Dataset Files?

There are several ways to read a JSON-minified file in which each line is a separate record of key-value pairs. Below, a line-by-line approach is provided for Python, which is useful when handling large files or when memory is constrained.

Line-by-line parsing with Python:

from ast import literal_eval

dataset_file = "PATH_TO_DATASET_FILE"
jobs = []
with open(file=dataset_file, mode="r", encoding="utf-8") as file:
    for job_line in file:
        # Normalize the missing-value tokens so each line parses as a
        # Python literal (this assumes the tokens never occur inside
        # field values).
        job_line = job_line.replace("null", "None").replace("nan", "None").replace("\\N", "None")
        job_dict = literal_eval(job_line)
        jobs.append(job_dict)
# Manipulate the list of jobs as you wish...

How to Run the Technical Validation Scripts?

Before using the code available in this repository, specifically for technical validation of the dataset, we recommend you set up a Python Virtual Environment with the package versions listed in the requirements.txt file.

  python3 SCRIPT_FILE

Replace SCRIPT_FILE with one of the following files: consistency_of_fields.py, data_distribution_of_numeric_fields.py, statistics_of_all_fields.py, top_x_occurrences_of_nominal_fields.py, or validation_of_predefined_schema.py.

Publications

Use of the BR-HPC-Log dataset contributed to the following scientific papers:

  • Nunes, A. L. et al. Prediction of Reservoir Simulation Jobs Times Using a Real-World SLURM Log. In Anais do XXIV Simpósio em Sistemas Computacionais de Alto Desempenho (SSCAD), 49–60, https://doi.org/10.5753/wscad.2023.235649 (2023);

  • Nunes, A. L. et al. A Framework for Executing Long Simulation Jobs Cheaply in the Cloud. In 2024 IEEE International Conference on Cloud Engineering (IC2E), 233–244, https://doi.org/10.1109/IC2E61754.2024.00033 (2024);

  • Lima, M. et al. Modelos de Predição do Tempo de Jobs Aplicados a um Ambiente de Produção de Alto Desempenho. In Anais do XXV Simpósio em Sistemas Computacionais de Alto Desempenho (SSCAD), 25–36, https://doi.org/10.5753/sscad.2024.244537 (2024);

  • Nunes, A. L. et al. Two-Step Estimation Strategy for Predicting Petroleum Reservoir Simulation Jobs Runtime on an HPC Cluster. Concurrency and Computation: Practice and Experience 37, e70026, https://doi.org/10.1002/cpe.70026 (2025).
