
[SPARK-55793][CORE] Add multiple log directories support to SHS #54575

Open
sarutak wants to merge 6 commits into apache:master from sarutak:shs-multi-log-dirs

sarutak commented Mar 2, 2026

What changes were proposed in this pull request?

This PR proposes to add multiple log directories support to SHS, allowing it to monitor event logs from multiple directories simultaneously.

This PR extends spark.history.fs.logDirectory to accept a comma-separated list of directories (e.g., hdfs:///logs/prod,s3a://bucket/logs/staging). Directories can be on the same or different filesystems. Also, a new optional config spark.history.fs.logDirectory.names is added which allows users to assign display names to directories by position (e.g., Production,Staging). Empty entries fall back to the full path. Duplicate display names are rejected at startup.
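
A minimal sketch of the parsing rules described above, written in Python for brevity rather than taken from the PR. The function name `resolve_log_sources` is hypothetical, and the handling of surplus names is an assumption because the description does not cover that case.

```python
def resolve_log_sources(log_directory: str, names: str = "") -> dict[str, str]:
    """Map each display name to its log directory, pairing them by position."""
    dirs = [d.strip() for d in log_directory.split(",") if d.strip()]
    # Keep empty name entries so that positions still line up with dirs.
    name_list = [n.strip() for n in names.split(",")] if names else []

    sources: dict[str, str] = {}
    for i, path in enumerate(dirs):
        # An empty or missing entry falls back to the full path.
        # Assumption: names beyond the number of directories are ignored.
        name = name_list[i] if i < len(name_list) and name_list[i] else path
        if name in sources:
            # Duplicate display names are rejected at startup.
            raise ValueError(f"duplicate display name: {name}")
        sources[name] = path
    return sources
```

For example, `resolve_log_sources("hdfs:///logs/prod,s3a://bucket/logs/staging", "Production,")` labels the first directory `Production` and the second with its full path.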

Behavior of existing spark.history.fs.* settings with multiple directories:

All existing settings apply globally — there are no per-directory configurations.

| Setting | Behavior |
| --- | --- |
| update.interval | One scan cycle covers all directories sequentially |
| cleaner.interval | One cleaner cycle operates on the unified listing across all directories |
| cleaner.maxAge | Applied to each log entry regardless of which directory it belongs to |
| cleaner.maxNum | Total count across all directories; oldest entries are removed first regardless of directory |
| numReplayThreads | Thread pool is shared across all directories |
| numCompactThreads | Thread pool is shared across all directories |
| eventLog.rolling.maxFilesToRetain | Applied per-directory independently |
| update.batchSize | Applied per-directory independently |

Regarding UI changes, a "Log Source" column is added to the History UI table showing the display name (or full path) for each application, with a tooltip showing the full path.

[Screenshot: all-log-dirs]

Users can filter applications by their log directory using the "Filter by Log Source" dropdown.
[Screenshot: filter-by-log-dir]

The Event log directory section in the History UI collapses into a <details>/<summary> element when multiple directories are configured.
[Screenshots: unexpand, expand]

Why are the changes needed?

Some organizations run multiple clusters and keep a separate log directory for each cluster. If SHS supports multiple log directories, it can serve as a single endpoint for viewing event logs from all of them.

Does this PR introduce any user-facing change?

Yes, but existing users are not affected.

How was this patch tested?

Manually confirmed the Web UI as shown in the screenshots above, and added new tests.

Was this patch authored or co-authored using generative AI tooling?

Kiro CLI / Opus 4.6

dongjoon-hyun left a comment

Thank you, @sarutak.

I understand the prod and staging use cases.

What happens when there is a conflict among the log directories? For example, what if a user abuses this as a kind of multi-tier log management like the following, copying logs from shorterm to longterm? Of course, the sync operation is non-atomic.

  • hdfs://spark-events/shorterm
  • hdfs://spark-events/longterm

What are the semantics of the ordering in the config value, especially given SPARK-52914?

dongjoon-hyun commented

Could you fix the CI failures?

[info] *** 24 TESTS FAILED ***
[error] Failed: Total 4431, Failed 24, Errors 0, Passed 4407, Ignored 28, Canceled 6
[error] Failed tests:
[error] 	org.apache.spark.deploy.history.RocksDBBackendHistoryServerSuite
[error] 	org.apache.spark.deploy.history.LevelDBBackendHistoryServerSuite
[error] (core / Test / test) sbt.TestsFailedException: Tests unsuccessful

dongjoon-hyun changed the title from "[SPARK-55793][WEBUI] Add multiple log directories support to SHS" to "[SPARK-55793][CORE] Add multiple log directories support to SHS" on Mar 2, 2026

sarutak commented Mar 2, 2026

@dongjoon-hyun Thank you for your interest.

> What happens when there is a conflict among the log directories? For example, what if a user abuses this as a kind of multi-tier log management like the following, copying logs from shorterm to longterm? Of course, the sync operation is non-atomic.
>
> hdfs://spark-events/shorterm
> hdfs://spark-events/longterm

Each event log file is tracked by its full path, which is the key in LogInfo. So if the same application's event log exists in both directories, the two copies are treated as separate entries.
I didn't anticipate this kind of usage. During a non-atomic copy, the incomplete log file in the destination directory may fail to parse or may show incomplete information temporarily. However, on the next scan cycle, shouldReloadLog (invoked through checkForLogs) detects the file size change and re-parses the file, so the entry self-corrects once the copy completes.
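
A simplified Python model of the behavior described above, for illustration only. The `LogInfo` fields and the `should_reload` and `scan` helpers are hypothetical stand-ins for the Scala code, and the real shouldReloadLog logic is more nuanced than a plain size comparison.

```python
from dataclasses import dataclass


@dataclass
class LogInfo:
    path: str       # full path, used as the key in the listing
    file_size: int  # size observed at the last scan


def should_reload(listing: dict[str, LogInfo], path: str, current_size: int) -> bool:
    """Reload a log that is new or whose size changed since the last scan."""
    known = listing.get(path)
    return known is None or known.file_size != current_size


def scan(listing: dict[str, LogInfo], observed: dict[str, int]) -> list[str]:
    """Run one scan cycle and return the paths that were (re)parsed."""
    reloaded = []
    for path, size in observed.items():
        if should_reload(listing, path, size):
            listing[path] = LogInfo(path, size)
            reloaded.append(path)
    return reloaded
```

Because the key is the full path, a log copied from shorterm to longterm yields two independent entries, and a partially copied file is picked up again on the next cycle once its size changes.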

> What are the semantics of the ordering in the config value, especially given SPARK-52914?

The ordering of directories in the config value has no semantic meaning. All directories are scanned equally in each polling cycle (checkForLogs iterates over all logDirs), and the order does not affect priority.

On-demand loading operates per log file within checkForLogsInDir, which is called independently for each directory. There is no cross-directory interaction, so I believe multi-directory support and on-demand loading are orthogonal and work together without issues.
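
The scan structure described above can be sketched as follows, in Python for illustration. The function names mirror checkForLogs and checkForLogsInDir, but the bodies are assumptions: each directory is modeled as a list of new log file names, and update.batchSize is modeled as a per-directory cap on how many of them one cycle processes.

```python
def check_for_logs_in_dir(files: list[str], batch_size: int) -> list[str]:
    """Process at most batch_size new logs from a single directory."""
    return files[:batch_size]


def check_for_logs(log_dirs: dict[str, list[str]], batch_size: int) -> list[str]:
    """One polling cycle: visit every directory, each handled independently."""
    processed: list[str] = []
    for files in log_dirs.values():
        # No state is shared between directories, so config order only
        # changes the order of the result, not its contents.
        processed.extend(check_for_logs_in_dir(files, batch_size))
    return processed
```

Swapping the order of the directories in `log_dirs` changes the order in which logs are visited within a cycle, but not which logs are processed.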

dongjoon-hyun commented

Thank you. This is a nice feature. I'll try to test more seriously.

dongjoon-hyun left a comment

We need a clearer definition of how this interacts with the existing spark.history.fs.* configurations. At first glance:

  • Do you want to have per-directory configurations in the future?
  • For now, is spark.history.fs.update.interval supposed to apply to one scan across all directories?
  • Is spark.history.fs.cleaner.interval also supposed to apply to one scan across all directories?
  • When spark.history.fs.cleaner.maxNum is applied,
    • This PR will consider the total number of files across all directories, right?
    • Which directory will be selected as the victim in case of a tie?

Since this introduces some ambiguity, could you revise the PR title and provide the corresponding documentation update together in this PR?


sarutak commented Mar 4, 2026

@dongjoon-hyun Thank you for your feedback.

> Do you want to have per-directory configurations in the future?

I considered that per-directory configurations (e.g. spark.history.fs.cleaner.*) might be helpful, but such configurations are not supported, at least in this PR. I'd like to start with simple global settings and improve based on user feedback.

> For now, is spark.history.fs.update.interval supposed to apply to one scan across all directories?

Yes.

> Is spark.history.fs.cleaner.interval also supposed to apply to one scan across all directories?

Yes.

> When spark.history.fs.cleaner.maxNum is applied,
> This PR will consider the total number of files across all directories, right?
> Which directory will be selected as the victim in case of a tie?

Yes, the property is applied to the total number of log entries across all directories. As the updated documentation says, when the limit is exceeded, the oldest completed attempts are deleted first regardless of which directory they belong to.
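
A rough Python sketch of that global selection, for illustration only. The `Attempt` fields and the `select_for_deletion` helper are hypothetical, and the tie-breaking shown (a stable sort that keeps listing order) is an assumption rather than documented behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    log_path: str      # full path, whichever directory it lives in
    last_updated: int  # timestamp used to decide which attempt is oldest
    completed: bool    # in-progress attempts are never selected here


def select_for_deletion(attempts: list[Attempt], max_num: int) -> list[Attempt]:
    """Pick the oldest completed attempts once the global count exceeds max_num."""
    excess = len(attempts) - max_num
    if excess <= 0:
        return []
    # The directory plays no part in the ordering. sorted() is stable, so
    # attempts with equal timestamps keep their listing order (assumption).
    completed = sorted(
        (a for a in attempts if a.completed), key=lambda a: a.last_updated
    )
    return completed[:excess]
```

With four entries spread over two directories and `max_num=2`, the two oldest completed attempts are chosen even if they come from different directories.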

> Since this introduces some ambiguity, could you revise the PR title and provide the corresponding documentation update together in this PR?

Updated. (You said to revise the PR title, but I assumed that was a typo for the PR description, so I've updated only the description.)

dongjoon-hyun left a comment

Thank you for updating.

dongjoon-hyun left a comment

+1, I support @sarutak's proposal and this PR's approach. Thank you.

cc @mridulm, @yaooqinn, @LuciferYang, too.

-->

<script id="history-summary-template" type="text/html">
<div class="row" style="margin-bottom: 10px;">

Inline styles should use Bootstrap 5 utility classes (per SPARK-55775 which was recently merged):

Suggested change
- <div class="row" style="margin-bottom: 10px;">
+ <div class="row mb-2">

Similarly below:

  • style="margin-right: 10px;" → class="me-2"
  • style="display: inline-block; width: auto;" → class="d-inline-block w-auto"

App Name
</span>
</th>
<th>

This uses the old Bootstrap 4 attributes (data-toggle, data-placement). After the BS5 migration (SPARK-55753), these should be:

Suggested change
<th>
<span data-bs-toggle="tooltip" title="Log directory where this application's event log is stored.">

Note: data-bs-placement="top" is not needed since BS5 defaults to top (SPARK-55778).

@@ -16,6 +16,14 @@
-->

<script id="history-summary-template" type="text/html">

The table class should include table-hover for row highlight on mouseover, consistent with SPARK-55784 which adds table-hover to all Spark UI tables.

{dirs.map(d => <li>{d}</li>)}
</ul>
</details>
</li>

For the collapsible log directories section, please make sure to follow the BS5 Collapse API pattern established in SPARK-55773 (data-bs-toggle="collapse", data-bs-target, aria-expanded, etc.) rather than custom JS toggle logic.


sarutak commented Mar 4, 2026

Thanks @yaooqinn for your feedback. I've updated.
