Conversation

@jasonosullivan34 commented Dec 5, 2025

What changes were proposed in this pull request?

HDDS-14084. updating metadata / data size check logs pipeline node selection to info

Please describe your PR in detail:
Updating the log level for the metadata and data size checks to info. These checks are done as part of node selection when adding nodes to a pipeline.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14084

How was this patch tested?

Workflow run on a forked Git repo.

@jasonosullivan34 jasonosullivan34 changed the title HDDS-14084. updating metadata / data size check logs pipeline node selection to info HDDS-14084. updating metadata / data size check logs for pipeline node selection to info Dec 5, 2025
@sodonnel (Contributor) left a comment
I think this change makes sense. One concern is that the logs could be emitted too frequently, but for that to happen there must be a lot of nodes on the cluster that are out of space on all disks. That points to a wider cluster issue that needs to be resolved via balancing, adding capacity, or removing data. With the logs at debug level, the problem is hidden unless someone thinks to enable debug on this class, and system-wide debug is very noisy!

@adoroszlai (Contributor)
One concern is that the logs are emitted too frequently

This method is called for several reasons; one of them is collecting metrics, which happens every 30 seconds according to #9418 (though I guess it depends on Prometheus and other settings).

My other concern is that this provides little information: only the datanode ID/address and the requested space. Please see AvailableSpaceFilter for a better approach: while checking each volume, it keeps track of the ones that are full, which are then printed by toString().
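The pattern described here can be sketched roughly as follows. This is a minimal illustration, not the actual Ozone AvailableSpaceFilter class; the class name, method names, and fields are all assumptions:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the pattern described above: while filtering volumes, remember
// the ones that were too full, then summarize them in toString() so a single
// log line carries the detail. All names here are illustrative, not the real
// Ozone AvailableSpaceFilter API.
public class SpaceFilterSketch {
  private final long requiredBytes;
  private final List<String> fullVolumes = new ArrayList<>();
  private int volumesChecked;

  public SpaceFilterSketch(long requiredBytes) {
    this.requiredBytes = requiredBytes;
  }

  /** Returns true if the volume can fit the request; records full volumes. */
  public boolean hasSpace(String volumePath, long availableBytes) {
    volumesChecked++;
    if (availableBytes >= requiredBytes) {
      return true;
    }
    fullVolumes.add(volumePath + " (available=" + availableBytes + ")");
    return false;
  }

  @Override
  public String toString() {
    return "checked " + volumesChecked + " volume(s) for " + requiredBytes
        + " bytes, full volumes: " + fullVolumes;
  }
}
```

The point of the design is that the caller pays for one summary log line per node rather than one line per volume check.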

@jasonosullivan34 jasonosullivan34 marked this pull request as draft December 8, 2025 16:28
@adoroszlai (Contributor) left a comment

Thanks @jasonosullivan34 for updating the patch.

if (!(node instanceof DatanodeInfo)) {
node = nodeManager.getDatanodeInfo(node);
}


nit: avoid whitespace-only change

Suggested change

@@ -0,0 +1,90 @@
package org.apache.hadoop.hdds.scm;

A license header is needed in all files. Example:

/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* <p>
* http://www.apache.org/licenses/LICENSE-2.0
* <p>
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

Comment on lines 1387 to 1392
.filter(dn -> !hasEnoughSpace(dn, minRatisVolumeSizeBytes, containerSize, conf)
&& !hasEnoughCommittedVolumeSpace(dn, blockSize))
.count();
.filter((dn) -> {
SCMDatanodeCapacityInfo info = checkSpace(dn, minRatisVolumeSizeBytes, containerSize, conf);
return !info.hasEnoughSpace() && !hasEnoughCommittedVolumeSpace(dn, blockSize);
}).count();

Let's keep the original code in this file since it doesn't need the SCMDatanodeCapacityInfo.

@adoroszlai adoroszlai changed the title HDDS-14084. updating metadata / data size check logs for pipeline node selection to info HDDS-14084. Log details of ineligible nodes in SCMCommonPlacementPolicy#filterNodesWithSpace Dec 14, 2025
@adoroszlai (Contributor)

Thanks @jasonosullivan34 for updating the patch. Can you please check test timeout in org.apache.hadoop.ozone.container.placement.TestContainerPlacement?

github-actions bot commented Jan 6, 2026

This PR has been marked as stale due to 21 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in 7 days.

@github-actions github-actions bot added the stale label Jan 6, 2026
@siddhantsangwan (Contributor) left a comment

Thanks @jasonosullivan34 for taking this up.

Do we really want to include volume-level information here? Of course more information is better, but even with 20 full datanodes and 20 volumes per datanode, that'll be a lot of logging. Excessive logging costs write performance as well.

We need to think of a different way to surface the fact that many datanodes are full, with volume-level information. @priyeshkaratha is the capacity distribution project that you're working on doing something in this area?

I think this change is helpful, but we should restrict it to datanode stats (capacity, used, available, reserved, committed). Curious to hear what others think.

public String getInsufficientSpaceMessage() {
if (hasEnoughSpace()) {
return String.format("Datanode %s has sufficient space (data: %d bytes required, metadata: %d bytes required)",
datanodeDetails.getUuidString(), dataVolumeInfo.requiredSpace, metaVolumeInfo.requiredSpace);

Use the datanode IP address or hostname instead of the UUID. It makes debugging easier.

@Override
public String toString() {
return "SCMDatanodeCapacityInfo{" +
"datanode=" + datanodeDetails.getUuidString() +

Same here: use the IP address or hostname.

}

public void addFullDataVolume(StorageReportProto report, long usableSpace) {
this.dataVolumeInfo.addFullVolume(new FullVolume(report.getStorageUuid(), usableSpace));

Better to use the volume path instead of the storage UUID.

Comment on lines +94 to +102
if (!hasEnoughDataSpace()) {
return String.format("Datanode %s has no volumes with enough space to allocate %d bytes for data." +
" data=%s, metadata=%s", datanodeDetails.getUuidString(), dataVolumeInfo.requiredSpace, dataVolumeInfo,
metaVolumeInfo);
} else {
return String.format("Datanode %s has no volumes with enough space to allocate %d bytes for metadata." +
" data=%s, metadata=%s", datanodeDetails.getUuidString(), metaVolumeInfo.requiredSpace, dataVolumeInfo,
metaVolumeInfo);
}

As I said above, it's probably better to include only datanode-level total space, used, available, reserved, committed, etc., and exclude volumes.

@siddhantsangwan (Contributor)

I think this change makes sense. One concern is that the logs are emitted too frequently, but for that to happen, there must be a lot of nodes on the cluster that are out of space on all disks. That points to a wider cluster issue that needs to be resolved via balancing, adding capacity, or removing data. With the logs at debug level, the problem is hidden unless someone thinks to enable debug on this class, and system-wide debug is very noisy!

Good point. But do you think volume stats should be logged here as well, or is total DN capacity, used space, etc. good enough? @sodonnel

@siddhantsangwan (Contributor)

My other concern is that this provides little information, only datanode ID/address and requested space. Please see AvailableSpaceFilter for a better approach: upon checking each volume, it keeps track of ones that are full, which are then printed by toString().

@adoroszlai yes, the AvailableSpaceFilter approach is pretty good; I'm just wondering if the logs will be excessive if the volumes are printed as well. Let's see if @priyeshkaratha is trying to make this observable in some other way.

@siddhantsangwan (Contributor)

@jasonosullivan34 let's discuss here and reach consensus on the approach so you won't have to make unnecessary changes.

@adoroszlai (Contributor)

only datanode ID/address and requested space

wondering if the logs will be excessive if the volumes are printed as well.

I guess it doesn't need to print each volume (or even just the full volumes) in SCM. A datanode-level summary may be a good middle ground.
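The datanode-level middle ground being discussed could look something like this rough sketch. Every name here is illustrative, not the real Ozone DatanodeInfo or node-report API:

```java
// Sketch of a datanode-level summary: one log line per rejected node with
// aggregate stats only, no per-volume detail. All names are assumptions.
public class NodeCapacitySummary {
  private final String address;   // IP or hostname, per the review feedback
  private final long capacity;
  private final long used;
  private final long remaining;
  private final long committed;
  private final long requested;

  public NodeCapacitySummary(String address, long capacity, long used,
      long remaining, long committed, long requested) {
    this.address = address;
    this.capacity = capacity;
    this.used = used;
    this.remaining = remaining;
    this.committed = committed;
    this.requested = requested;
  }

  @Override
  public String toString() {
    return String.format(
        "Datanode %s cannot allocate %d bytes: capacity=%d, used=%d,"
            + " remaining=%d, committed=%d",
        address, requested, capacity, used, remaining, committed);
  }
}
```

Logging one such summary per ineligible node keeps the output bounded by node count rather than node count times volume count.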

@github-actions github-actions bot removed the stale label Jan 7, 2026