HDDS-14084. Log details of ineligible nodes in SCMCommonPlacementPolicy#filterNodesWithSpace #9439
Conversation
…cks during pipeline node selection
sodonnel left a comment
I think this change makes sense. One concern is that the logs could be emitted too frequently, but for that to happen there must be a lot of nodes in the cluster that are out of space on all disks. That points to a wider cluster issue that needs to be resolved by balancing, adding capacity, or removing data. With the logs at debug level, the problem stays hidden unless someone thinks to enable debug on this class, and system-wide debug is very noisy!
This method is called for several reasons; one of them is collecting metrics, which happens every 30 seconds according to #9418 (though I guess it depends on Prometheus and other settings). My other concern is that this provides little information, only the datanode ID/address and the requested space. Please see:
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/SCMCommonPlacementPolicy.java
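One hypothetical way to address the frequency concern above is to rate-limit the message so it fires at most once per interval, regardless of how often the placement check runs. This is only a sketch, not part of the patch; all names are illustrative.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical rate limiter for the ineligible-node log message, sketched to
// address the "emitted too frequently" concern. Not part of the patch;
// all names are illustrative.
public class LogThrottle {
    private final long intervalMillis;
    private final AtomicLong lastEmitMillis;

    public LogThrottle(long intervalMillis) {
        this.intervalMillis = intervalMillis;
        // Seed so the very first call is allowed through.
        this.lastEmitMillis = new AtomicLong(-intervalMillis);
    }

    /** Returns true at most once per interval; thread-safe via CAS. */
    public boolean shouldLog(long nowMillis) {
        long last = lastEmitMillis.get();
        return nowMillis - last >= intervalMillis
            && lastEmitMillis.compareAndSet(last, nowMillis);
    }
}
```

Callers would guard the INFO log with `shouldLog(System.currentTimeMillis())`, so a cluster-wide out-of-space condition still surfaces without flooding the SCM log on every metrics pass.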
… and refactored SCMCommonPlacementPolicy.hasEnoughSpace to delegate to new checkSpace method
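The delegation described in this commit can be sketched roughly as follows. Only the method names hasEnoughSpace and checkSpace come from the PR; the result type's fields and the surrounding class are illustrative, not the actual patch.

```java
// Rough sketch of the refactoring named in the commit message: the boolean
// hasEnoughSpace check delegates to a checkSpace method that returns a richer
// result, so callers can log details of ineligible nodes. Only the names
// hasEnoughSpace/checkSpace come from the PR; everything else is illustrative.
public class CapacityCheckSketch {
    static final class CapacityInfo {
        final boolean enoughSpace;
        final String detail;

        CapacityInfo(boolean enoughSpace, String detail) {
            this.enoughSpace = enoughSpace;
            this.detail = detail;
        }
    }

    // Returns both the verdict and a human-readable reason for logging.
    static CapacityInfo checkSpace(long availableBytes, long requiredBytes) {
        if (availableBytes >= requiredBytes) {
            return new CapacityInfo(true, "sufficient space");
        }
        return new CapacityInfo(false, String.format(
            "short by %d bytes (required=%d, available=%d)",
            requiredBytes - availableBytes, requiredBytes, availableBytes));
    }

    // The original boolean API is preserved as a thin wrapper, so existing
    // callers that only need a yes/no answer are unaffected.
    static boolean hasEnoughSpace(long availableBytes, long requiredBytes) {
        return checkSpace(availableBytes, requiredBytes).enoughSpace;
    }
}
```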
adoroszlai left a comment
Thanks @jasonosullivan34 for updating the patch.
if (!(node instanceof DatanodeInfo)) {
  node = nodeManager.getDatanodeInfo(node);
}
nit: avoid whitespace-only change
@@ -0,0 +1,90 @@
package org.apache.hadoop.hdds.scm;
License header is needed in all files. Example:
ozone/hadoop-hdds/config/src/main/java/org/apache/hadoop/hdds/conf/ConfigFileAppender.java, lines 1 to 17 in 131eec0:
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 * <p>
 * http://www.apache.org/licenses/LICENSE-2.0
 * <p>
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
-    .filter(dn -> !hasEnoughSpace(dn, minRatisVolumeSizeBytes, containerSize, conf)
-        && !hasEnoughCommittedVolumeSpace(dn, blockSize))
-    .count();
+    .filter((dn) -> {
+      SCMDatanodeCapacityInfo info = checkSpace(dn, minRatisVolumeSizeBytes, containerSize, conf);
+      return !info.hasEnoughSpace() && !hasEnoughCommittedVolumeSpace(dn, blockSize);
+    }).count();
Let's keep the original code in this file since it doesn't need the SCMDatanodeCapacityInfo.
Thanks @jasonosullivan34 for updating the patch. Can you please check the test timeout in
This PR has been marked as stale due to 21 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in 7 days.
Thanks @jasonosullivan34 for taking this up.
Do we really want to include volume-level information here? Of course more information is better, but even with 20 full datanodes and 20 volumes per datanode, that will be a lot of logging. Excessive logging costs write performance as well.
We need to think of a different way to surface the fact that many datanodes are full, with volume-level information. @priyeshkaratha is the capacity distribution project that you're working on doing something in this area?
I think this change is helpful, but we should restrict it to datanode stats (capacity, used, available, reserved, committed). Curious to hear what others think.
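A datanode-level summary restricted to aggregate stats, as suggested here, might look like the following. This is a hypothetical sketch, not the actual patch; the class and field names are illustrative (the hostname is used instead of the UUID per the review feedback below).

```java
// Hypothetical datanode-level summary for the insufficient-space log message,
// restricted to aggregate stats (capacity, used, available, reserved,
// committed) with no per-volume detail. All names are illustrative.
public class DatanodeSpaceSummary {
    private final String hostname;
    private final long capacity;
    private final long used;
    private final long available;
    private final long reserved;
    private final long committed;

    public DatanodeSpaceSummary(String hostname, long capacity, long used,
                                long available, long reserved, long committed) {
        this.hostname = hostname;
        this.capacity = capacity;
        this.used = used;
        this.available = available;
        this.reserved = reserved;
        this.committed = committed;
    }

    /** One-line message for a single ineligible datanode, no volume detail. */
    public String toLogString(long requiredBytes) {
        return String.format(
            "Datanode %s cannot accommodate %d bytes: capacity=%d, used=%d, "
                + "available=%d, reserved=%d, committed=%d",
            hostname, requiredBytes, capacity, used, available, reserved,
            committed);
    }
}
```

A single line per ineligible node keeps the log volume bounded by the node count rather than node count times volume count.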
public String getInsufficientSpaceMessage() {
  if (hasEnoughSpace()) {
    return String.format("Datanode %s has sufficient space (data: %d bytes required, metadata: %d bytes required)",
        datanodeDetails.getUuidString(), dataVolumeInfo.requiredSpace, metaVolumeInfo.requiredSpace);
Use the Datanode IP address or hostname instead of the UUID. Makes debugging easier.
@Override
public String toString() {
  return "SCMDatanodeCapacityInfo{" +
      "datanode=" + datanodeDetails.getUuidString() +
Same here, use IP address or hostname.
}

public void addFullDataVolume(StorageReportProto report, long usableSpace) {
  this.dataVolumeInfo.addFullVolume(new FullVolume(report.getStorageUuid(), usableSpace));
Better to use the volume path instead of the storage UUID.
if (!hasEnoughDataSpace()) {
  return String.format("Datanode %s has no volumes with enough space to allocate %d bytes for data." +
      " data=%s, metadata=%s", datanodeDetails.getUuidString(), dataVolumeInfo.requiredSpace, dataVolumeInfo,
      metaVolumeInfo);
} else {
  return String.format("Datanode %s has no volumes with enough space to allocate %d bytes for metadata." +
      " data=%s, metadata=%s", datanodeDetails.getUuidString(), metaVolumeInfo.requiredSpace, dataVolumeInfo,
      metaVolumeInfo);
}
As I said above, perhaps it's better to only include datanode-level total space, used, available, reserved, committed, etc. and exclude volumes.
Good point, but do you think volume stats should be logged here as well, or is total DN capacity, used space, etc. good enough? @sodonnel
@adoroszlai yes, the AvailableSpaceFilter approach is pretty good; I'm just wondering if the logs will be excessive if the volumes are printed as well. Let's see if @priyeshkaratha is trying to make this observable in some other way.
@jasonosullivan34 let's discuss here and get consensus on the approach so you will not have to make unnecessary changes.
I guess it doesn't need to print each volume (or even full volumes) in SCM. A datanode-level summary may be a good middle ground.
What changes were proposed in this pull request?
HDDS-14084. Update the log level of the metadata / data size checks during pipeline node selection to INFO.
Please describe your PR in detail:
Updates the log level for the metadata and data size checks to INFO. These checks are performed during node selection when adding nodes to a pipeline.
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-14084
How was this patch tested?
Workflow run on a forked git repo.