Skip to content

Conversation

@psafont
Copy link
Member

@psafont psafont commented Oct 29, 2025

An xcp-ng user reported a failure when enabling HA, with the only error being a Not_found, with this change we now know that it's because the IP of the coordinator is not present in the local database's ha_peers value.

While doing this:

  • I also removed several finds and hd to reduce these kind of exceptions and make the problematic uses easier to find in the future
  • The localdb interface was rebustified by using options
  • the cluster_stack values were reified and consolidated into constants, they are problematic because they are strings instead of an enum, but we can tame them a bit

@psafont psafont marked this pull request as draft October 30, 2025 08:36
@psafont
Copy link
Member Author

psafont commented Oct 30, 2025

I thought I had make-checked it after rebasing on top of master. Some code changes used the changed functions, so I need to make further changes

Comment on lines 1904 to +1934
List.iter
(fun sr ->
let vdi = Xha_statefile.find_or_create ~__context ~sr ~cluster_stack in
statefile_vdis := vdi :: !statefile_vdis
)
srs ;
[sr] ;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could drop this List.iter since the list only has a single element

Comment on lines +2041 to +2073
List.iter
(fun (_, exn) ->
(* Perform a disable since the pool HA state isn't consistent *)
error "Attempting to disable HA pool-wide" ;
Helpers.log_exn_continue
"Disabling HA after a failure during enable" disable_internal
__context ;
raise exn
)
errors ;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and above changed the code from raising the first exception to raising all of them one after another - I guess it'd be hard for anything to depend on this, so it should be alright (might increase noise in the logs though)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we don't have resumable exceptions, only the first one is actually reported, or none at all

@psafont psafont force-pushed the dev/pau/haxn branch 2 times, most recently from 75ce642 to 52d1637 Compare October 30, 2025 10:19
@psafont psafont requested a review from last-genius October 30, 2025 10:20
@psafont psafont marked this pull request as ready for review October 30, 2025 10:33
@psafont
Copy link
Member Author

psafont commented Oct 30, 2025

Fixed another bug because the cluster stack values are stringy, so I created a variant to model them. I would have like to use an enum at the API level, but that is very invasive and we know how problematic that can be

@psafont psafont requested a review from BengangY October 30, 2025 10:37
UNNECESSARY_LENGTH=$(local_grep "List.length.*=+\s*0")
UNNECESSARY_LENGTH=$((UNNECESSARY_LENGTH+$(local_grep "0\s*=+\s*List.length")))
UNNECESSARY_LENGTH=$((UNNECESSARY_LENGTH+$(local_grep "List.length.*\s>\s*0")))
UNNECESSARY_LENGTH=$((UNNECESSARY_LENGTH+$(local_grep "List.length.*\s<>\s*0")))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest exchanging List.length.*\s>\s*0 and 0\s*<\s*List.length to bring similar logic code closer.


let ok_or_raise map_error = function Ok v -> v | Error exn -> map_error exn

let master_address_exn __FUN e =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

master_address_exn, master_uuid_exn, and master_not_in_liveset_exn are similar, and only msg is different between them. Could they be combined into one function with an additional argument, e.g. type/kind, etc?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That changes the current functions to have 3 lines instead of 4. It's not worth the loss of locality, IMO

| [pbd] ->
pbd
| pbd :: pbds ->
(* shouldn't happen... delete all but first pbd to make db consistent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be logged?

(** Retrieves the value for the key, returns a value when it's found and is a
valid int, otherwise is [None] *)

val get_of_string : (string -> 'a option) -> string -> 'a option
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reads a bit complicated and difficult to grasp. I believe this could be phrased as a monad .

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the best name I could muster. It's a get and then applying a user-provided function through bind. I really don't know how to name it in a succinct way.

The Not_found and hd exceptions keep popping up, and it's difficult to
find them when there are no backtraces logged.
Remove usages if them, even if they are not problematic so the actual
problematic ones can be flushed out over time.

Signed-off-by: Pau Ruiz Safont <pau.safont@vates.tech>
Until now, the valid string values were not written anywhere, change the
situation.

Ideally this would be done using an enum in the idl, but unfortunately
this changes types of existing parameters in API calls, so it's quite
risky. Instead have a conservative approach to only enumerate the valid
values and make Constants the only source of truth for these values,
including default ones, which were scatterred around previously.

Signed-off-by: Pau Ruiz Safont <pau.safont@vates.tech>
Previously it was opened if the default stack was selected, but this
could actually be different from XHAd

Signed-off-by: Pau Ruiz Safont <pau.safont@vates.tech>
Allows callers to avoid exceptions, the previous get is now called
get_exn so it's clear which users have to be have to be mindful of
exceptions.

Not all get_exn were converted to get, previous behaviour was
widespread, and doing the change without changing behaviour is not
trivial, better to do it only once its detected.

Signed-off-by: Pau Ruiz Safont <pau.safont@vates.tech>
Joining the liveset has been found to raise `Not_found` in some cases.
Transform these exceptions to others that are more readable and show the
exact cause.

Signed-off-by: Pau Ruiz Safont <pau.safont@vates.tech>
Signed-off-by: Pau Ruiz Safont <pau.safont@vates.tech>
When joining a liveset, Not_found errors could be raised, but the users
did not have information on what was that it could not be found. Add
which IP address and which UUID could not be found.

Signed-off-by: Pau Ruiz Safont <pau.safont@vates.tech>
@psafont psafont added this pull request to the merge queue Nov 3, 2025
Merged via the queue into xapi-project:master with commit ae866c7 Nov 3, 2025
16 checks passed
@psafont psafont deleted the dev/pau/haxn branch November 6, 2025 14:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants