Skip to content

Conversation

@mbharatheesha
Copy link
Contributor

PR Type

  • New Feature

Description

This PR introduces changes to allow for dynamically reconfiguring the launch_hw parameter for the velodyne driver so that the driver can be started at run time instead of always at startup

Review Procedure

  1. I guess the first step would be to establish consensus on whether the support for making launch_hw parameter reconfigurable is desired.
  2. Then it would be good to agree on whether the approach taken in this PR is ok to go ahead with for other lidar types or a discussion on alternate approaches.
  3. Reproduce the test results that are listed below

I tested these changes by running the driver with an actual VLP16 connected by doing the following tests:

  • Start the driver with launch_hw set to false. Then change the parameter to true. The driver successfully starts communicating with connected hardware and publishes pointclouds
  • Start the driver with launch_hw set to true. The driver publishes pointclouds as expected. Then change the parameter to false. The driver stops publishing pointclouds.
  • If the parameter was already true and it is set to true again, nothing happens. Same if the parameter was false and set to false again.

Remarks

  1. I changed the C++ standard to C++ 20 because jthread support is integrated into stl. And jthread simplifies the stopping and starting of threads when launch_hw is changed at run time. The behavior in this PR could also be achieved with std::thread and C++ 17, but I just felt the possibility to use request_stop and the stop_token features made for better readability.
  2. To avoid errors regarding redeclaration of already declared parameters on dynamic reconfigure, I added conditionals to only declare a parameter if it didn't already exist.

While the desired dynamic reconfiguration works without any errors, there is one unexplained behavior that I would like some feedback on:

  • It seems like the while loop in the thread gets stuck on packet_queue_.pop() and doesn't allow the thread to finish. For example, if I try to do a decoder_thread_.join() in the clean_up_on_reconfigure() function before deleting the hw interface and hw monitor objects, the call to clean_up_on_reconfigure() never finishes.

Pre-Review Checklist for the PR Author

  • Assign PR to reviewer

Checklist for the PR Reviewer

Reviewers should check the checkboxes below before approval.

  • Commits are properly organized and messages are according to the guideline
  • (Optional) Unit tests have been written for new behavior
  • PR title describes the changes

Post-Review Checklist for the PR Author

PR Author should check the checkboxes below before merging.

  • All open points are addressed and tracked via issues or tickets

CI Checks

  • Build and test for PR: Required to pass before the merge.

Signed-off-by: Mukunda Bharatheesha <mukunda.bharatheesha@nobleo.nl>
Signed-off-by: Mukunda Bharatheesha <mukunda.bharatheesha@nobleo.nl>
@mbharatheesha
Copy link
Contributor Author

@mojomex I was working on this feature for our application that required us to enable or disable lidar hardware at runtime. I thought this might also be interesting to other users and hence thought of creating this PR. If this is a desired feature, I can also work on similar features for the supported lidar types.

@mbharatheesha mbharatheesha changed the title feat(velodyne): Enable dynamic reconfigure of launch_hw parameter feat(velodyne): enable dynamic reconfigure of launch_hw parameter Feb 6, 2025
Signed-off-by: Mukunda Bharatheesha <mukunda.bharatheesha@nobleo.nl>
@mbharatheesha
Copy link
Contributor Author

I made improvements to the decoder thread management. Now, the decoder thread is monitored and confirmed it exits gracefully before reconfiguration happens. I checked this by giving the decoder thread a custom name and monitor it on htop when launch_hw is false and launch_hw is true.

launch_hw is false

image

We can see that the VelDecThread is running and is sorted to be at the lowest when sorted on % %CPU use. In this configuration, the thread was basically doing nothing as I wasn't publishing anything on velodyne_packets.

Reconfiguring phase after setting launch_hw to true

image
During the reconfiguration process, the thread is exited and joined to the calling thread. And it can also be seen on htop that the VelDecThread disappears from the htop list.

After reconfiguration complete on setting launch_hw to true

image
Once the reconfiguration is complete, the VelDecThread re-appears on the htop list. And now, it is on the top of the pile when sorted on % CPU use. This is natural because it is actually doing the work of processing lidar packets and publishing ROS pointcloud messages.

@mojomex
Copy link
Collaborator

mojomex commented Feb 10, 2025

Hi @mbharatheesha, thanks for the PR!

We've indeed been thinking about this but haven't gotten around to it yet.
This functionality is pretty close to what ROS2 Lifecycle Nodes are aiming to do: launch a node (possibly allocate buffers, validate parameters etc.) and on a service call activate it (connect to sensor etc.).
By implementing this feature using lifecycle nodes, standard tooling could interact with Nebula as well.

While I share your enthusiasm for C++20, there are still some Nebula users (e.g. users of some ECUs) whose compilers don't support C++20. So at the moment, as much as I'd want to move to C++20, we'll have to wait for another bit of time.

Otherwise, super nice PR! Please let me know if you would be up for making those changes or if that is too much work at the moment.

@mbharatheesha
Copy link
Contributor Author

@mojomex Thanks for your feedback.

We've indeed been thinking about this but haven't gotten around to it yet.
This functionality is pretty close to what ROS2 Lifecycle Nodes are aiming to do: launch a node (possibly allocate buffers, validate parameters etc.) and on a service call activate it (connect to sensor etc.).
By implementing this feature using lifecycle nodes, standard tooling could interact with Nebula as well.We've indeed been thinking about this but haven't gotten around to it yet.

I understand. Initially that was something I thought too looking at the recent implementations in Ouster drivers with LifecycleNodes. But, I thought it would make the delta really big and indeed, I wasn't aware of the direction you guys were thinking about. So, I decided against it.

While I share your enthusiasm for C++20, there are still some Nebula users (e.g. users of some ECUs) whose compilers don't support C++20. So at the moment, as much as I'd want to move to C++20, we'll have to wait for another bit of time.

Sure, I can revert the implementation to C++17 and use std::thread based approach as was the case before.

Otherwise, super nice PR! Please let me know if you would be up for making those changes or if that is too much work at the moment.

As of the next three weeks, I don't have time to look into the lifecycle implementation approach. After that, I can take a look at this again. In the meanwhile, if you guys think this can be a stand-in, let me know. Otherwise, I am fine to close this PR and re-open a new one later when I get a chance to work on LifeCycle Node approaches.

@mojomex
Copy link
Collaborator

mojomex commented Feb 12, 2025

@drwnz What do you think?

@drwnz
Copy link
Collaborator

drwnz commented Feb 13, 2025

@mbharatheesha @mojomex thanks for proposing this!
As @mojomex has said, long term we definitely want to propose a LifeCycle node approach. However, as this requires some alignment with Autoware, it's not going to be fast. So if this PR can easily be supported in C++17, then I would be in favour of testing and merging it as I think it would also be helpful in some of our use cases.

@mbharatheesha
Copy link
Contributor Author

mbharatheesha commented Feb 13, 2025

@drwnz Alright. I will make that change over the weekend. Also, I think similar PRs for hesai and robosense would be nice. I will also make separate PRs for them to keep focused deltas per supported lidar type.

@knzo25
Copy link
Contributor

knzo25 commented Apr 2, 2025

@mbharatheesha
Do you have any update? 🙏

@mbharatheesha
Copy link
Contributor Author

@knzo25 @mojomex Sorry for my silence on this. I got occupied with other issues and then had some holidays. I just returned and tested the proposed changes in this PR with the latest main. Unfortunately, there are some new issues that seem to show up now:

  1. Before updating to latest main, I was able to seamlessly toggle the launch_hw boolean and it was stable. Now, I notice instability with the toggling. Changing from false -> true works once. But trying to set the value back to false doesn't disable the pointclouds anymore. Although, the set parameter request returns a success message.
  2. It seems some new developments were also made to the transport_drivers. Previously, the driver would crash if the hardware wasn't ready and the driver was started with launch_hw is true. So, I had added a wait loop with a reconnect attempt every 5s until the hardware would become ready. However, it seems now that a similar change is done in the transport_drivers package where there is a silent wait for about 135s and then the driver crashes if hardware is not ready. I may be wrong about this, but I need to investigate.

That said, I also noticed some recent comments from @mojomex here. This idea seems more promising/flexible than the ones proposed in this PR.

So, I am getting more inclined towards Lifecycle implementation to have this sorted out cleanly. I would appreciate some feedback/thoughts about this before proceeding. Because, it seems like a larger group discussion is required before taking up changes like Lifecycle node implementations. Please correct me if I am wrong.

@knzo25
Copy link
Contributor

knzo25 commented Apr 9, 2025

@mojomex How would you like to proceed ? 🙏

@mbharatheesha
Copy link
Contributor Author

mbharatheesha commented Apr 20, 2025

@mojomex @knzo25 @drwnz I took a leap of faith that lifecycle nodes would be considered going forward and implemented the same in this branch

Good news is I could do initial tests using a VLP 16 with hardware and no hardware and in both cases, the transitions work as expected.

I can do multiple lifecycle transitions, i.e., unconfigured->configure->activate-deactivate->cleanup->configure->activate->deactivate->cleanup->....->shutdown.

When connected to hardware, pointclouds only get published when the lifecycle node is in active state.

I don't have access to other sensor types though. I may need some help with testing those.

Also, it would be great if you could share some thoughts on the direction I have taken. Particularly:

  • Is support needed for both node types, ie., Lifecycle and Non-Lifecycle
  • If only Lifecycle implementation is the way forward, are there any concerns with the implementation so far?

Once I hear back, I will update the tests accordingly and open a PR and close this one.

@mbharatheesha
Copy link
Contributor Author

@mojomex @drwnz @knzo25 Over the last couple of weeks, I did some testing with the lifecycle implementations from the branch mentioned above, by integrating it with our use case. I thought of sharing some insights I gathered that are generally applicable to Lifecycle nodes:

  • To fully utilize the benefit of Lifecycle implementations, it seems imperative to also have a LifecycleManager (similar to what Nav2 provides) that can programmatically manage the lifecycle states in a robust way. Otherwise, I feel it is quite cumbersome to integrate these lifecycle nodes into applications that may not all be using lifecycle approach.
    • I did make a minimal implementation myself but it required quite some effort to get it to work reliably when used in combination with composable nodes that depend on the lifecycle transitions. Essentially, to keep things non-blocking, it may be desirable to use async service call APIs to manage lifecycle transitions programmatically. And this needs very careful management.
  • Further, the available support for event triggered transitions in XML launch files is limited and forces one to use Python launch files to bringup these lifecycle nodes as desired by the use case. And this may not be desired by all users/projects.

I am not against using Lifecycle implementations based on these insights, but it just turned out to be more trouble to use in our usecase. I think it may still be relevant for usecases that this code base mainly targets at.

So, I will still be happy to add lifecycle contributions into this repo. However, I am feeling more inclined towards having both supported. Since most of the "business logic" for the driver is split into separate packages here anyway, it may be a matter of creating <lidar_type>_lifecycle_ros_wrapper.xpp files in addition to what are already available. That way, users can choose whatever suits their application.

But I guess the decision is up to you guys and I can adapt my contribution accordingly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants