Open
Description
I cannot initialize my Ray cluster using ray-on-aml version 0.2.4. I'm running a notebook in the Python 3.8 AzureML environment with the following code:
from ray_on_aml.core import Ray_On_AML
ray_on_aml = Ray_On_AML(ws=ws, compute_cluster="CC-RayWorker-CPU-DS12-v2")
# May take 7 minutes or longer. Check the AML run under ray_on_aml experiment for cluster status.
ray = ray_on_aml.getRay(ci_is_head=True, num_node=2,pip_packages=["ray[air]==2.2.0","ray[data]==2.2.0","torch==1.13.0","fastparquet==2022.12.0", "azureml-mlflow==1.48.0", "pyarrow==6.0.1", "dask==2022.12.0", "adlfs==2022.11.2", "fsspec==2022.11.0"])
While the compute instance initializes successfully, the ray_on_aml job fails in the cluster with the following error:
Cleaning up all outstanding Run operations, waiting 300.0 seconds
1 items cleaning up...
Cleanup took 0.2714250087738037 seconds
Traceback (most recent call last):
File "source_file.py", line 175, in <module>
startRayMaster()
File "source_file.py", line 103, in startRayMaster
ip = socket.gethostbyname(socket.gethostname())
socket.gaierror: [Errno -2] Name or service not known
Retrying due to transient client side error HTTPSConnectionPool(host='westus-0.in.applicationinsights.azure.com', port=443): Max retries exceeded with url: /v2.1/track (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f1ee8697220>: Failed to establish a new connection: [Errno -2] Name or service not known')).
2023-02-16 13:21:17,476 INFO usage_lib.py:516 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2023-02-16 13:21:17,476 INFO scripts.py:702 -- Local node IP: 10.62.79.24
2023-02-16 13:21:19,380 SUCC scripts.py:739 -- --------------------
2023-02-16 13:21:19,380 SUCC scripts.py:740 -- Ray runtime started.
2023-02-16 13:21:19,380 SUCC scripts.py:741 -- --------------------
2023-02-16 13:21:19,380 INFO scripts.py:743 -- Next steps
2023-02-16 13:21:19,381 INFO scripts.py:744 -- To connect to this Ray runtime from another node, run
2023-02-16 13:21:19,381 INFO scripts.py:747 -- ray start --address='10.62.79.24:6379'
2023-02-16 13:21:19,381 INFO scripts.py:763 -- Alternatively, use the following Python code:
2023-02-16 13:21:19,381 INFO scripts.py:765 -- import ray
2023-02-16 13:21:19,381 INFO scripts.py:769 -- ray.init(address='auto')
2023-02-16 13:21:19,381 INFO scripts.py:781 -- To connect to this Ray runtime from outside of the cluster, for example to
2023-02-16 13:21:19,381 INFO scripts.py:785 -- connect to a remote cluster from your laptop directly, use the following
2023-02-16 13:21:19,381 INFO scripts.py:789 -- Python code:
2023-02-16 13:21:19,381 INFO scripts.py:791 -- import ray
2023-02-16 13:21:19,381 INFO scripts.py:792 -- ray.init(address='ray://<head_node_ip_address>:10001')
2023-02-16 13:21:19,381 INFO scripts.py:801 -- To see the status of the cluster, use
2023-02-16 13:21:19,381 INFO scripts.py:802 -- ray status
2023-02-16 13:21:19,381 INFO scripts.py:812 -- If connection fails, check your firewall settings and network configuration.
2023-02-16 13:21:19,381 INFO scripts.py:820 -- To terminate the Ray runtime, run
2023-02-16 13:21:19,381 INFO scripts.py:821 -- ray stop
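The failing call in the traceback, socket.gethostbyname(socket.gethostname()), can be reproduced on its own on a cluster node to check whether the node's hostname resolves at all. Below is a minimal diagnostic sketch, not code from ray-on-aml: the function name resolve_local_ip and the probe address are assumptions chosen for illustration. It tries the same DNS lookup first and, if that fails, falls back to asking the OS which local address it would route from, using a connected UDP socket (no packet is sent).

```python
import socket


def resolve_local_ip(probe_addr=("8.8.8.8", 80)):
    """Return (ip, method) for this machine.

    Hypothetical helper for diagnosis only. `probe_addr` is an assumed
    routable address; it is never actually contacted.
    """
    try:
        # Same lookup that fails in the traceback above.
        return socket.gethostbyname(socket.gethostname()), "dns"
    except OSError:
        # socket.gaierror is a subclass of OSError: hostname did not resolve.
        pass

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket only selects a route and local address.
        sock.connect(probe_addr)
        return sock.getsockname()[0], "udp-route"
    except OSError:
        # No usable route either; report loopback so the caller can tell.
        return "127.0.0.1", "loopback"
    finally:
        sock.close()


if __name__ == "__main__":
    ip, method = resolve_local_ip()
    print(f"local IP: {ip} (found via {method})")
```

If this prints a method other than "dns" on the cluster node, the hostname is not resolvable there, which would point at DNS or hosts-file configuration in the VNet rather than at Ray itself.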
This entire setup is inside a VNet, and all the compute resources were created in the same subnet. Due to certain policies, I am forced to enable 'No Public IP' (npip) on my computes.
Could this be caused by my setup (npip or the NSG), or is it a problem in the library? Any help resolving this would be appreciated.
Thank you