
6.4 Common errors

robert-sanfeliu edited this page Feb 3, 2025 · 3 revisions

To diagnose errors, you need access to the Proactive UI and to the Kubernetes cluster that runs NebulOuS core.

Cloud providers:

  • Currently, if SAL is restarted, all resources (cloud accounts, edge devices and applications) are lost. Before a testing session, it is advised to re-register cloud providers even if they were registered before.

  • Use only one region per cloud account.

  • Make sure to follow the Wiki page on how to register the cloud account (https://github.com/eu-nebulous/nebulous/wiki/2.1-Managing-cloud-providers#aws).

  • The "eye" button can only be clicked once. Clicking it a second time causes an error to show in the UI, although NebulOuS itself is not affected. If you need to make changes to the cloud provider after clicking it, you will need to create a new one.

  • After registering a cloud provider and clicking the "eye" button, wait a couple of minutes before using this cloud provider.

App deployment:

  • Once the app is correctly defined, click the "Play" button. Never click the "Play" button twice for the same app. If you need to make changes to an app after it has been deployed (successfully or unsuccessfully), use the clone button to clone the app.
  • Look at the logs of "utility evaluator" in NebulOuS core. They should show some activity. If no activity appears, it is probably due to https://github.com/eu-nebulous/exn-middleware/issues/6. If that happens, all pods on NebulOuS core (except activemq and SAL) need to be restarted.
  • Look at the logs of the optimiser controller. They will show traces stating that the deployment process started. It first looks for node candidates for all app components and then defines and deploys the cluster. If any of these steps fails, you will see a log trace with state = FAILED.
  • If the deployment process starts correctly, you will be able to see the jobs created for your application cluster in the Proactive UI.
  • Look at these jobs. If any of them is stuck at 14% for more than 5 minutes, it means that the SSH credentials provided during cloud registration are invalid or that the assigned security group doesn't allow the SSH connection.
  • If a VM is created on your selected cloud provider, but the deployment fails, check that the folder /tmp/node exists. If it doesn't, it means that NebulOuS couldn't SSH into the VM (missing ports) or that it wasn't allowed to execute commands.
  • To get the IP of the master of the cluster, look at the Proactive logs for that node about 5 minutes after the job starts. Search for the word "PUBLIC_IP".
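
The log checks above can be partly automated. The following is a minimal Python sketch, not part of NebulOuS: it assumes you have already saved the relevant logs as text (for example from `kubectl logs` for the optimiser controller, or copied from the Proactive UI). The function names are hypothetical, and the exact log layout is an assumption; only the markers "state = FAILED" and "PUBLIC_IP" and the 14% / 5 minute rule come from this page.

```python
import re

# Markers taken from this page; the surrounding log layout is an assumption.
FAILED_MARKER = re.compile(r"state\s*=\s*FAILED")
PUBLIC_IP_MARKER = "PUBLIC_IP"
# Loose IPv4 pattern, used only on lines that mention PUBLIC_IP.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def failed_traces(log_text):
    """Return optimiser controller log lines that report state = FAILED."""
    return [line for line in log_text.splitlines() if FAILED_MARKER.search(line)]


def public_ip_candidates(log_text):
    """Return IPv4-looking strings found on lines that mention PUBLIC_IP."""
    ips = []
    for line in log_text.splitlines():
        if PUBLIC_IP_MARKER in line:
            ips.extend(IPV4.findall(line))
    return ips


def looks_stuck(progress_percent, minutes_at_progress):
    """Apply the rule of thumb above: 14% for more than 5 minutes."""
    return progress_percent == 14 and minutes_at_progress > 5


if __name__ == "__main__":
    # Hypothetical log excerpt, for illustration only.
    sample = (
        "INFO deployment process started\n"
        "WARN app=demo state = FAILED\n"
        "PUBLIC_IP=203.0.113.7\n"
    )
    print(failed_traces(sample))
    print(public_ip_candidates(sample))
    print(looks_stuck(14, 6))
```

If `looks_stuck` returns true for a job, check the SSH credentials and the security group as described above before retrying with a cloned app.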

App re-configuration:

Once the app has been deployed, you can look at the logs of the solver component that runs on the application cluster master. These logs indicate when SLO violations and metrics are received. If an SLO violation is received but the solver does not know the value of one of the metrics used in the utility function, it will show an error trace.
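
To illustrate the condition described above, here is a small Python sketch. It is not the solver's actual code: the function name and the metric names are hypothetical, and it only shows the idea that every metric used in the utility function needs a known value before an SLO violation can be handled.

```python
def missing_metrics(required_metrics, known_values):
    """Return the names of required metrics that have no known value yet."""
    return sorted(
        name for name in required_metrics if known_values.get(name) is None
    )


if __name__ == "__main__":
    # Hypothetical metric names, for illustration only.
    required = {"cpu_usage", "latency"}
    known = {"cpu_usage": 0.5}
    print(missing_metrics(required, known))
```

A non-empty result corresponds to the situation in which the solver would show an error trace, so it is worth checking that all metrics used in the utility function are being published.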
