Tango should skip unavailable backends when using distDocker #248

Open
@20wildmanj

From Chaskiel, regarding an error that recently occurred on CMU prod:

The error was triggered by an operating system error (an LDAP lookup failed when Tango ssh'd to the docker node). There has been only one such failure since the beginning of the semester.

ERROR|2024-02-08 10:43:41,556|TangoREST|addJob request failed: Command '['ssh', '-o', 'BatchMode=yes', '-i', '/usr/local/lib/Tango/vmms/id_rsa', '-o', 'StrictHostKeyChecking=no', '-o', 'GSSAPIAuthentication=no', '[email protected]', '(docker images)']' returned non-zero exit status 255.

For the devs:
The failure was in getImages, not in waitVM (which has retry logic) or in a later step (where retries are more complicated because you'd have to restart the process from the copyIn phase).
It may make sense for DistDocker.getImages (and maybe DistDocker.getVms) to skip backends that are unavailable.
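
A rough sketch of what "skip unavailable backends" could look like. This is not Tango's actual code: `get_images`, its parameters, and the injectable `run` argument are hypothetical names chosen for illustration, and the 30-second timeout is an arbitrary assumption. It only shows the idea of catching the ssh failure per host and moving on instead of failing the whole addJob request.

```python
import logging
import subprocess

log = logging.getLogger("DistDockerSketch")  # hypothetical logger name


def get_images(hosts, ssh_flags, host_user, run=subprocess.check_output):
    """Collect docker image names from every reachable host.

    Hosts whose ssh command fails are logged and skipped rather than
    aborting the whole lookup. `run` is injectable so the function can
    be exercised without real ssh access (an assumption of this sketch).
    """
    images = set()
    unavailable = []
    for host in hosts:
        cmd = (["ssh"] + list(ssh_flags)
               + ["%s@%s" % (host_user, host), "(docker images)"])
        try:
            output = run(cmd, stderr=subprocess.STDOUT, timeout=30)
        except (subprocess.CalledProcessError,
                subprocess.TimeoutExpired, OSError) as err:
            # ssh exits with 255 when the connection itself fails,
            # which is the case reported in this issue.
            log.warning("Skipping unavailable backend %s: %s", host, err)
            unavailable.append(host)
            continue
        # Skip the header row of `docker images`; the first column
        # of each remaining row is the repository name.
        for line in output.decode("utf-8", "replace").splitlines()[1:]:
            fields = line.split()
            if fields:
                images.add(fields[0])
    return images, unavailable
```

Returning the list of skipped hosts alongside the images lets the caller decide whether an empty result means "no images" or "no reachable backends", which probably matters for the error reported back through TangoREST.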
