Skip to content

Conversation

@zawadzkidiana
Copy link
Contributor

What changed?

  • Added --reason flag (short form -r) to the cadence domain failover command
  • Added Reason field to FailoverDomainRequest and UpdateDomainRequest types
  • Updated the server-side handler to store the reason in FailoverEvent
  • Made sure the reason flows through from CLI to database in the failover history
  • Updated the ToUpdateDomainRequest mapper to pass the reason through

Why?
We needed better visibility into why failovers were happening. Right now when you look at failover history you can see when and where failovers happened, but not why. This makes it hard to debug issues or understand patterns. With this change, operators can provide context like "planned maintenance", "emergency DR", "testing", etc. and it gets stored with the failover event. Makes it way easier to go back and understand what was going on during production incidents.

How did you test it?

  • Verified the CLI flag shows up in help output: ./cadence domain failover --help
  • Ran make bins - everything compiles
  • Ran make lint - passes
  • Ran go test ./tools/cli/... - all CLI tests pass
  • Ran go test ./common/domain/... - all domain handler tests pass
  • Manually checked that the flag accepts strings

The flow from CLI to database is tested through the existing test suite. The reason goes: CLI flag → FailoverDomainRequest → UpdateDomainRequest → FailoverEvent → gets stored in domain data.

Potential risks
Pretty low risk overall since it's backward compatible. The reason field is optional in all the types so old clients can still talk to new servers and vice versa. If you don't provide a reason it just stays empty.

Main thing to watch would be if someone passes in a really long reason string - it goes into domain data which is stored as a map, so there might be size limits there. But that's more of an edge case than a real risk.

Also if someone is relying on the exact format of the FailoverEvent JSON in domain data, this adds a new field. But since it's optional and has the omitempty tag, it shouldn't break anything.

Release notes

Added optional --reason flag to domain failover command. This lets you specify why a failover is happening and the reason gets stored in the failover history. No migration or schema changes needed.

Documentation Changes
hould probably update the CLI docs to mention the new --reason flag and show some examples of how to use it. The flag shows up in --help but would be good to have it in the actual documentation too.

Signed-off-by: Diana Zawadzki <[email protected]>
Signed-off-by: Diana Zawadzki <[email protected]>
- Add --reason flag to 'cadence domain failover' command with default value 'default maintenance'
- Add --reason flag to 'cadence admin cluster failover start' command
- Extend FailoverParams and FailoverActivityParams to propagate reason through workflow
- Update FailoverActivity to use FailoverDomain API when reason is provided for proper tracking
- Add Reason field to FailoverDomainRequest type definition
- Maintain backward compatibility by falling back to UpdateDomain when no reason provided

This allows operators to provide context for failover operations which will be
stored in the failover history for better operational transparency and debugging.

Signed-off-by: Diana Zawadzki <[email protected]>
- Add FailoverReason field to UpdateDomainRequest type
- Update ToUpdateDomainRequest mapper to include reason field
- Add Reason field to FailoverEvent struct for history storage
- Update NewFailoverEvent to accept and store reason parameter
- Pass reason through handleFailoverRequest to failover history
- Update admin failover tests to expect reason in workflow input

This completes the end-to-end flow for tracking failover reasons from CLI
through to database storage in the failover history.

Signed-off-by: Diana Zawadzki <[email protected]>
Regenerate .gen/go/shared/shared.go after IDL submodule update to include
the reason field in FailoverDomainRequest from the updated thrift definition.

Signed-off-by: Diana Zawadzki <[email protected]>
- Remove reason field from admin cluster failover workflow (as per reviewer suggestion)
- Remove Reason from FailoverParams and FailoverActivityParams
- Revert FailoverActivity to use UpdateDomain (original implementation)
- Remove --reason flag from admin cluster failover start command
- Change --reason default value from "default maintenance" to empty string
- Update tests to remove reason from workflow input expectations

Only the domain failover command now has the --reason flag, allowing users
to optionally provide context without enforcing a default.

Signed-off-by: Diana Zawadzki <[email protected]>
Copy link
Member

@timl3136 timl3136 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, maybe we can make the pr title more descriptive.

@zawadzkidiana zawadzkidiana changed the title feat: Reason update feat: Add Reason Variable to Failovers Nov 18, 2025
@zawadzkidiana
Copy link
Contributor Author

Re-running 1 failed Golang integratiuon test with sqlite (pull_request); all other tests pass.

@zawadzkidiana
Copy link
Contributor Author

Github outage is likely cause of sqlite integration error; will re-run test again after problem is resolved.

@zawadzkidiana zawadzkidiana self-assigned this Nov 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants