
Commit c43334d

Merge pull request #619 from mhjacks/add_ramendr_starter_kit
Initial docs for RamenDR Starter Kit
2 parents c87ee3f + 121d2c2 commit c43334d

16 files changed: +561 additions, 0 deletions
Lines changed: 127 additions & 0 deletions
@@ -0,0 +1,127 @@
---
date: 2025-11-17
title: Introducing RamenDR Starter Kit
summary: A new pattern to illustrate Regional DR for virtualization workloads running on OpenShift Data Foundation
author: Martin Jackson
blog_tags:
- patterns
- announce
---

:toc:
:imagesdir: /images

We are excited to announce that the link:https://validatedpatterns.io/patterns/ramendr-starter-kit/[**validatedpatterns-sandbox/ramendr-starter-kit**] repository is now available and has reached the Sandbox tier of Validated Patterns.

== The Pattern

This Validated Pattern draws on previous work that models Regional Disaster Recovery, adds Virtualization to the managed clusters, starts virtual machines, and can fail them over and back between the managed clusters.

The pattern ensures that all of the prerequisites are set up correctly and in order, including steps like the SSL CA certificate copying that is necessary for both the Ceph replication and the OADP/Velero replication to work correctly.
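
As an illustration of the kind of step the pattern automates, here is a minimal sketch of exchanging ingress CA bundles between two managed clusters. The context names are illustrative, the exact trust-distribution step varies by setup, and the pattern's own scripts handle the real sequencing.

[source,shell]
----
# Minimal sketch; context names are illustrative, not the pattern's actual values.
# Pull each managed cluster's ingress CA bundle...
oc --context cluster1 get cm default-ingress-cert -n openshift-config-managed \
  -o jsonpath='{.data.ca-bundle\.crt}' > cluster1-ca.crt
oc --context cluster2 get cm default-ingress-cert -n openshift-config-managed \
  -o jsonpath='{.data.ca-bundle\.crt}' > cluster2-ca.crt
cat cluster1-ca.crt cluster2-ca.crt > combined-ca.crt

# ...then distribute the combined bundle as trusted CA material on each peer
# (shown for cluster1; repeat for cluster2 and the hub).
oc --context cluster1 create configmap user-ca-bundle -n openshift-config \
  --from-file=ca-bundle.crt=combined-ca.crt --dry-run=client -o yaml \
  | oc --context cluster1 apply -f -
----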

The user is in control of when the failover happens; the pattern provides a script to do the explicit failover required for Ramen Regional DR of a discovered application.
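
Under the hood, a Ramen failover is requested by setting the action on the application's DRPlacementControl resource on the hub; the pattern's failover script wraps this step. A sketch with placeholder resource, namespace, and cluster names:

[source,shell]
----
# Placeholder names; the pattern's failover script supplies the real ones.
oc patch drpc vm-workload-drpc -n vm-workload-ns --type merge \
  -p '{"spec": {"action": "Failover", "failoverCluster": "managed-cluster-2"}}'
----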

== Why Does DR Matter?

In a perfect world, every application would have its own knowledge of where it is available and would shard and replicate its own data. But many applications were built without these concepts in mind, and even if a company wanted to and could afford to rewrite every application, it could not rewrite and redeploy them all at once.

Thus, users benefit from being able to rely on technology products and solutions to enable a regional disaster recovery capability when the application does not support it natively.

The ability to recover a workload in the event of a regional disaster is considered a requirement in several industries for applications that the user deems critical enough to require DR support, but which cannot provide it natively.

== Learnings from Developing the Pattern: On the Use of AI to Generate Scripts

This pattern is also noteworthy in that all of the major shell scripts in the pattern were written by Cursor. This was a major learning experience, both in the capabilities of modern AI coding tools and in some of their limitations.

=== The Good

* Error handling and visual output are better than the shell scripts (or Ansible code) I would have written if I had written all of this from scratch.
* The "inner loop" of development felt a lot faster using the generated code than if I had written it all from scratch. The value in this pattern is in the use of the components together, not in finding new and novel ways to retrieve certificate material from a running OpenShift cluster.

=== The Bad

* Even when the context "knew" it was working on OpenShift and Hive, it used different mechanisms to retrieve kubeconfig files for managed clusters. I had to remind it to use a known-good mechanism, which had already worked for downloading kubeconfigs to the user workstation (see the sketch after this list).
* Several of these scripts are bash scripts wrapped in Kubernetes jobs or cronjobs. The generator had some problems with using local variables in places it could not, and with using shell here documents in places YAML does not allow them. Eventually I set the context that we were better off using `.Files.Get` calls and externalizing the scripts from the jobs altogether.
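
For reference, the known-good kubeconfig mechanism looks roughly like this, assuming Hive's convention of a `ClusterDeployment` living in a namespace matching the cluster name (the names here are illustrative):

[source,shell]
----
# The ClusterDeployment records which secret holds the admin kubeconfig.
SECRET=$(oc get clusterdeployment cluster1 -n cluster1 \
  -o jsonpath='{.spec.clusterMetadata.adminKubeconfigSecretRef.name}')
oc extract secret/"$SECRET" -n cluster1 --keys=kubeconfig --to=- > cluster1-kubeconfig
----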

=== The Ugly

* I am uncomfortable with the level of duplication in the code. Time will tell whether some of these scripts become problematic to maintain; a more rigorous analysis might find several opportunities to refactor.
* The sheer volume of code makes it a bit daunting to look at. All of the major scripts in the pattern are over 150 lines long, and the longest (as of this publication) is over 1,300 lines.
* Some of the choices of technique and dependency loading were a bit too generic. We have images for Validated Patterns that provide things like a Python interpreter with access to the YAML module, the AWS CLI, and other tools that turned out to be useful. I left in the Cursor-generated frameworks for downloading things like the AWS CLI, because they correctly detect that those dependencies are already installed, and they may prove beneficial if we move to different images.

== DR Terminology - What Are We Talking About?

**High Availability (“HA”)** includes all characteristics, qualities, and workflows of a system that prevent unavailability events for workloads. This is a very broad category, and includes things like redundancy built into individual disks, such that failure of a single drive does not result in an outage for the workload. Load balancing, redundant power supplies, and running a workload across multiple fault domains are some of the techniques that belong to HA, because they keep the workload from becoming unavailable in the first place. HA is usually completely automatic, in that it does not require a real-time human in the loop.

**Disaster Recovery (“DR”)** includes the characteristics, qualities, and workflows of a system for recovering from an outage event when there has been data loss. DR events often include recognized major environmental disasters (weather events such as hurricanes, tornadoes, and fires) or other large-scale problems that cause widespread devastation or disruption to a location where workloads run, such that critical personnel might also be affected (that is, unavailable because they are dead or disabled); questions of how decisions will be made without key decision makers are also considered. (This is often included under the heading of “Business Continuity,” which is closely related to DR.)

There are two critical differences between HA and DR: the first is the expectation of human decision-making in the loop, and the other is the data loss aspect. That is, in a DR event we know we have lost data; we are working out how much is acceptable to lose and how quickly we can restore workloads. This is what makes DR fundamentally different from HA. But some organizations do not really see or enforce this distinction, which leads to a lot of confusion, and some vendors do not strongly make this distinction either, which does not discourage that confusion.

DR policies can be driven by external regulatory or legal requirements, or by an organization’s internal understanding of what such external legal and regulatory requirements mean. That is to say, the law may not specifically require a particular level of DR, but the organization interprets the law to mean that is what it needs to do to be compliant with the law or regulation. The Sarbanes-Oxley Act (“SOX”) in the US was adopted after the Enron and Worldcom financial scandals of the early 2000s, and includes a number of requirements for accurate financial reporting, which many organizations have used to justify and fund substantial BC/DR programs.

**Business Continuity (“BC,” but usually used together with DR as “BCDR” or “BC/DR”)** refers primarily to the people side of recovery from disasters. Large organizations will have teams that focus on BC/DR and use that term in the team title or name. Such teams will be responsible for making sure that engineering and application groups are compliant with the organization’s BC/DR policies. This can involve scheduling and running BC/DR “drills” and actual live testing of BC/DR technologies.

**Recovery Time Objective (“RTO”)** is the amount of time it takes to restore a failed workload to service. This is NOT the amount of data that is tolerable to lose - that is defined by the companion RPO.

**Recovery Point Objective (“RPO”)** is the amount of data a workload can stand to lose. One confusing aspect of RPO is that it can be defined as a time interval (as opposed to, say, a number of transactions), but an RPO of “5 minutes” should be read as “we want to lose no more than 5 minutes’ worth of data.”

Lots of people want a 0/0 RPO/RTO, often without understanding what it takes to implement that. It can be fantastically expensive, even for the world’s largest and best-funded organizations.

== Special Thanks

This pattern was an especially challenging one to design and complete, because of the number of elements in it and the timing issues inherent in eventual-consistency models. Therefore, special thanks are due to the following people, without whom this pattern would not exist:

* The authors of the original link:https://github.com/validatedpatterns/regional-resiliency-pattern[regional-resiliency-pattern], which provided the foundation for the ODF and RamenDR components, and building the managed clusters via Hive
* Aswin Suryanarayanan, who helped immensely with some late challenges with Submariner
* Annette Clewett, who took the time to thoroughly explain all of RamenDR's dependencies and how to orchestrate them all correctly; without her this pattern truly would not exist.

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
---
title: RamenDR Starter Kit
date: 2025-11-13
tier: sandbox
summary: This pattern demonstrates the use of Red Hat OpenShift Data Foundation for Regional Disaster Recovery of virtualization workloads
rh_products:
- Red Hat OpenShift Container Platform
- Red Hat OpenShift Virtualization
- Red Hat Enterprise Linux
- Red Hat OpenShift Data Foundation
- Red Hat OpenShift Data Foundation MultiCluster Orchestrator
- Red Hat OpenShift Data Foundation DR Hub Operator
- Red Hat Advanced Cluster Management
industries: []
aliases: /ramendr-starter-kit/
pattern_logo: ansible-edge.png
links:
  github: https://github.com/validatedpatterns-sandbox/ramendr-starter-kit/
  install: getting-started
  bugs: https://github.com/validatedpatterns-sandbox/ramendr-starter-kit/issues
  feedback: https://docs.google.com/forms/d/e/1FAIpQLScI76b6tD1WyPu2-d_9CCVDr3Fu5jYERthqLKJDUGwqBg7Vcg/viewform
  ci: ramendr-starter-kit
---

:toc:
:imagesdir: /images
:_content-type: ASSEMBLY
include::modules/comm-attributes.adoc[]

== RamenDR Regional Disaster Recovery with Virtualization Starter Kit

This pattern sets up three clusters as recommended for OpenShift Data Foundation Regional Disaster Recovery, as documented link:https://docs.redhat.com/en/documentation/red_hat_openshift_data_foundation/4.18/html-single/configuring_openshift_data_foundation_disaster_recovery_for_openshift_workloads/index[here].
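
Once the deployment settles, a quick way to confirm on the hub that the DR peering is in place is to check the DRPolicy the pattern creates. This is just an illustrative check, not the pattern's own validation logic:

[source,shell]
----
# A Validated DRPolicy means the managed clusters are peered for replication.
oc get drpolicy -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Validated")].status}{"\n"}{end}'
----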

Of additional interest, the workload this pattern protects and can fail over involves running virtual machines.

The setup process is relatively intricate; the goal of this pattern is to handle all the intricate parts and present a functional DR-capable starting point for virtual machine workloads. In particular, this pattern takes care to sequence installations and validate prerequisites for all of the core components of the Disaster Recovery system.

Additionally, this pattern must be customized to specify DNS basedomains for the managed clusters, which makes forking the pattern (which we generally recommend anyway, in case you want to make other customizations) effectively a requirement. The link:https://validatedpatterns-sandbox/patterns/getting-started[**Getting Started**] doc has details on what needs to be changed and how to commit and push those changes.
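
A minimal sketch of that fork-and-customize workflow, with a placeholder fork URL and values file name (the Getting Started doc lists the actual files and keys to edit):

[source,shell]
----
# Placeholders throughout; consult the Getting Started doc for the real
# file names and keys to edit.
git clone git@github.com:<your-org>/ramendr-starter-kit.git
cd ramendr-starter-kit
# Set the DNS basedomains for the managed clusters in the pattern's values files.
"$EDITOR" values-hub.yaml
git commit -am "Customize managed-cluster basedomains"
git push origin main
----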

=== Background

It would be ideal if all applications in the world understood availability concepts natively and had their own integrated regional failover strategies. However, many workloads do not, and users who need regional disaster recovery capabilities need to solve this problem for the applications that cannot solve it for themselves.

This pattern uses OpenShift Virtualization (the productization of KubeVirt) to provide the environment for the VM workloads.

==== Solution elements

==== Red Hat Technologies

* Red Hat OpenShift Container Platform (Kubernetes)
* Red Hat Advanced Cluster Management (RHACM)
* Red Hat OpenShift Data Foundation (ODF, including Multicluster Orchestrator)
* Submariner (VPN)
* Red Hat OpenShift GitOps (ArgoCD)
* OpenShift Virtualization (KubeVirt)
* Red Hat Enterprise Linux 9 (on the VMs)

==== Other technologies this pattern uses

* HashiCorp Vault (Community Edition)
* External Secrets Operator (Community Edition)

=== Architecture

.ramendr-architecture-diagram
image::/images/ramendr-starter-kit/ramendr-architecture.drawio.png[ramendr-starter-kit-architecture,title="RamenDR Starter Kit Architecture"]
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
---
title: Cluster sizing
weight: 50
aliases: /ramendr-starter-kit/cluster-sizing/
---

:toc:
:imagesdir: /images
:_content-type: ASSEMBLY

include::modules/comm-attributes.adoc[]
include::modules/ramendr-starter-kit/metadata-ramendr-starter-kit.adoc[]

The OpenShift hub cluster is made up of 3 control plane nodes and 3 worker nodes; the workers are standard compute nodes. For the node size we used **m5.4xlarge** instances on AWS.
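
For reference, you can confirm what that instance type provides; this query is informational only and assumes a configured AWS CLI:

[source,shell]
----
# m5.4xlarge provides 16 vCPUs and 64 GiB of memory.
aws ec2 describe-instance-types --instance-types m5.4xlarge \
  --query 'InstanceTypes[0].{vCPUs:VCpuInfo.DefaultVCpus,MemoryMiB:MemoryInfo.SizeInMiB}' \
  --output table
----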

This pattern has been tested only on AWS so far, because of the integration of both Hive and OpenShift Virtualization. We may publish a later revision that supports more hyperscalers.

include::modules/cluster-sizing-template.adoc[]
