---
date: 2025-11-17
title: Introducing RamenDR Starter Kit
summary: A new pattern illustrating Regional DR for virtualization workloads running on OpenShift Data Foundation
author: Martin Jackson
blog_tags:
  - patterns
  - announce
---
:toc:
:imagesdir: /images

We are excited to announce that the link:https://validatedpatterns.io/patterns/ramendr-starter-kit/[**validatedpatterns-sandbox/ramendr-starter-kit**] repository is now available and has reached the Sandbox tier of Validated Patterns.

== The Pattern

This Validated Pattern draws on previous work that models Regional Disaster Recovery, adds Virtualization to
the managed clusters, starts virtual machines, and can fail them over and back between the managed clusters.

The pattern ensures that all of the prerequisites are set up correctly and in order, and that steps such as
the SSL CA certificate copying required for both the Ceph replication and the OADP/Velero replication
work correctly.
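
For readers who have not set up ODF Regional DR by hand, the sketch below shows roughly what a manual version
of that certificate exchange can look like: each managed cluster's ingress CA is extracted and the combined
bundle is trusted on both sides, so that the S3 (Ramen metadata) and OADP/Velero endpoints accept each other's
certificates. The `cluster1` and `cluster2` context names are placeholders; the pattern's own scripts do this
more defensively.

[source,bash]
----
# Extract the ingress CA bundle from each managed cluster
# (context names are illustrative and assume a merged kubeconfig)
oc --context cluster1 get configmap default-ingress-cert \
  -n openshift-config-managed -o jsonpath='{.data.ca-bundle\.crt}' > cluster1-ca.crt
oc --context cluster2 get configmap default-ingress-cert \
  -n openshift-config-managed -o jsonpath='{.data.ca-bundle\.crt}' > cluster2-ca.crt

# Combine the CAs and trust the bundle on both clusters
cat cluster1-ca.crt cluster2-ca.crt > combined-ca.crt
for ctx in cluster1 cluster2; do
  oc --context "$ctx" create configmap user-ca-bundle \
    -n openshift-config --from-file=ca-bundle.crt=combined-ca.crt \
    --dry-run=client -o yaml | oc --context "$ctx" apply -f -
  oc --context "$ctx" patch proxy cluster --type merge \
    -p '{"spec":{"trustedCA":{"name":"user-ca-bundle"}}}'
done
----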

The user is in control of when the failover happens; the pattern provides a script to perform the explicit
failover required for Ramen Regional DR of a discovered application.
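
Stripped of the error handling and reporting that the generated script provides, a Ramen Regional DR failover
of a discovered application essentially comes down to patching the DRPlacementControl on the hub. A minimal
sketch, with illustrative resource names:

[source,bash]
----
# Illustrative names; the pattern's script derives these from the deployed resources
DRPC_NAME=vm-workload-drpc
DRPC_NAMESPACE=openshift-dr-ops
FAILOVER_CLUSTER=cluster2   # the surviving managed cluster

# Trigger the failover by setting the DRPlacementControl action on the hub
oc patch drpc "$DRPC_NAME" -n "$DRPC_NAMESPACE" --type merge \
  -p "{\"spec\":{\"action\":\"Failover\",\"failoverCluster\":\"$FAILOVER_CLUSTER\"}}"

# Wait for Ramen to report that the failover has completed
oc wait drpc "$DRPC_NAME" -n "$DRPC_NAMESPACE" \
  --for=jsonpath='{.status.phase}'=FailedOver --timeout=30m
----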

== Why Does DR Matter?

In a perfect world, every application would have its own knowledge of where it is available and would shard and
replicate its own data. But many applications were built without these concepts in mind, and even if a company
wanted to and could afford to rewrite every application, it could not rewrite and redeploy them all at once.

Thus, users benefit from being able to rely on technology products and solutions to enable a regional disaster
recovery capability when the application does not support it natively.

In several industries, the ability to recover a workload in the event of a regional disaster is considered a
requirement for applications that the user deems critical enough to need DR support, but that cannot provide
it natively.

== Learnings from Developing the Pattern: On the use of AI to generate scripts

This pattern is also noteworthy in that all of its major shell scripts were written by
Cursor. This was a major learning experience, both in the capabilities of modern AI coding tools and in some
of their limitations.

=== The Good

* Error handling and visual output are better than the shell scripts (or Ansible code) I would have written if
I had written all of this from scratch.
* The "inner loop" of development felt a lot faster using the generated code than if I had written it all from
scratch. The value in this pattern is in the use of the components together, not in finding new and novel
ways to retrieve certificate material from a running OpenShift cluster.

=== The Bad

* Even when the context "knew" it was working on OpenShift and Hive, it used different mechanisms to retrieve
kubeconfig files for managed clusters. I had to remind it to use a known-good mechanism, which had worked for
downloading kubeconfigs to the user workstation (see the sketch after this list).
* Several of these scripts are bash scripts wrapped in Kubernetes Jobs or CronJobs. The generator had some problems
with using local variables in places where it could not, and with using shell here documents in places where that
was not allowed in the YAML. Eventually I set the context that we were better off using `.Files.Get` calls and
externalizing the scripts from the jobs altogether.
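
One reliable way to retrieve those kubeconfigs with Hive is to pull the admin kubeconfig secret referenced by
each ClusterDeployment; the sketch below shows what such a mechanism looks like, with illustrative cluster and
namespace names.

[source,bash]
----
# Illustrative values; in the pattern these come from the values files
CLUSTER_NAME=managed-cluster-1
CLUSTER_NS=managed-cluster-1

# Hive records the admin kubeconfig secret name on the ClusterDeployment
SECRET_NAME=$(oc get clusterdeployment "$CLUSTER_NAME" -n "$CLUSTER_NS" \
  -o jsonpath='{.spec.clusterMetadata.adminKubeconfigSecretRef.name}')

# Extract and decode the kubeconfig for use against the managed cluster
oc get secret "$SECRET_NAME" -n "$CLUSTER_NS" \
  -o jsonpath='{.data.kubeconfig}' | base64 -d > "${CLUSTER_NAME}.kubeconfig"

oc --kubeconfig "${CLUSTER_NAME}.kubeconfig" get nodes
----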

=== The Ugly

* I am uncomfortable with the level of duplication in the code. Time will tell whether some of these scripts will
become problematic to maintain. A more rigorous analysis might find several opportunities to refactor code.
* The sheer volume of code makes it a bit daunting to look at. All of the major scripts in the pattern are over 150
lines long, and the longest (as of this publication) is over 1300 lines long.
* Some of the choices of technique and dependency loading were a bit too generic. We have images for Validated
Patterns that provide things like a Python interpreter with access to the YAML module, the AWS CLI, and other things
that turned out to be useful. I left in the Cursor-generated frameworks for downloading things like the AWS CLI,
because they correctly detect that those dependencies are already installed and may prove beneficial if we move to
different images.

== DR Terminology - What are we talking about?

**High Availability (“HA”)** includes all of the characteristics, qualities, and workflows of a system that prevent
unavailability events for workloads. This is a very broad category, and includes things like redundancy built into
individual disks, such that failure of a single drive does not result in an outage to the workload. Load balancing,
redundant power supplies, and running a workload across multiple fault domains are some of the techniques that belong
to HA, because they keep the workload from becoming unavailable in the first place. HA is usually completely
automatic, in that it does not require a real-time human in the loop.

**Disaster Recovery (“DR”)** includes the characteristics, qualities, and workflows of a system to recover from an
outage event when there has been data loss. DR events often include things that are recognized as major
environmental disasters (such as weather events like hurricanes, tornadoes, and fires), or other large-scale problems
that cause widespread devastation or disruption to a location where workloads run, such that critical personnel might
also be affected (i.e. unavailable because they are dead or disabled), and questions of how decisions will be made
without key decision makers are also considered. (This is often included under the heading of “Business Continuity,”
which is closely related to DR.)

There are two critical differences between HA and DR: the first is the expectation of human decision-making in the
loop, and the other is the data loss aspect. That is, in a DR event we know we have lost data; we are working on how
much is acceptable to lose and how quickly we can restore workloads. This is what makes DR fundamentally different
from HA; but some organizations do not really see or enforce this distinction, and that leads to a lot of confusion.
Some vendors also do not strongly make this distinction, which does not discourage that confusion.

DR policies can be driven by external regulatory or legal requirements, or by an organization’s internal understanding
of what such external legal and regulatory requirements mean. That is to say, the law may not specifically require a
particular level of DR, but the organization interprets the law to mean that is what it needs to do to be compliant
with the law or regulation. The Sarbanes-Oxley Act (“SOX”) in the US was adopted after the Enron and Worldcom financial
scandals of the early 2000s, and includes a number of requirements for accurate financial reporting, which many
organizations have used to justify and fund substantial BC/DR programs.

**Business Continuity (“BC”, usually used together with DR as “BCDR” or “BC/DR”)** refers primarily to the people
side of recovery from disasters. Large organizations will have teams that focus on BC/DR and use that term in the team
title or name. Such teams will be responsible for making sure that engineering and application groups are compliant
with the organization’s BC/DR policies. This can involve scheduling and running BC/DR “drills” and actual live testing
of BC/DR technologies.

**Recovery Time Objective (“RTO”)** is the amount of time it takes to restore a failed workload to service. This is
NOT the amount of data that is tolerable to lose - that is defined by the companion RPO.

**Recovery Point Objective (“RPO”)** is the amount of data a workload can stand to lose. One confusing aspect of RPO
is that it can be defined as a time interval (as opposed to, say, a number of transactions). But an RPO of “5 minutes”
should be read as “we want to lose no more than 5 minutes’ worth of data.”

Lots of people want an RPO/RTO of 0/0, often without understanding what it takes to implement that. It can be
fantastically expensive, even for the world’s largest and best-funded organizations.

== Special Thanks

This pattern was an especially challenging one to design and complete, because of the number of elements in it
and the timing issues inherent in eventual-consistency models. Therefore, special thanks are due to the following
people, without whom this pattern would not exist:

* The authors of the original link:https://github.com/validatedpatterns/regional-resiliency-pattern[regional-resiliency-pattern], which provided the foundation for the ODF and RamenDR components and for building the managed clusters via Hive
* Aswin Suryanarayanan, who helped immensely with some late challenges with Submariner
* Annette Clewett, who took the time to thoroughly explain all of RamenDR's dependencies and how to orchestrate them correctly; without her, this pattern would not exist