---
date: 2025-11-17
title: Introducing RamenDR Starter Kit
summary: A new pattern illustrating Regional DR for virtualization workloads running on OpenShift Data Foundation
author: Martin Jackson
blog_tags:
  - patterns
  - announce
---
:toc:
:imagesdir: /images

We are excited to announce that the link:https://validatedpatterns.io/patterns/ramendr-starter-kit/[**validatedpatterns-sandbox/ramendr-starter-kit**] repository is now available and has reached the Sandbox tier of Validated Patterns.

== The Pattern

This Validated Pattern draws on previous work that models Regional Disaster Recovery, adds Virtualization to
the managed clusters, starts virtual machines, and can fail them over and back between the managed clusters.

The pattern ensures that all of the prerequisites are set up correctly and in order, and that steps such as
the SSL CA certificate copying required for both the Ceph replication and the OADP/Velero replication
work correctly.
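
For readers who have not set up ODF Regional DR by hand, the sketch below shows roughly what a manual version
of that certificate exchange can look like: each managed cluster's ingress CA is extracted and the combined
bundle is trusted on both sides, so that the S3 (Ramen metadata) and OADP/Velero endpoints accept each other's
certificates. The `cluster1` and `cluster2` context names are placeholders; the pattern's own scripts do this
more defensively.

[source,bash]
----
# Extract the ingress CA bundle from each managed cluster
# (context names are illustrative and assume a merged kubeconfig)
oc --context cluster1 get configmap default-ingress-cert \
  -n openshift-config-managed -o jsonpath='{.data.ca-bundle\.crt}' > cluster1-ca.crt
oc --context cluster2 get configmap default-ingress-cert \
  -n openshift-config-managed -o jsonpath='{.data.ca-bundle\.crt}' > cluster2-ca.crt

# Combine the CAs and trust the bundle on both clusters
cat cluster1-ca.crt cluster2-ca.crt > combined-ca.crt
for ctx in cluster1 cluster2; do
  oc --context "$ctx" create configmap user-ca-bundle \
    -n openshift-config --from-file=ca-bundle.crt=combined-ca.crt \
    --dry-run=client -o yaml | oc --context "$ctx" apply -f -
  oc --context "$ctx" patch proxy cluster --type merge \
    -p '{"spec":{"trustedCA":{"name":"user-ca-bundle"}}}'
done
----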

The user is in control of when the failover happens; the pattern provides a script to perform the explicit
failover required for Ramen Regional DR of a discovered application.
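
Stripped of the error handling and reporting that the generated script provides, a Ramen Regional DR failover
of a discovered application essentially comes down to patching the DRPlacementControl on the hub. A minimal
sketch, with illustrative resource names:

[source,bash]
----
# Illustrative names; the pattern's script derives these from the deployed resources
DRPC_NAME=vm-workload-drpc
DRPC_NAMESPACE=openshift-dr-ops
FAILOVER_CLUSTER=cluster2   # the surviving managed cluster

# Trigger the failover by setting the DRPlacementControl action on the hub
oc patch drpc "$DRPC_NAME" -n "$DRPC_NAMESPACE" --type merge \
  -p "{\"spec\":{\"action\":\"Failover\",\"failoverCluster\":\"$FAILOVER_CLUSTER\"}}"

# Wait for Ramen to report that the failover has completed
oc wait drpc "$DRPC_NAME" -n "$DRPC_NAMESPACE" \
  --for=jsonpath='{.status.phase}'=FailedOver --timeout=30m
----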

== Why Does DR Matter?

In a perfect world, every application would have its own knowledge of where it is available and would shard and
replicate its own data. But many applications were built without these concepts in mind, and even if a company
wanted to and could afford to rewrite every application, it could not rewrite and redeploy them all at once.

Thus, users benefit from being able to rely on technology products and solutions to enable a regional disaster
recovery capability when the application does not support it natively.

In several industries, the ability to recover a workload in the event of a regional disaster is considered a
requirement for applications that the user deems critical enough to need DR support, but that cannot provide
it natively.

== Learnings from Developing the Pattern: On the use of AI to generate scripts

This pattern is also noteworthy in that all of its major shell scripts were written by
Cursor. This was a major learning experience, both in the capabilities of modern AI coding tools and in some
of their limitations.

=== The Good

* Error handling and visual output are better than the shell scripts (or Ansible code) I would have written if
I had written all of this from scratch.
* The "inner loop" of development felt a lot faster using the generated code than if I had written it all from
scratch. The value in this pattern is in the use of the components together, not in finding new and novel
ways to retrieve certificate material from a running OpenShift cluster.

=== The Bad

* Even when the context "knew" it was working on OpenShift and Hive, it used different mechanisms to retrieve
kubeconfig files for managed clusters. I had to remind it to use a known-good mechanism, which had worked for
downloading kubeconfigs to the user workstation (see the sketch after this list).
* Several of these scripts are bash scripts wrapped in Kubernetes Jobs or CronJobs. The generator had some problems
with using local variables in places where it could not, and with using shell here documents in places where that
was not allowed in the YAML. Eventually I set the context that we were better off using `.Files.Get` calls and
externalizing the scripts from the jobs altogether.
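
One reliable way to retrieve those kubeconfigs with Hive is to pull the admin kubeconfig secret referenced by
each ClusterDeployment; the sketch below shows what such a mechanism looks like, with illustrative cluster and
namespace names.

[source,bash]
----
# Illustrative values; in the pattern these come from the values files
CLUSTER_NAME=managed-cluster-1
CLUSTER_NS=managed-cluster-1

# Hive records the admin kubeconfig secret name on the ClusterDeployment
SECRET_NAME=$(oc get clusterdeployment "$CLUSTER_NAME" -n "$CLUSTER_NS" \
  -o jsonpath='{.spec.clusterMetadata.adminKubeconfigSecretRef.name}')

# Extract and decode the kubeconfig for use against the managed cluster
oc get secret "$SECRET_NAME" -n "$CLUSTER_NS" \
  -o jsonpath='{.data.kubeconfig}' | base64 -d > "${CLUSTER_NAME}.kubeconfig"

oc --kubeconfig "${CLUSTER_NAME}.kubeconfig" get nodes
----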

=== The Ugly

* I am uncomfortable with the level of duplication in the code. Time will tell whether some of these scripts will
become problematic to maintain. A more rigorous analysis might find several opportunities to refactor code.
* The sheer volume of code makes it a bit daunting to look at. All of the major scripts in the pattern are over 150
lines long, and the longest (as of this publication) is over 1300 lines long.
* Some of the choices of technique and dependency loading were a bit too generic. We have images for Validated
Patterns that provide things like a Python interpreter with access to the YAML module, the AWS CLI, and other things
that turned out to be useful. I left in the Cursor-generated frameworks for downloading things like the AWS CLI,
because they correctly detect that those dependencies are already installed and may prove beneficial if we move to
different images.

== DR Terminology - What are we talking about?

**High Availability (“HA”)** includes all of the characteristics, qualities, and workflows of a system that prevent
unavailability events for workloads. This is a very broad category, and includes things like redundancy built into
individual disks, such that failure of a single drive does not result in an outage to the workload. Load balancing,
redundant power supplies, and running a workload across multiple fault domains are some of the techniques that belong
to HA, because they keep the workload from becoming unavailable in the first place. HA is usually completely
automatic, in that it does not require a real-time human in the loop.

**Disaster Recovery (“DR”)** includes the characteristics, qualities, and workflows of a system to recover from an
outage event when there has been data loss. DR events often include things that are recognized as major
environmental disasters (such as weather events like hurricanes, tornadoes, and fires), or other large-scale problems
that cause widespread devastation or disruption to a location where workloads run, such that critical personnel might
also be affected (i.e. unavailable because they are dead or disabled), and questions of how decisions will be made
without key decision makers are also considered. (This is often included under the heading of “Business Continuity,”
which is closely related to DR.)

There are two critical differences between HA and DR: the first is the expectation of human decision-making in the
loop, and the other is the data loss aspect. That is, in a DR event we know we have lost data; we are working on how
much is acceptable to lose and how quickly we can restore workloads. This is what makes DR fundamentally different
from HA; but some organizations do not really see or enforce this distinction, and that leads to a lot of confusion.
Some vendors also do not strongly make this distinction, which does not discourage that confusion.

DR policies can be driven by external regulatory or legal requirements, or by an organization’s internal understanding
of what such external legal and regulatory requirements mean. That is to say, the law may not specifically require a
particular level of DR, but the organization interprets the law to mean that is what it needs to do to be compliant
with the law or regulation. The Sarbanes-Oxley Act (“SOX”) in the US was adopted after the Enron and Worldcom financial
scandals of the early 2000s, and includes a number of requirements for accurate financial reporting, which many
organizations have used to justify and fund substantial BC/DR programs.

**Business Continuity (“BC”, usually used together with DR as “BCDR” or “BC/DR”)** refers primarily to the people
side of recovery from disasters. Large organizations will have teams that focus on BC/DR and use that term in the team
title or name. Such teams will be responsible for making sure that engineering and application groups are compliant
with the organization’s BC/DR policies. This can involve scheduling and running BC/DR “drills” and actual live testing
of BC/DR technologies.

**Recovery Time Objective (“RTO”)** is the amount of time it takes to restore a failed workload to service. This is
NOT the amount of data that is tolerable to lose - that is defined by the companion RPO.

**Recovery Point Objective (“RPO”)** is the amount of data a workload can stand to lose. One confusing aspect of RPO
is that it can be defined as a time interval (as opposed to, say, a number of transactions). But an RPO of “5 minutes”
should be read as “we want to lose no more than 5 minutes’ worth of data.”

Lots of people want an RPO/RTO of 0/0, often without understanding what it takes to implement that. It can be
fantastically expensive, even for the world’s largest and best-funded organizations.

== Special Thanks

This pattern was an especially challenging one to design and complete, because of the number of elements in it
and the timing issues inherent in eventual-consistency models. Therefore, special thanks are due to the following
people, without whom this pattern would not exist:

* The authors of the original link:https://github.com/validatedpatterns/regional-resiliency-pattern[regional-resiliency-pattern], which provided the foundation for the ODF and RamenDR components and for building the managed clusters via Hive
* Aswin Suryanarayanan, who helped immensely with some late challenges with Submariner
* Annette Clewett, who took the time to thoroughly explain all of RamenDR's dependencies and how to orchestrate them correctly; without her, this pattern would not exist