The latest version of Azure Stack HCI includes the ability to create stretched clusters, which extend an Azure Stack HCI cluster across two different locations (rooms, buildings or even different cities). This disaster recovery solution provides storage replication (synchronous or asynchronous) and includes encryption, local site resilience and automatic failover of virtual machines. This article explores the possible architectures and features of the solution.
To further improve the resilience built into the Azure Stack HCI solution, it is possible to implement a cluster consisting of two groups of nodes, called a "stretched cluster". Each group is located in a different site and must contain a minimum of two nodes. A stretched cluster can therefore consist of four to sixteen physical nodes (sixteen being the maximum number of nodes supported by an Azure Stack HCI cluster), all of which must satisfy the standard hardware requirements for HCI solutions.
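The sizing rules above can be expressed as a small check. This is only a sketch in Python: the function name `validate_stretched_layout` and the constants are hypothetical and not part of any Azure tooling; the limits themselves are the ones stated in the text.

```python
MIN_NODES_PER_SITE = 2   # from the text: each site needs at least two nodes
MAX_CLUSTER_NODES = 16   # from the text: maximum nodes in an Azure Stack HCI cluster


def validate_stretched_layout(site_a_nodes: int, site_b_nodes: int) -> list[str]:
    """Return a list of problems; an empty list means the layout fits the stated limits.

    Hypothetical helper for illustration only.
    """
    problems = []
    # Check the per-site minimum for each of the two sites.
    for name, count in (("site A", site_a_nodes), ("site B", site_b_nodes)):
        if count < MIN_NODES_PER_SITE:
            problems.append(
                f"{name} has {count} node(s); at least {MIN_NODES_PER_SITE} are required"
            )
    # Check the cluster-wide maximum across both sites.
    total = site_a_nodes + site_b_nodes
    if total > MAX_CLUSTER_NODES:
        problems.append(
            f"cluster has {total} nodes; at most {MAX_CLUSTER_NODES} are supported"
        )
    return problems
```

For example, `validate_stretched_layout(2, 2)` returns an empty list, while a layout with a single node in one site reports one problem.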
In detail, the architecture relies on the following components and features:
- Azure Stack HCI. The minimum required version is 20H2, released in December 2020 and delivered as an Azure hybrid service. It is a hyper-converged infrastructure (HCI) in which several hardware components are replaced by software, combining the compute, storage and network layers in a single solution.
- Storage Replica. The technology included in Windows Server that allows replication of volumes between servers or between clusters for disaster recovery purposes.
- Live Migration. The Hyper-V feature that allows you to move running virtual machines (VMs) from one Hyper-V host to another without downtime. This feature is useful for managing expected or scheduled downtime.
- Witness resource. A witness is a mandatory component within Azure Stack HCI clusters. To implement it, you can choose an Azure Cloud Witness or a File Share Witness. Azure Cloud Witness is the recommended choice for Azure Stack HCI stretched clusters as long as all nodes have a reliable internet connection.
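The witness recommendation in the last item can be written as a simple decision rule. This is a minimal sketch in Python: the function `choose_witness` and its input are hypothetical, and how "reliable internet connection" is measured for each node is left as an assumption outside the code.

```python
def choose_witness(node_has_reliable_internet: list[bool]) -> str:
    """Pick a witness type following the recommendation in the text.

    Hypothetical helper: each list entry says whether one node has a
    reliable internet connection (how that is assessed is an assumption).
    """
    if not node_has_reliable_internet:
        raise ValueError("at least one node is required")
    # Cloud Witness is recommended only when every node can reach Azure reliably.
    if all(node_has_reliable_internet):
        return "Azure Cloud Witness"
    # Otherwise fall back to the other option named in the text.
    return "File Share Witness"
```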
An Azure Stack HCI stretched cluster is based on Storage Replica, which can replicate data synchronously or asynchronously:
- With synchronous replication, data is mirrored between sites over a low-latency network. Volumes are crash-consistent, which ensures zero data loss at the file system level during a failure event. Synchronous replication in stretched clusters requires a round-trip network latency of no more than 5 ms between the two groups of nodes in the replicated sites. Depending on the characteristics of the physical network, this constraint generally translates into a distance of approximately 30-45 km between sites. With this configuration, if a problem affects the availability of a site, the cluster can automatically transfer workloads to the nodes of the unaffected site to minimize potential downtime.
- Asynchronous replication mirrors data between sites over network links with higher latencies, but there is no guarantee that both sites have identical copies of the data when a failure event occurs. With asynchronous replication, the target volumes must be brought online manually on the other site following a failover.
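The trade-off between the two replication modes can be summarized in code. The sketch below, in Python, assumes a measured round-trip time is already available; the function `replication_plan` and the keys of the dictionary it returns are hypothetical, and only the 5 ms threshold and the listed consequences come from the text. Asynchronous replication can of course also be chosen on a low-latency link; the function simply reports when synchronous replication is possible.

```python
SYNC_MAX_RTT_MS = 5.0  # from the text: synchronous replication needs <= 5 ms round trip


def replication_plan(rtt_ms: float) -> dict:
    """Describe the replication mode a given inter-site latency allows.

    Hypothetical helper for illustration; rtt_ms is the measured
    round-trip time between the two sites, in milliseconds.
    """
    if rtt_ms < 0:
        raise ValueError("round-trip time cannot be negative")
    synchronous = rtt_ms <= SYNC_MAX_RTT_MS
    return {
        "mode": "synchronous" if synchronous else "asynchronous",
        # Synchronous replication ensures zero data loss at the file system level.
        "zero_data_loss": synchronous,
        # With asynchronous replication, target volumes are brought online manually.
        "manual_volume_online": not synchronous,
    }
```

For instance, a link measured at 3 ms allows the synchronous mode, while a link at 20 ms falls into the asynchronous case with manual volume activation after failover.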
There are two types of stretched clusters: active-active and active-passive.
An active site is a site that has resources and hosts roles and workloads to which clients can connect. A passive site does not serve roles or workloads to clients, but waits for a failover from the active site for disaster recovery purposes.
Replication in an active-passive stretched cluster has a preferred direction, while replication in an active-active stretched cluster can take place bi-directionally from both sites.
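The difference in replication direction between the two cluster types can be modeled with a few lines of Python. This is an illustrative sketch only: the function `replication_directions` and the site names `Site1` and `Site2` are hypothetical placeholders.

```python
def replication_directions(
    cluster_type: str, primary: str = "Site1", secondary: str = "Site2"
) -> list[tuple[str, str]]:
    """Return (source, destination) pairs for the given stretched cluster type.

    Hypothetical helper; site names are placeholders.
    """
    if cluster_type == "active-passive":
        # One preferred direction: from the active site to the passive site.
        return [(primary, secondary)]
    if cluster_type == "active-active":
        # Both sites host workloads, so replication can run in both directions.
        return [(primary, secondary), (secondary, primary)]
    raise ValueError(f"unknown cluster type: {cluster_type!r}")
```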
Azure Stack HCI and Storage Replica also support data deduplication, which increases usable storage capacity by identifying duplicate portions of files and storing them only once. Starting with Windows Server 2019, deduplication is available on volumes formatted with Resilient File System (ReFS), which is the recommended file system for Azure Stack HCI. In Azure Stack HCI stretched clusters, it is recommended to enable Data Deduplication only on the nodes of the source cluster, and not on the target nodes, which always receive deduplicated copies of each volume.
Conclusions
The ability to extend Azure Stack HCI clusters across two different locations allows you to implement disaster recovery architectures in a way that is fully integrated into the solution, without the need to adopt third-party products. This capability, combined with the ability to connect Azure Stack HCI with Azure services to achieve a hybrid hyper-converged system, makes it a complete, stable and reliable solution, able to meet the most advanced needs in hosting business-critical workloads.