vCenter Site Recovery Manager (SRM) 5.X – Part 7

Now we are going to discuss Recovery Plans, and how to test and perform a Failover and a Failback.

===================================================

Recovery Plan

Recovery Plans are created at the recovery site so that they remain accessible and can be run from there when there is a disaster at the protected site. A Recovery Plan is a series of configuration steps that have to be performed to Failover the protected virtual machines to the recovery site. Executing it moves the virtual machine workload that was running at the protected site to the recovery site. It can also be used to perform Planned Migrations.

Note :- A Recovery Plan should be associated with at least one Protection Group.

Creating a Recovery Plan

Once you have Protection Groups created, the next step is to create a Recovery Plan for them. The Recovery Plan should be created in the SRM instance at the recovery site, because in the event of a disaster the protected site may become inaccessible.

In other words, a recovery plan is like a runbook which is based on a protection group. The SRM recovery plan consists of the following:

  • List of protected VMs included in the protection group
  • Startup order for the VMs
  • Custom steps if applicable
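To make the runbook idea concrete, here is a minimal Python sketch of the three items listed above. It is only an illustration: the class, field, and VM names are hypothetical and are not part of any SRM API.

```python
from dataclasses import dataclass, field


@dataclass
class RecoveryPlan:
    """Hypothetical model of a recovery plan; not an SRM object."""
    name: str
    protection_groups: list[str]
    startup_order: list[str] = field(default_factory=list)  # VM names, first to last
    custom_steps: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # A plan has to be tied to at least one Protection Group.
        if not self.protection_groups:
            raise ValueError("a Recovery Plan needs at least one Protection Group")

    def runbook(self) -> list[str]:
        # Placing the custom steps after the power-on steps is an assumption
        # made for this sketch.
        return [f"Power on {vm}" for vm in self.startup_order] + self.custom_steps


plan = RecoveryPlan(
    name="RP-SiteA-to-SiteB",
    protection_groups=["PG-Tier1"],
    startup_order=["db01", "app01", "web01"],
    custom_steps=["Notify the on-call team"],
)
for line in plan.runbook():
    print(line)
```

Creating a plan with an empty `protection_groups` list raises a `ValueError`, which mirrors the requirement that a plan must be associated with a Protection Group.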

The following steps show you how to create a Recovery Plan:

1. Navigate to the vCenter Server’s inventory home and click on Site Recovery.

2. Click on Recovery Plans in the left pane and then click on Create Recovery Plan to bring up the Create Recovery Plan wizard.

3. In the Create Recovery Plan wizard, select the Recovery Site and click on Next to continue. If the Recovery Plan wizard is initiated at a site, then the wizard will select the other site in the site pair as the recovery site. For example, if you were to initiate the Recovery Plan wizard at SITE-A, then the wizard will auto select SITE-B as the recovery site and vice versa.

4. As shown in the following screenshot, select the Protection Group that you would like to use and click on Next to continue:

5. In the next wizard screen, click on Test Networks. The test networks are set to Auto by default. The Auto networks are isolated bubble networks that don’t connect to any physical network. So unless you have manually created an isolated test network port group at the recovery site, you can leave this at the Auto setting. Click on Next to continue:

6. In the next screen, enter a Recovery Plan Name and an optional Description and click on Next to continue. The Recovery Plan name can be any name of your choice.

7. In the Ready to Complete window, click on Finish to create the Recovery Plan.

8. You should see the Create Recovery Plan task completed successfully in the Recent Tasks pane.

===========================================================

Modify Recovery Plan

Now that our basic recovery plan is created, we can configure some of its advanced features and properties. These steps are not mandatory, but take a look and play around with the settings to get a better understanding of the product.

1. Select the recovery plan created in the previous step. Notice that the action bar above will now display some additional actions like Test, Cleanup, Recovery, Reprotect and Cancel.

2. Select the Virtual Machines tab. Select any VM and click Configure Recovery.

3. In the properties window you can modify the settings for this particular VM, such as the IP settings on the Protected Site and the Recovery Site, the priority group this VM belongs to, dependencies on other VMs, shutdown and startup actions, and pre-power on and post-power on steps.

4. Open the Recovery Steps tab and review the steps that will be executed during a recovery. You can view the Test, Cleanup, Recovery and Reprotect steps from the View drop down menu.
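The way priority groups and VM dependencies combine into a power-on order can be sketched in a few lines of Python. This is a simplified illustration, not SRM code: the dictionary layout and VM names are hypothetical, and the sketch assumes that a dependency only matters between VMs in the same priority group.

```python
from graphlib import TopologicalSorter


def startup_order(vms: dict[str, dict]) -> list[str]:
    """Order VMs by priority group, then by dependencies inside each group."""
    order: list[str] = []
    for group in sorted({cfg["priority"] for cfg in vms.values()}):
        members = {name for name, cfg in vms.items() if cfg["priority"] == group}
        sorter = TopologicalSorter()
        for name in sorted(members):
            # Assumption: dependencies on VMs outside this group are ignored.
            deps = [d for d in vms[name].get("depends_on", []) if d in members]
            sorter.add(name, *deps)
        order.extend(sorter.static_order())
    return order


# Hypothetical three-tier application.
vms = {
    "web01": {"priority": 2, "depends_on": ["app01"]},
    "app01": {"priority": 2},
    "db01": {"priority": 1},
}
print(startup_order(vms))  # ['db01', 'app01', 'web01']
```

The database starts first because it sits in the lower-numbered priority group; inside group 2 the web server waits for the application server because of the dependency.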

=================================================================

Testing and Performing a Failover and Failback

In the previous section, we learned how to create Protection Groups and Recovery Plans. Now we will learn how to test Recovery Plans that are already created, how to use them to perform a Failover, Planned migration, Reprotect, and Failback.

Testing a Recovery Plan

A Recovery Plan should be tested for its readiness to make sure that it would work as expected in the event of a real disaster. Most organizations periodically review and update their recovery runbook to make sure that they have an optimized, working plan for a recovery. With SRM, the testing of a Recovery Plan can now be automated.

1. Navigate to the vCenter Server’s inventory home page and click on Site Recovery.

2. Click on Recovery Plans in the left pane, select the Recovery Plan that you want to test, and click on the Test toolbar item to bring up the Test wizard, as shown in the following screenshot:

3. As shown in the following screenshot, the first screen of the wizard will indicate which of the sites have been designated as the protected and recovery sites, the site connection status, and the number of VMs protected:

4. By default, the storage option Replicate the recent changes to the recovery site is selected. I recommend leaving it selected: a Planned Migration also replicates the most recent changes, so it is important to test that the array can respond to an unscheduled replication request. However, we might not need to do this if the replication is synchronous. Click on Next to continue.

5. The next screen will summarize the selected options as shown in the following screenshot. Review them and click on Start to initiate the test:

6. You should now see a Test Recovery Plan task in the Recent Tasks pane. Navigate to the Recovery Steps tab to watch the progress of the test as shown in the following screenshot:

7. Once the test completes successfully, you will see the following Test Complete banner appear in the Summary tab of the Recovery Plan:

==========================================================

Performing the Cleanup after a test

We know from the previous section that while testing a Recovery Plan, SRM creates certain objects so that it can simulate a disaster recovery without affecting the running environment. These changes and objects are temporary and have to be cleaned up after a successful test. Fortunately, this is not a manual process either. SRM provides an automated method to perform a cleanup.

The following actions will occur during a cleanup:

• The ESXi hosts will be put back into the DPM standby mode

• The Recovery VMs will be powered off

• The Suspended noncritical VMs will be powered on

• The inventory entries of the Recovery VMs will be replaced with their corresponding Shadow VM entries

• The VMFS volume will be unmounted

• The LUN device will be detached

• The storage initiators will be rescanned and the storage system refreshed

• The writable snapshot that was created will be deleted

• The Port Group and the vSwitch that were created for the bubble network will be removed

The following procedure will guide you through the steps required for the cleanup:

1. Navigate to the vCenter Server’s inventory home page and click on Site Recovery.

2. Click on Recovery Plans in the left pane and select the Recovery Plan with the status Test Complete.

3. Click on the Cleanup item in the toolbar to bring up the cleanup wizard:

4. In the cleanup wizard, the details regarding the current protected and recovery sites, their connection status, and the number of protected VMs are displayed. Note that the Force Cleanup option is grayed out. This option becomes available only if a previous cleanup attempt has failed. Click on Next to continue.

5. The next screen will summarize the cleanup options selected. Click on Start to initiate the cleanup.

6. The Recent Tasks pane should show the Cleanup Test Recovery task as successfully completed.
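The relationship between a test, its cleanup, and the Force Cleanup option can be pictured as a tiny state machine. The Python sketch below is a simplified model under assumptions: the class and method names are hypothetical, only the Ready and Test Complete statuses mentioned in this series are modeled, and it assumes the plan returns to Ready after a successful cleanup.

```python
class PlanUnderTest:
    """Hypothetical model of the test/cleanup cycle; not an SRM API."""

    def __init__(self) -> None:
        self.status = "Ready"
        self.cleanup_failed = False

    def test(self) -> None:
        if self.status != "Ready":
            raise RuntimeError("the plan must be Ready before a test")
        self.status = "Test Complete"

    @property
    def force_cleanup_available(self) -> bool:
        # Force Cleanup is only offered after a cleanup attempt has failed.
        return self.cleanup_failed

    def cleanup(self, succeeded: bool = True, force: bool = False) -> None:
        if self.status != "Test Complete":
            raise RuntimeError("there is no completed test to clean up")
        if force and not self.force_cleanup_available:
            raise RuntimeError("Force Cleanup is grayed out")
        if succeeded or force:
            self.status = "Ready"  # assumption: the plan returns to Ready
            self.cleanup_failed = False
        else:
            self.cleanup_failed = True


plan = PlanUnderTest()
plan.test()
plan.cleanup(succeeded=False)        # first attempt fails
print(plan.force_cleanup_available)  # True
plan.cleanup(force=True)
print(plan.status)                   # Ready
```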

==================================================================

Performing a Planned Migration

VMware SRM can be used to migrate your workload from one site to another. A Planned Migration is done when the protected site is available and is running the virtual machine workload.

There are many use cases, of which the following two are prominent:

• When migrating your infrastructure to new hardware

• When migrating your virtual machine storage from one array to another

Note :- A Planned Migration will replicate the most recent changes with the help of storage replication. This is not optional.

The following procedure will guide you through the steps required to perform a Planned Migration:

1. Navigate to the vCenter Server’s inventory home page and click on Site Recovery.

2. Click on Recovery Plans in the left pane, select the Recovery Plan that was created for the Planned Migration, and click on the Recovery toolbar item, as shown in the following screenshot, to bring up the recovery wizard:

3. Read the information in the Recovery Confirmation window, check the “I understand that this process will permanently alter the virtual machine and infrastructure of both the protected and recovery datacenters.” box and make sure that Planned Migration is selected under the Recovery Type. Click Next to continue.

4. The next screen will summarize the wizard options that were selected. Click on Start to initiate the migration.

5. The Recent Tasks pane should now show the Failover Recovery Plan task as successfully completed.

6. When the recovery process completes, you should see a message as depicted in the screenshot below.

7. Notice that the VMs on the Protected Site are now powered off, while those on the Recovery Site are powered on.

The Planned Migration will not proceed further if any of the recovery steps fail. However, when you re-attempt the Planned Migration, it resumes from the step at which it failed. This lets you fix the problem and continue from there, saving a considerable amount of time.
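The stop-and-resume behaviour described above can be illustrated with a small Python sketch. It is a conceptual model only: the class name, the step names, and the simulated storage error are all hypothetical and do not reflect how SRM is implemented.

```python
from typing import Callable

Step = tuple[str, Callable[[], None]]


class ResumableRun:
    """Runs steps in order and remembers where it stopped."""

    def __init__(self, steps: list[Step]) -> None:
        self.steps = steps
        self.next_index = 0  # first step that has not completed yet

    def run(self) -> bool:
        while self.next_index < len(self.steps):
            name, action = self.steps[self.next_index]
            try:
                action()
            except Exception as exc:
                print(f"Stopped at '{name}': {exc}")
                return False  # completed steps are not repeated on the next run
            self.next_index += 1
        return True


attempts = {"sync": 0}
log: list[str] = []


def sync_storage() -> None:
    # Simulated fault: the first attempt fails, later attempts succeed.
    attempts["sync"] += 1
    if attempts["sync"] == 1:
        raise RuntimeError("array not reachable")
    log.append("sync")


run = ResumableRun([
    ("Shut down protected VMs", lambda: log.append("shutdown")),
    ("Synchronize storage", sync_storage),
    ("Power on recovery VMs", lambda: log.append("power on")),
])
first = run.run()   # stops at the storage step
second = run.run()  # resumes at the storage step and finishes
print(first, second, log)
```

The shutdown step appears in `log` only once, because the second run picks up at the step that failed instead of starting over.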

==================================================================

Performing a disaster recovery (Failover)

A Failover is performed when the protected site becomes fully or partially unavailable. We use a Recovery Plan that is already created and tested to perform the Failover. Keep in mind that SRM does not automatically determine the occurrence of a disaster at the protected site; hence, a recovery is always to be manually initiated.

The following steps show how to perform a Failover:

1. Navigate to the vCenter Server’s inventory home page and click on Site Recovery.

2. Click on Recovery Plans in the left pane, select the Recovery Plan that was created for the disaster recovery, and click on the Recovery toolbar item to bring up the recovery wizard.

3. In the recovery wizard, as shown in the following screenshot, agree to the Recovery Confirmation, set the Recovery Type as Disaster Recovery, and click on Next to continue:

4. The next screen will summarize the selected wizard options. Click on Start to perform the recovery.

5. The Recovery Steps tab of the Recovery Plan will show the progress of each of the steps involved.

6. Once the Failover is complete, the status of the Recovery Plan should read Recovery Complete.

The recovery steps involved in a disaster recovery (Failover) are the same as those of a Planned Migration, except that SRM ignores any unsuccessful attempts to pre-synchronize the storage or shut down the protected virtual machines.
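The difference between the two recovery types comes down to which step failures are tolerated. A minimal Python sketch of that rule, with hypothetical function and step names and a simulated connection error, could look like this:

```python
from typing import Callable

# Steps whose failure is ignored during a Disaster Recovery run.
TOLERATED_IN_DR = {"pre-synchronize storage", "shut down protected VMs"}


def run_recovery(recovery_type: str, steps: dict[str, Callable[[], None]]) -> list[str]:
    """Run the steps in order and return the names of failures that were ignored."""
    ignored: list[str] = []
    for name, action in steps.items():
        try:
            action()
        except Exception:
            if recovery_type == "disaster" and name in TOLERATED_IN_DR:
                ignored.append(name)
                continue
            raise  # a Planned Migration stops at the first failure
    return ignored


def unreachable() -> None:
    raise ConnectionError("protected site is down")


steps = {
    "pre-synchronize storage": unreachable,
    "shut down protected VMs": unreachable,
    "power on recovery VMs": lambda: None,
}
print(run_recovery("disaster", steps))
# ['pre-synchronize storage', 'shut down protected VMs']
```

Calling `run_recovery("planned", steps)` with the same steps raises the `ConnectionError` at the first step instead of continuing.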

============================================================

Initiate Unplanned Failover

This process is very similar to the one performed for the Planned Migration, but in this case the Protected Site is not available.

1. On the Protected Site, simulate a disaster by powering off the vCenter Server and the ESXi hosts containing the protected VMs.

2. On the Recovery Site, open Site Recovery Manager, navigate to Recovery Plans and press the red Recovery button, just as in the previous sections.

3. By now you should get a warning stating that the connection to the vCenter Server at the Protected Site has been lost.

4. In the Recovery Confirmation window, notice that the Planned Migration is now grayed out. Select the “I understand that this process will permanently alter the virtual machine and infrastructure of both the protected and recovery datacenters.” option and click Next to proceed.

5. Review the settings and press Start to begin the recovery process.

6. Switch over to the Recovery Steps tab and monitor the progress. Notice the errors stating that the connection to the remote server is down. Despite these failures, the recovery process continues. The VMs should be up and running in a couple of minutes.

If both the Planned Migration and the unplanned Failover complete successfully, your SRM implementation is properly configured.

=============================================================

Forced Recovery

Forced Recovery is used when the protected site is no longer operational enough to allow SRM to perform its tasks at the protected site before the Failover.

For instance, there is an unexpected power outage at the protected site causing not just the ESXi hosts but also the storage array to become unavailable. In this scenario, SRM cannot perform any of its tasks, such as shutting down the protected VMs or replicating the most recent storage changes (if the replication is asynchronous), at the protected site.

Enabling Forced Recovery for a site

Forced Recovery is not enabled by default, but it can be enabled in the site’s advanced settings.

To do so, perform the following steps:

1. Navigate to the vCenter Server’s inventory home page and click on Site Recovery.

2. Click on Sites on the left pane.

3. Right-click on the site and click on Advanced Settings.

4. In the Advanced Settings window, select the category recovery from the left pane.

5. Select the checkbox against the recovery.forceRecovery setting, as shown in the following screenshot, and click on OK to enable Forced Recovery:

Running Forced Recovery

Running Forced Recovery will skip all the steps that otherwise should have been performed against the protected site. You should use Forced Recovery only during circumstances where the protected site is completely down, leaving no connectivity to either the ESXi hosts or the storage array.
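Conceptually, Forced Recovery filters the run down to the steps that target the recovery site. The Python sketch below illustrates that idea only: the `RecoveryStep` class, the step names, and their site labels are hypothetical and are not taken from SRM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryStep:
    name: str
    site: str  # "protected" or "recovery"


def steps_to_run(steps: list[RecoveryStep], forced: bool) -> list[RecoveryStep]:
    """With forced=True, keep only the steps that run against the recovery site."""
    if not forced:
        return list(steps)
    return [step for step in steps if step.site == "recovery"]


steps = [
    RecoveryStep("Shut down protected VMs", "protected"),
    RecoveryStep("Synchronize storage", "protected"),
    RecoveryStep("Power on recovery VMs", "recovery"),
]
print([step.name for step in steps_to_run(steps, forced=True)])
# ['Power on recovery VMs']
```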

The following steps show how Forced Recovery is executed:

1. Navigate to the vCenter Server’s inventory home page and click on Site Recovery.

2. Click on Recovery Plans in the left pane, right-click on the Recovery Plan that you want to run, and click on Recovery.

3. In the recovery wizard, select the I understand that this process will permanently alter the virtual machines and infrastructure of both the protected and recovery datacenters checkbox.

4. Select the Recovery Type as Disaster Recovery, select the checkbox Forced Recovery – recovery site operations only, and click on Next:

5. You will be prompted to confirm the Forced Recovery. Click on Yes to confirm.

6. Review the operation summary and click on Start to initiate the Forced Recovery.

=======================================================

Reprotecting a site

After you Failover the workload from a protected site to the recovery site, the recovery site has no protection enabled for the new workload that it has begun hosting. SRM provides a method to enable protection of the recovery site. This method is called Reprotect.

A Reprotect operation will reverse the direction of the replication, thus designating the recovery site as the new protected site. The Reprotect operation can only be done on a Recovery Plan with the Recovery Complete status. Also, keep in mind that a Reprotect operation can only be executed once you have repaired the failed site and made it available to act as a recovery site.

For instance, let’s assume that SITE-A and SITE-B are the protected and recovery sites, respectively. If the workload at SITE-A were failed over to SITE-B, then to Reprotect SITE-B, SITE-A has to be made accessible again. This means fixing the problems that caused the failure at SITE-A.

The following steps show how to perform the Reprotect operation:

1. Navigate to the vCenter Server’s inventory home page and click on Site Recovery.

2. Click on Recovery Plans in the left pane, select the Recovery Plan with the Recovery Complete status, as shown in the following screenshot, and click on the toolbar item Reprotect:

3. In the Reprotect wizard screen, agree to the Reprotect Confirmation and click on Next to continue:

4. In the next screen, click on Start to begin the Reprotect operation.

5. You should see a progressing Reprotect Recovery Plan task in the Recent Tasks pane. Also, the Recovery Steps tab will show the progress of every step involved in the Reprotect operation.

6. The status of the Recovery Plan after a successful Reprotect operation should read Ready.

======================================================

Failback to the protected site

Once the original protected site has been fixed after a Failover and is available to host the virtual machine workload again, you can use SRM to automate a Failback.

The Failback, although automated, is a two-step process, which is as follows:

1. Perform a Reprotect operation to reverse the direction of the replication.

2. Perform a Failover by running the Recovery Plan again, this time towards the original protected site.
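The two-step Failback can be pictured as a small state model in which Reprotect swaps the site roles and Failover moves the workload. The Python sketch below is an illustration only: the `SitePair` class and its methods are hypothetical, and the status strings are simply the ones used in this series.

```python
from dataclasses import dataclass


@dataclass
class SitePair:
    protected: str
    recovery: str
    workload_at: str
    status: str = "Ready"

    def failover(self) -> None:
        if self.status != "Ready":
            raise RuntimeError("the Recovery Plan must be Ready")
        self.workload_at = self.recovery
        self.status = "Recovery Complete"

    def reprotect(self, failed_site_available: bool = True) -> None:
        if self.status != "Recovery Complete":
            raise RuntimeError("Reprotect needs the Recovery Complete status")
        if not failed_site_available:
            raise RuntimeError("repair the failed site before running Reprotect")
        # Reverse the replication direction by swapping the site roles.
        self.protected, self.recovery = self.recovery, self.protected
        self.status = "Ready"

    def failback(self) -> None:
        self.reprotect()  # step 1
        self.failover()   # step 2


pair = SitePair(protected="SITE-A", recovery="SITE-B", workload_at="SITE-A")
pair.failover()
print(pair.workload_at)  # SITE-B
pair.failback()
print(pair.workload_at)  # SITE-A
```

In this model the roles stay reversed after the Failback (SITE-B is still the protected site), so one more Reprotect would be needed to make SITE-A the protected site again.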

============================================================

This concludes the series about implementing and configuring the VMware vCenter Site Recovery Manager 5. I hope you enjoyed the series  🙂 

Cheers…..Roshan Jha!

Note :- I have used pictures in this post from the SRM book written by Abhilash GB and from the blog (http://defaultreasoning.com) by Marek.Z, and would like to thank both of them 🙂
