Showing posts with label Site Recovery Manager. Show all posts

Sunday, 8 July 2018

Unable To Pair SRM Sites: "Server certificate chain not verified"

So first things first: as of this post, you might know that I have moved on from VMware and ventured further into the backup and recovery solutions domain. Currently, I work as a solutions engineer at Rubrik.

There are many instances where you are unable to manage anything in Site Recovery Manager, regardless of the version (this also applies to vSphere Replication), and the common error that pops up at the bottom right of the web client is "Server certificate chain not verified":

Failed to connect to vCenter Server at vCenter_FQDN:443/sdk. Reason:
com.vmware.vim.vmomi.core.exception CertificateValidationException: Server certificate chain not verified.

This article covers only the embedded Platform Services Controller deployment model. Similar logic needs to be extrapolated to external deployments.

These issues are typically seen when:
> PSC is migrated from embedded to external
> Certificates are replaced for the vCenter

I will be simplifying the KB article here for reference. Before proceeding, take a powered-off snapshot of the PSC and vCenters involved.

So, for embedded deployment of VCSA:

1. SSH into the VCSA and run the below command:
# /usr/lib/vmidentity/tools/scripts/lstool.py list --url https://localhost/lookupservice/sdk --no-check-cert --ep-type com.vmware.cis.cs.identity.sso 2>/dev/null

This command returns the SSL trust that is currently registered on your PSC. Now, assume you are using an embedded PSC deployment in production and another embedded deployment in DR (no Enhanced Linked Mode). In this case, when you run the above command, you should see just one entry, where the URL is the FQDN of your current PSC node and the SSL trust listed with it is the one currently registered for that node.

URL: https://current-psc.vmware.local/sts/STSService/vsphere.local
SSL trust: MIIDWDCCAkCgAwIBAgIJANr+++MJ5+WxMA0GCSqGSIb3DQEBCwUAMGgxCzAJBgNV
...Reducing output...
LqSKWg/apP1XlBV1VyC5LPZrH/rrq8+Naoj7i/P6HAzTwAAL+O10ggClaP8=

If this is your case, proceed to step (2); if not, jump to step (3).

2. Run the next command:
# echo | openssl s_client -connect localhost:443

This returns the certificate that your deployment is actually presenting after the certificate replacement. Look at the extract that shows the server certificate.

Server certificate
-----BEGIN CERTIFICATE-----
MIIDWDCCyAHikleBAgIJANr+++MJ5+WxMA0GCSqGSIb3DQEBCwUAMGgxCzAJBgNV
...
LqSKWg/apP1XlBV1VyC5LPZrH/rrq8+Naoj7i/P6HAzTwAAL+O10fGhhDDqm=
-----END CERTIFICATE-----

So here the SSL trust obtained from the first command (the trust currently registered in the PSC) does not match the certificate from the second command (the one actually in use). This mismatch is what produces the "chain not verified" message in the UI.
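If you would rather not compare the two blobs by eye, here is a minimal sketch of the comparison. It assumes you have pasted the "SSL trust" value from step 1 and the PEM block from step 2 into two strings; the function names are my own and not part of any VMware tooling.

```python
import re


def normalize_cert(text: str) -> str:
    """Reduce a pasted certificate to its bare base64 body."""
    # Drop the PEM armor lines if they are present.
    text = re.sub(r"-----(BEGIN|END) CERTIFICATE-----", "", text)
    # Drop the "SSL trust:" label that prefixes the value in the step 1 output.
    text = re.sub(r"^\s*SSL trust:\s*", "", text, flags=re.MULTILINE)
    # Remove line breaks and any other whitespace introduced by wrapping.
    return re.sub(r"\s+", "", text)


def trusts_match(lookup_trust: str, served_cert: str) -> bool:
    """True when the registered trust and the served certificate are identical."""
    return normalize_cert(lookup_trust) == normalize_cert(served_cert)
```

If `trusts_match(...)` returns False, you are in the mismatch situation described above.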

To fix this, the logic is: find all the services registered with the thumbprint of the old SSL trust (step 1) and update them with the certificate from step 2. The steps in the KB article are a bit confusing, so this is what I follow to fix it.

A) Copy the SSL trust you obtained from the first command into Notepad++: everything starting from MIID... in my case (no need to include the "SSL trust:" label).

B) Each line of the certificate body should contain 64 characters. In Notepad++, place the cursor after a character and check the Col value at the bottom. Hit Enter when it reads Col: 65, which means the cursor sits right after the 64th character.
Format the complete chain this way (the last line might have fewer than 64 characters, which is okay).

C) Add -----BEGIN CERTIFICATE----- on a line before the chain and -----END CERTIFICATE----- on a line after it (5 hyphens are used before and after).

D) Save the Notepad++ document with a .cer extension.

E) Open the certificate file that you saved and navigate to Details > Thumbprint. You will notice a hexadecimal string with a space after every 2 characters. Copy this into Notepad++ and replace each space with a colon, so you end up with a thumbprint similar to: 13:1E:60:93:E4:E6:59:31:55:EB:74:51:67:2A:99:F8:3F:04:83:88

F) Next, we will export the current certificate using the below command
# /usr/lib/vmware-vmafd/bin/vecs-cli entry getcert --store MACHINE_SSL_CERT --alias __MACHINE_CERT --output /certificates/new_machine.crt

This will export the current certificate to the /certificates directory (create the directory first if it does not exist).

G) Run the update thumbprint option using the below command:
# cd /usr/lib/vmidentity/tools/scripts/
# python ls_update_certs.py --url https://FQDN_of_Platform_Services_Controller/lookupservice/sdk --fingerprint Old_Certificate_Fingerprint --certfile New_Certificate_Path_from_/Certificates --user Administrator@vsphere.local --password 'Password'

So a sample command would be:
# python ls_update_certs.py --url https://vcsa.vmware.local/lookupservice/sdk --fingerprint 13:1E:60:93:E4:E6:59:31:55:EB:74:51:67:2A:99:F8:3F:04:83:88 --certfile /certificates/new_machine.crt --user Administrator@vsphere.local --password 'Password'

This will look for all the services registered with the old thumbprint (ending in 04:83:88) and update them with the certificate from new_machine.crt.

Note: Once you paste the thumbprint into the SSH session, remove any extra spaces before and after the thumbprint. I have seen the update fail because of this, as the paste sometimes picks up special characters that are not visible in the terminal. So delete the space after --fingerprint and type it back in. Do the same before the --certfile switch too.
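Steps A through E can also be done without Notepad++ and the Windows certificate viewer. The following is a rough sketch using only the Python standard library. It assumes the thumbprint shown under Details > Thumbprint is the SHA-1 digest of the DER-encoded certificate (which matches the 20-byte sample above); the function names are hypothetical.

```python
import base64
import hashlib


def to_pem(b64_blob: str) -> str:
    """Wrap a one-line base64 certificate into 64-character PEM lines."""
    body = "".join(b64_blob.split())  # drop spaces and line breaks from the paste
    lines = [body[i:i + 64] for i in range(0, len(body), 64)]
    return (
        "-----BEGIN CERTIFICATE-----\n"
        + "\n".join(lines)
        + "\n-----END CERTIFICATE-----\n"
    )


def sha1_thumbprint(b64_blob: str) -> str:
    """Return the SHA-1 thumbprint as colon-separated uppercase hex pairs."""
    der = base64.b64decode("".join(b64_blob.split()))
    digest = hashlib.sha1(der).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))
```

Pass the old SSL trust from step 1 to `sha1_thumbprint` and use the result as the `--fingerprint` value. Building the string in a script also sidesteps the invisible-character problem from the note above.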

Re-run the commands in steps 1 and 2 and the SSL trust should now match. If it does, log back in to the web client and you should be good to go.

--------------------------------------------------------------------------------------------------------------------------
3. In this case, you might see two PSC entries in the output of the step 1 command. Both entries have the same URL, and they might or might not have the same SSL trust.

Case 1:
Multiple URLs with different SSL trust.

This would be an easy one. In the output from step 1, you will see two entries with the same PSC URL but different SSL trusts, and one of them will match the current certificate from step 2. This means the one that does not match is stale and can be removed from the STS store.

You can remove it from the CLI; however, I stick to using the JXplorer tool to remove it from the GUI. You can connect to the PSC from JXplorer using this KB article here.

Once connected, navigate to Configuration > Sites > LookupService > Service Registrations.
One of the fields in the output of the step 1 command is Service ID, which looks similar to:
Service ID: 04608398-1493-4482-881b-b35961bf5141

Locate this service ID in the service registrations and you should be good to remove it.

Case 2:
Multiple URLs with the same SSL trust.

In this case, the output from step 1 shows two entries with the same PSC URL and the same SSL trust. These might or might not match the output from step 2.

The first step of this fix is:

Note down both service IDs from the output of step 1. Connect with JXplorer as mentioned above. Select the service ID and, on the right side, click the Table Editor view and click Submit. You can now view the last modified date of this service registration. The service ID with the older last modified date is the stale registration and can be removed via JXplorer. Now, when you run the command from step 1, it should return one entry. If this matches the certificate from step 2, great! If not, the additional step of updating the thumbprint (described in step 2) needs to be performed.
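To quickly tell which of the two cases in step 3 you are in, a small parser over the saved step 1 output can help. This is only a sketch: it assumes each registration prints a "Service ID:" line before its "URL:" and "SSL trust:" lines, and that a wrapped trust continues on lines without a colon. Verify this against your own output before relying on it. All names are my own.

```python
from collections import defaultdict


def parse_registrations(output: str):
    """Collect service ID, URL and SSL trust for each registration block."""
    regs, current, in_trust = [], None, False
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("Service ID:"):
            current = {"service_id": line.split(":", 1)[1].strip(), "url": "", "trust": ""}
            regs.append(current)
            in_trust = False
        elif current is None:
            continue  # ignore anything before the first registration
        elif line.startswith("URL:"):
            current["url"] = line.split(":", 1)[1].strip()
            in_trust = False
        elif line.startswith("SSL trust:"):
            current["trust"] = line.split(":", 1)[1].strip()
            in_trust = True
        elif in_trust and line and ":" not in line:
            current["trust"] += line  # continuation of a wrapped base64 value
        else:
            in_trust = False
    return regs


def classify(regs, served_trust: str):
    """Map each URL to (case, service IDs whose trust differs from the served cert)."""
    by_url = defaultdict(list)
    for reg in regs:
        by_url[reg["url"]].append(reg)
    findings = {}
    for url, group in by_url.items():
        stale = [r["service_id"] for r in group if r["trust"] != served_trust]
        if len(group) == 1:
            findings[url] = ("single", stale)
        elif len({r["trust"] for r in group}) > 1:
            findings[url] = ("case 1", stale)
        else:
            findings[url] = ("case 2", stale)
    return findings
```

For "case 1" the returned service IDs are the candidates to remove. For "case 2" the trusts are identical, so you still need the last modified dates from JXplorer to pick the stale one.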

In the event of an external PSC deployment, say one PSC in the production site and one in the recovery site in ELM, the command from step 1 should return two entries with two different URLs (production and DR PSC), since they are replicating. This will of course change if there are multiple PSCs replicating with or without a load balancer. The process would be too complex to explain in text, so in this event it would be best to involve VMware Support for assistance.

Hope this helps!

Wednesday, 30 May 2018

Unable To Configure Or Reconfigure Protection Groups In SRM: java.net.SocketTimeoutException: Read timed out

When you try to reconfigure or create a new protection group you might run into the following message when you click on the Finish option.

 java.net.SocketTimeoutException: Read timed out

Below is a screenshot of this error:


In the web client logs for the respective vCenter you will see the below logging: 

[2018-05-30T13:49:14.548Z] [INFO ] health-status-65 com.vmware.vise.vim.cm.healthstatus.AppServerHealthService Memory usage: used=406,846,376; max=1,139,277,824; percentage=35.7109010137285%. Status: GREEN
[2018-05-30T13:49:14.549Z] [INFO ] health-status-65   c.v.v.v.cm.HealthStatusRequestHandler$HealthStatusCollectorTask Determined health status 'GREEN' in 0 ms
[2018-05-30T13:49:20.604Z] [ERROR] http-bio-9090-exec-16 70002318 100003 200007 c.v.s.c.g.wizard.addEditGroup.ProtectionGroupMutationProvider Failed to reconfigure PG [DrReplicationVmProtectionGroup:vm-protection-group-11552:672e1d34-cbad-46a4-ac83-a7c100547484]:  com.vmware.srm.client.topology.client.view.PairSetup$RemoteLoginFailed: java.net.SocketTimeoutException: Read timed out
        at com.vmware.srm.client.topology.impl.view.PairSetupImpl.remoteLogin(PairSetupImpl.java:126)
        at com.vmware.srm.client.infraservice.util.TopologyHelper.loginRemoteSite(TopologyHelper.java:398)
        at com.vmware.srm.client.groupservice.wizard.addEditGroup.ProtectionGroupMutationProvider.apply(ProtectionGroupMutationProvider.java:184)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.vmware.vise.data.provider.DelegatingServiceBase.invokeProviderInternal(DelegatingServiceBase.java:400)


Caused by: com.vmware.vim.vmomi.client.exception.ConnectionException: java.net.SocketTimeoutException: Read timed out
        at com.vmware.vim.vmomi.client.common.impl.ResponseImpl.setError(ResponseImpl.java:252)
        at com.vmware.vim.vmomi.client.http.impl.HttpExchange.run(HttpExchange.java:51)
        at com.vmware.vim.vmomi.client.http.impl.HttpProtocolBindingBase.executeRunnable(HttpProtocolBindingBase.java:226)
        at com.vmware.vim.vmomi.client.http.impl.HttpProtocolBindingImpl.send(HttpProtocolBindingImpl.java:110)
        at com.vmware.vim.vmomi.client.common.impl.MethodInvocationHandlerImpl$CallExecutor.sendCall(MethodInvocationHandlerImpl.java:613)
        at com.vmware.vim.vmomi.client.common.impl.MethodInvocationHandlerImpl$CallExecutor.executeCall(MethodInvocationHandlerImpl.java:594)
        at com.vmware.vim.vmomi.client.common.impl.MethodInvocationHandlerImpl.completeCall(MethodInvocationHandlerImpl.java:345)

There might be an issue with the ports between the vCenter and the SRM server; you can validate those ports using this KB here.

If the ports are fine, then validate that no guest-level security agents on the SRM or vCenter (Windows) machines are blocking this traffic.
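For a quick reachability check before digging into the KB, a plain TCP connect test is usually enough. Below is a minimal sketch. The hostnames and port numbers in the usage note are placeholders of mine; take the authoritative port list from the KB.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, run `port_is_open("vcenter.example.local", 443)` from the SRM server and `port_is_open("srm.example.local", 9086)` from the vCenter side. A False result points at the network or a firewall rather than at SRM itself.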

In my case the network connection and firewall / security settings were fine, and the fix was to perform a Modify on the SRM installation on both sites. Once this was done, we reconfigured the site pairing, after which we were able to reconfigure the protection groups successfully.

Thursday, 26 April 2018

SRM Service Fails To Start: "Could not initialize Vdb connection Data source name not found and no default driver specified"

In a few cases, you might come across a scenario where the Site Recovery Manager service does not start, and in the Event Viewer you will notice the following backtrace for the vmware-dr service.

VMware vCenter Site Recovery Manager application error.
class Vmacore::Exception "DBManager error: Could not initialize Vdb connection: ODBC error: (IM002) - [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified"
[backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.5.1, build: build-6014840, tag: vmware-dr, cpu: x86_64, os: windows, buildType: release
backtrace[00] vmacore.dll[0x001F29FA]
backtrace[01] vmacore.dll[0x00067EA0]
backtrace[02] vmacore.dll[0x0006A85E]
backtrace[03] vmacore.dll[0x00024064]
backtrace[04] vmware-dr.exe[0x00107621]
backtrace[05] MSVCR120.dll[0x00066920]
backtrace[06] MSVCR120.dll[0x0005E36D]
backtrace[07] ntdll.dll[0x00092A63]
backtrace[08] vmware-dr.exe[0x00014893]
backtrace[09] vmware-dr.exe[0x00015226]
backtrace[10] windowsService.dll[0x00002BF5]
backtrace[11] windowsService.dll[0x00001F24]
backtrace[12] sechost.dll[0x00005ADA]
backtrace[13] KERNEL32.DLL[0x000013D2]
backtrace[14] ntdll.dll[0x000154E4]
[backtrace end]  

There are no logs generated in vmware-dr.log and the ODBC connection test completes successfully too. 

However, when you open the vmware-dr.xml file located under C:\Program Files\VMware\VMware vCenter Site Recovery Manager\config and search for the <DBManager> tag, you will notice that the <dsn> name is incorrect.

After providing the right DSN name within <dsn> </dsn>, you will notice a new backtrace when you attempt to start the service again:
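To read the configured DSN without scrolling through the file, something like the sketch below works. It assumes only what is described above, namely a <dsn> element somewhere inside <DBManager>. The rest of the file layout is not assumed, and the function name is my own.

```python
import xml.etree.ElementTree as ET


def configured_dsn(xml_text: str):
    """Return the text of the first <dsn> found under a <DBManager> element, or None."""
    root = ET.fromstring(xml_text)
    # iter() also yields the root itself, so this works at any nesting depth.
    for db_manager in root.iter("DBManager"):
        dsn = db_manager.find(".//dsn")
        if dsn is not None and dsn.text:
            return dsn.text.strip()
    return None
```

Read vmware-dr.xml into a string, pass it in, and compare the result with the System DSN name shown in the 64-bit ODBC administrator.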

VMware vCenter Site Recovery Manager application error.
class Vmacore::InvalidArgumentException "Invalid argument"
[backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.5.1, build: build-6014840, tag: vmware-dr, cpu: x86_64, os: windows, buildType: release
backtrace[00] vmacore.dll[0x001F29FA]
backtrace[01] vmacore.dll[0x00067EA0]
backtrace[02] vmacore.dll[0x0006A85E]
backtrace[03] vmacore.dll[0x00024064]
backtrace[04] listener.dll[0x0000BCBC]

What I suspect is that something has gone wrong with the vmware-dr.xml file, and the fix for this is to re-install the SRM application against the existing database.

After this, the service starts successfully. Hope this helps.

Thursday, 12 April 2018

SRM Test Recovery Fails: "Failed to create snapshots of replica devices"

When using SRM with array based replication, a test recovery operation will take a snapshot of the replica LUN, present it and mount it on the ESX server to bring up the VMs on an isolated network. 

In many instances, the test recovery would fail at the crucial step, which is taking a snapshot of the replica device. The GUI would mention: 

Failed to create snapshots of replica devices 

In this case, always look into the vmware-dr.log on the recovery site of the SRM. In my case I noticed the below snippet:

2018-04-10T11:00:12.287+01:00 error vmware-dr[16896] [Originator@6876 sub=SraCommand opID=7dd8a324:9075:7d02:758d] testFailoverStart's stderr:
--> java.io.IOException: Couldn't get lock for /tmp/santorini.log
--> at java.util.logging.FileHandler.openFiles(Unknown Source)
--> at java.util.logging.FileHandler.<init>(Unknown Source)
=================BREAK========================
--> Apr 10, 2018 11:00:12 AM com.emc.santorini.log.KLogger logWithException
--> WARNING: Unknown error: 
--> com.sun.xml.internal.ws.client.ClientTransportException: HTTP transport error: javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
--> at com.sun.xml.internal.ws.transport.http.client.HttpClientTransport.getOutput(Unknown Source)
--> at com.sun.xml.internal.ws.transport.http.client.HttpTransportPipe.process(Unknown Source)
--> at com.sun.xml.internal.ws.transport.http.client.HttpTransportPipe.processRequest(Unknown Source)
--> at com.sun.xml.internal.ws.transport.DeferredTransportPipe.processRequest(Unknown Source)
--> at com.sun.xml.internal.ws.api.pipe.Fiber.__doRun(Unknown Source)
=================BREAK========================
--> Caused by: javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
--> at sun.security.ssl.InputRecord.handleUnknownRecord(Unknown Source)
--> at sun.security.ssl.InputRecord.read(Unknown Source)
--> at sun.security.ssl.SSLSocketImpl.readRecord(Unknown Source)

2018-04-10T11:00:12.299+01:00 error vmware-dr[21512] [Originator@6876 sub=AbrRecoveryEngine opID=7dd8a324:9075:7d02:758d] Dr::Providers::Abr::AbrRecoveryEngine::Internal::RecoverOp::ProcessFailoverFailure: Failed to create snapshots of replica devices for group 'vm-protection-group-45026' using array pair 'array-pair-2038': (dr.storage.fault.CommandFailed) {
-->    faultCause = (dr.storage.fault.LocalizableAdapterFault) {
-->       faultCause = (vmodl.MethodFault) null, 
-->       faultMessage = <unset>, 
-->       code = "78814f38-52ff-32a5-806c-73000467afca.1049", 
-->       arg = <unset>
-->       msg = ""
-->    }, 
-->    faultMessage = <unset>, 
-->    commandName = "testFailoverStart"
-->    msg = ""
--> }
--> [context]

So here the SRA attempts to establish a connection with RecoverPoint over HTTP, which is disabled from 3.5.x onwards. We need to make RP and SRM communicate over HTTPS.

On the SRM, perform the below:

1. Open CMD in admin mode and navigate to the below location:
c:\Program Files\VMware\VMware vCenter Site Recovery Manager\storage\sra\array-type-recoverpoint

2. Then run the below command:
"c:\Program Files\VMware\VMware vCenter Site Recovery Manager\external\perl-5.14.2\bin\perl.exe" command.pl --useHttps true

In 6.5 I have seen the path to be external\perl\perl\bin\perl.exe, so verify the correct path before running the second command.

You should ideally see an output like:
Successfully changed to HTTPS security mode

3. Perform this on both the SRM sites. 

On the RPA, perform the below:

1. Log in to each RPA with the boxmgmt account

2. [2] Setup > [8] Advanced Options > [7] Security Options > [1] Change Web Server Mode 
(option number may change)

3. You will be then presented with this message:
Do you want to disable the HTTP server (y/n)?

4. Disable HTTP and repeat this on production and recovery RPA cluster. 

Restart the SRM service on both sites and re-run the test recovery and this should now complete successfully. 

Hope this helps. 

Tuesday, 27 March 2018

Embedded Replication Server Disconnected In vSphere Replication 5.8

A vSphere Replication server comes with an embedded replication service that manages all the replication traffic and vR queries, in addition to the option of deploying add-on servers. In vSphere Replication 5.8 or older, there are scenarios where this embedded replication server is displayed as disconnected. Since the embedded service is disconnected, the replications will be in an RPO violation state, as the replication traffic cannot be managed.

In the hbrsrv.log on the vSphere replication appliance, located under /var/log/vmware, we see the below:

repl:/var/log/vmware # grep -i "link-local" hbrsrv*

hbrsrv-402.log:2018-03-23T11:25:24.914Z [7F70AC62E720 info 'HostCreds' opID=hs-init-1d08f1ab] Ignoring link-local address for host-50: "fe80::be30:5bff:fed9:7c52"

hbrsrv.log:2018-03-23T11:25:24.914Z [7F70AC62E720 info 'HostCreds' opID=hs-init-1d08f1ab] Ignoring link-local address for host-50: "fe80::be30:5bff:fed9:7c52"

So, this is seen when the VMs being replicated are on an ESX host that has an IPv6 link-local address enabled while the host itself uses IPv4 addressing.

The logs here refer to the host by its MoID, so you can find the host name from the vCenter MOB page, https://<vcenter-ip>/mob
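If many hosts are involved, the grep output gets noisy because the same message repeats across rotated logs. Here is a small sketch that reduces it to the unique host IDs and their link-local addresses. It assumes the message format shown in the log lines above; the function name is my own.

```python
import re

# Matches the message shown in hbrsrv.log and captures the host ID and address.
LINK_LOCAL = re.compile(r'Ignoring link-local address for (host-\d+): "([^"]+)"')


def hosts_with_link_local(lines):
    """Map each host ID to the set of link-local addresses logged for it."""
    found = {}
    for line in lines:
        match = LINK_LOCAL.search(line)
        if match:
            found.setdefault(match.group(1), set()).add(match.group(2))
    return found
```

Feed it the lines of the hbrsrv*.log files (for example from `open(path)`), and the keys of the result are the host IDs to act on.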

To navigate to the host MoID section:

Content > group-d1 (Datacenters) > (Your datacenter) under childEntity > group-xx under hostFolder > domain-xx (Under childEntity) > locate the host ID

Then, using this hostname, disable IPv6 on the referenced ESX host:
> Select the ESXi 
> Select Configuration
> Select Networking
> Edit Settings for vmk0 (Management) port group
> IP Address, Un-check IPv6

Then reboot that ESX host. Repeat the steps for the remaining ESX hosts too, and then finally reboot the vSphere Replication appliance.

Now, there should no longer be link-local logging in hbrsrv.log and the embedded server should be connected allowing the RPO syncs to resume.

Hope this helps!

Friday, 2 March 2018

SRM CentOS 7.4 IP Customization Fails

If you are using SRM 6.0.x or SRM 6.1.x and you try to test failover a CentOS 7.4 machine with IP customization, the Customize IP section of the recovery fails with the message:

The guest operating system '' is not supported


In the vmware-dr.log on the DR site SRM, you will notice the following:

2018-03-02T02:10:43.405Z [01032 error 'Recovery' ctxID=345cedf opID=72d8d85a] Plan 'CentOS74' failed: (vim.fault.UnsupportedGuest) {
-->    faultCause = (vmodl.MethodFault) null, 
-->    property = "guest.guestId", 
-->    unsupportedGuestOS = "", 
-->    msg = ""
--> }

This is because CentOS 7.4 is not among the supported guests in the imgcust binaries of the 6.0 release. For CentOS 7.4 customization to work, SRM needs to be on a 6.5 release. In my case, I upgraded vCenter to 6.5 Update 1 and SRM to 6.5.1, after which the test recovery completed without issues.

If you have no plan for an immediate upgrade of your environment but would still like the customizations to complete, then use this workaround.

If you look at the redhat-release file
# cat /etc/redhat-release

The contents are:
CentOS Linux release 7.4.1708 (Core)

So you remove this and then add:
Red Hat Enterprise Linux Server release 7.0 (Maipo)

Since RHEL 7.0 is supported in imgcust for 6.0 the test recovery completes fine. Hope this helps!
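If you have several guests to touch, the edit can be scripted. This is a rough sketch of the workaround only: it keeps a copy of the original file next to it and then writes the RHEL 7.0 line. The ".orig" backup name and the function name are my own choices, and on some systems /etc/redhat-release is a symlink, in which case the file it points to gets rewritten.

```python
import shutil
from pathlib import Path

RHEL_LINE = "Red Hat Enterprise Linux Server release 7.0 (Maipo)\n"


def masquerade_release(path: str = "/etc/redhat-release") -> Path:
    """Back up the release file once, then replace its contents with the RHEL 7.0 line."""
    target = Path(path)
    backup = target.with_name(target.name + ".orig")
    if not backup.exists():
        # Keep the first original only, so re-running never overwrites it.
        shutil.copy2(target, backup)
    target.write_text(RHEL_LINE)
    return backup
```

Run it as root inside the guest. To undo the change, copy the ".orig" file back over the release file.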

Tuesday, 20 February 2018

SRM Service Crashes Due To CR/LF Conversion

In a PostgreSQL SRM deployment, the service might crash if the Carriage Return / Line Feed (CR/LF) conversion option is enabled. The backtrace does not tell much; even with trivia logging I could not make much out of it. This is what I saw in vmware-dr.log:

2018-02-20T01:05:26.359Z [02712 verbose 'Replication.Folder'] Reconstructing folder '_replicationRoot':'DrReplicationRootFolder' from the database
2018-02-20T01:05:26.468Z [02712 panic 'Default'] 
--> 
--> Panic: TerminateHandler called
--> Backtrace:
--> 
--> [backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.1.1, build: build-3884620, tag: -
--> backtrace[00] vmacore.dll[0x001C568A]
--> backtrace[01] vmacore.dll[0x0005CA8F]
--> backtrace[02] vmacore.dll[0x0005DBDE]
--> backtrace[03] vmacore.dll[0x001D7405]
--> backtrace[04] vmacore.dll[0x001D74FD]
--> backtrace[05] vmacore.dll[0x001D9FD0]
<<<<SHORTENED BACKTRACE>>>>
--> backtrace[40] vmacore.dll[0x00065FEB]
--> backtrace[41] vmacore.dll[0x0015BC50]
--> backtrace[42] vmacore.dll[0x001D2A5B]
--> backtrace[43] MSVCR90.dll[0x00002FDF]
--> backtrace[44] MSVCR90.dll[0x00003080]
--> backtrace[45] KERNEL32.DLL[0x0000168D]
--> backtrace[46] ntdll.dll[0x00074629]
--> [backtrace end]

The last OpID is 02712, and even when I search for it with trivia logging enabled, it does not give me much information.

Apparently, the CR/LF bit in the ODBC driver causes some kind of truncation, which in turn crashes the SRM service.


This setting is available under:
ODBC 64 bit > System DSN > SRM DSN (Configure) > Datasource > Page 2

Uncheck this option and then the service should start successfully.

Hope this kind of helps!

Tuesday, 6 February 2018

SRM Plugin Not Available In Web Client

Today, while working on a fresh SRM 6.1.1 deployment, we were unable to see the Site Recovery Manager plugin in the web client. The first thing we do in this case is go to the Managed Object Browser page and check whether the SRM extension is registered successfully. The URL for the MOB page is https://vcenter-ip-or-fqdn/mob

Here we browse further to content > ExtensionManager. Under the properties section, we should have an SRM extension, which is com.vmware.vcDr, by default. If you have installed SRM with a custom identifier then you would see something like, com.vmware.vcDr-<your-custom-identifier-name>
In our case, the extension was available.

Next, looking at the web client logs, in our case a vCenter appliance, we noticed the following:

[2018-02-06T12:00:13.283+03:00] [ERROR] vc-extensionmanager-pool-81  70000046 100002 200001 com.vmware.vise.vim.extension.VcExtensionManager Package com.vmware.vcDr-custom was not installed!
Error downloading https://SRM-Local-IP:9086/srm-client.zip. Make sure that the URL is reachable then logout/login to force another download. java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:668)
        at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)

So the vCenter was unable to pull the plugin from that URL on its own, and as a result there was no SRM plugin folder under the plugin package folder. The location of this plugin package folder on the vCenter appliance is:
# /etc/vmware/vsphere-client/vc-packages/vsphere-client-serenity

Here you should have a "com.vmware.vcDr-<version-ID>" folder which in our case was missing. So we had to manually dump this package in this location.

To fix this:
1. From a browser, navigate to the URL from the log, https://SRM-Local-IP:9086/srm-client.zip
This will prompt you to download the plugin zip file. Download this file and copy it into the above-mentioned vsphere-client-serenity location via WinSCP.

2. Now, we will have to manually create the plugin folder. There are a few catches to this.

If you are using default plugin identifier for SRM, then the naming convention would be:
com.vmware.vcDr-<srm-version-string>

If you are using custom identifier for SRM, then the naming convention would be:
com.vmware.vcDr-customName-<srm-version-string> 

How do you find this exact SRM version string?

A) Go back to the MOB page where you had left off in ExtensionManager. Click the com.vmware.vcDr extension. This will in turn open a new page.

B) Here click on the client under VALUE 

C) Now you can see the version string and the value. In a 6.1.1 SRM for example, the version string is 6.1.1.1317

So the plugin folder now will be:
Default:
com.vmware.vcDr-6.1.1.1317

Custom:
com.vmware.vcDr-custom-6.1.1.1317

3. Copy the zip file into this folder and then extract it. The outcome would be a plugin-package.xml and a plugins folder.

4. Restart the web client service for the vCenter. The command varies for 6.5 and 6.0 vCenter.

5. Log back in to the web client once it loads up and you should have the plugin.
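Steps 2 and 3 above (naming the folder and unpacking the zip into it) could be scripted roughly as follows. The naming rule is taken straight from the convention described above. The function names are hypothetical, and the version string still has to come from the MOB page.

```python
import zipfile
from pathlib import Path


def plugin_folder_name(version: str, custom_id: str = None) -> str:
    """Build com.vmware.vcDr-<version> or com.vmware.vcDr-<custom>-<version>."""
    base = "com.vmware.vcDr"
    if custom_id:
        base += f"-{custom_id}"
    return f"{base}-{version}"


def stage_plugin(zip_path: str, packages_dir: str, version: str, custom_id: str = None) -> Path:
    """Create the plugin folder under packages_dir and extract the client zip into it."""
    dest = Path(packages_dir) / plugin_folder_name(version, custom_id)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return dest
```

With the 6.1.1 example, `plugin_folder_name("6.1.1.1317")` gives the default folder name and `plugin_folder_name("6.1.1.1317", "custom")` gives the custom one. After staging, the folder should contain plugin-package.xml and the plugins folder, as described in step 3.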

Hope this helps!

Thursday, 25 January 2018

SRM Service Crashes During A Recovery Operation With timedFunc BackTrace

In a few scenarios, when you run a test recovery or a planned migration, the SRM service will crash. This might happen when you run a specific recovery plan or any recovery plan.

If you look into the vmware-dr.log you will notice the following back-trace:

--> Panic: VERIFY d:\build\ob\bora-3884620\srm\public\functional/async/timedFunc.h:210
-->
--> Backtrace:
-->
--> [backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.1.1, build: build-3884620, tag: -
--> backtrace[00] vmacore.dll[0x001C568A]
--> backtrace[01] vmacore.dll[0x0005CA8F]
--> backtrace[02] vmacore.dll[0x0005DBDE]
--> backtrace[03] vmacore.dll[0x001D7405]
--> backtrace[04] vmacore.dll[0x001D74FD]
xxxxxxxxxxxxxxxxxxxxx Cut Logs Here xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
--> backtrace[36] ntdll.dll[0x000154E4]
--> [backtrace end]
-->

The timedFunc back-trace is seen when "Wait For VMware Tools" is set to 0 minutes and 0 seconds.

A few lines above this back-trace, you will see the faulty VM that caused the crash.

You will see something similar to:

2018-01-21T08:37:05.421-05:00 [44764 info 'VmDomain' ctxID=57d5ae61 opID=21076ff:c402:4147:d883] Waiting for VM '[vim.VirtualMachine:b2ab3f04-c72e-43ca-b93d-de1566e4de14:vm-323]' to reach desired powered state 'poweredOff' within '0' seconds.

The VM ID is given here (vm-323 in this case). To map this VM ID to a VM name, you will need to go to the vCenter MOB page.

This is how I correlated it:
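When the log is large, you can pull out the affected VM IDs with a short script. This is a sketch that assumes the message format of the line shown above; the function name is my own.

```python
import re

# Captures the VM MoID and the timeout from the "Waiting for VM" message.
WAIT_LINE = re.compile(
    r"Waiting for VM '\[vim\.VirtualMachine:[^:\]]+:(vm-\d+)\]' "
    r"to reach desired powered state '\w+' within '(\d+)' seconds"
)


def vms_with_zero_timeout(lines):
    """Return the VM MoIDs that were given a 0 second wait, in order of appearance."""
    hits = []
    for line in lines:
        match = WAIT_LINE.search(line)
        if match and int(match.group(2)) == 0 and match.group(1) not in hits:
            hits.append(match.group(1))
    return hits
```

Run it over the lines of vmware-dr.log. Each returned ID (such as vm-323) can then be looked up in the MOB.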
1. Login to MOB page for vCenter (https://vcenter-ip/mob)
2. Content > group-d1 (Datacenters)
3. Respective datacenter under "Child Entity"
4. Then under vmFolder group-v4 (vm)
5. Expand childEntity and this will list out all the VMs in that vCenter.

My output was similar to:


The VM was CentOS7.2

> Then navigate to the Recovery plan in SRM
> Select the affected Recovery plan this VM is part of > Related Objects > Virtual Machines
> Right click this VM and select Configure Recovery

Here the Wait For VMware Tools timeout was set to 0 minutes and 0 seconds. We had to change this to a valid non-zero value.


After this, the recovery plan completed fine without crashing the SRM service. This should ideally be fixed in newer SRM releases by not allowing a 0 timeout to be set.

Hope this helps!

Wednesday, 17 January 2018

Resetting Site Recovery Manager's Embedded DB Password

This article applies only to the Postgres database, which is the embedded DB option available during install. If you have forgotten the database password, you will not be able to log in to the DB to alter tables or perform a repair / modify on the SRM instance.

Before resetting the password, make sure the SRM machine is on a snapshot and a backup is available for the DB.

1. First, we need to edit the pg_hba.conf file to treat all local connections as trusted, so that no password is required to authenticate. The pg_hba.conf file is located under:
C:\ProgramData\VMware\VMware vCenter Site Recovery Manager Embedded Database\data\pg_hba.conf
Make a backup of the file before editing it.

Locate this section in the conf file:

# TYPE DATABASE USER ADDRESS METHOD
# IPv4 local connections:
host all all 127.0.0.1/32 md5
# IPv6 local connections:
host all all ::1/128 md5
# Allow replication connections from localhost, by a user with the
# replication privilege.
#host replication postgres 127.0.0.1/32 md5
#host replication postgres ::1/128 md5

Replace that complete set with this:

# TYPE DATABASE USER ADDRESS METHOD
# IPv4 local connections:
host all all 127.0.0.1/32 trust
# IPv6 local connections:
host all all ::1/128 trust
# Allow replication connections from localhost, by a user with the
# replication privilege.
#host replication postgres 127.0.0.1/32 md5
#host replication postgres ::1/128 md5

Save the file.

2. Restart the SRM Embedded database service from services.msc

3. Open a command prompt in admin mode to log in to the database. Navigate to the below bin directory:
C:\Program Files\VMware\VMware vCenter Site Recovery Manager Embedded Database\bin

4. Connect to the postgres database using:
psql -U postgres -p 5678

Port might vary if you had a custom install of the database. 

5. Run the below query to change the password:
ALTER USER "enter srm db user here" PASSWORD 'new_password';

The SRM DB user / port information can be found in the 64-bit ODBC connection.
A successful execution will return the output: "ALTER ROLE"

6. Revert the changes made to the pg_hba.conf file so that md5 authentication is again required for users to log in to the SRM database.

7. Restart the SRM Embedded DB service again
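The pg_hba.conf edits in steps 1 and 6 are the same change in opposite directions, so they can be handled by one helper. The sketch below assumes the file looks like the excerpt above, with five whitespace-separated fields on the local host lines. It leaves commented lines alone, and the function name is my own.

```python
def set_local_auth_method(conf_text: str, method: str) -> str:
    """Set the auth method on the uncommented IPv4/IPv6 localhost 'host' lines."""
    local_addresses = ("127.0.0.1/32", "::1/128")
    result = []
    for line in conf_text.splitlines():
        fields = line.split()
        # Lines starting with '#host' split to '#host', so they never match here.
        if len(fields) == 5 and fields[0] == "host" and fields[3] in local_addresses:
            fields[4] = method
            line = " ".join(fields)  # note: column alignment is collapsed to single spaces
        result.append(line)
    return "\n".join(result) + "\n"
```

Call it with "trust" before resetting the password and with "md5" afterwards, writing the result back to pg_hba.conf each time. You still need the backup copy and the database service restarts from the steps above.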

After this, the SRM service will fail to restart, and you will notice the following backtrace in vmware-dr.log:

2018-01-17T02:25:15.628Z [01748 error 'WindowsService'] Application error:
--> std::exception 'class Vmacore::Exception' "DBManager error: Could not initialize Vdb connection: ODBC error: (08001) - FATAL:  password authentication failed for user "srmadmin"
--> "
--> 
--> [backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.1.1, build: build-3884620, tag: -
--> backtrace[00] vmacore.dll[0x001C568A]
--> backtrace[01] vmacore.dll[0x0005CA8F]
--> backtrace[02] vmacore.dll[0x0005DBDE]
--> backtrace[03] vmacore.dll[0x0001362B]
--> backtrace[04] vmware-dr.exe[0x0015C59A]
--> backtrace[05] MSVCR90.dll[0x00074830]
--> backtrace[06] MSVCR90.dll[0x00043B3C]
--> backtrace[07] ntdll.dll[0x0009CED3]
--> backtrace[08] vmware-dr.exe[0x000060AF]
--> backtrace[09] vmware-dr.exe[0x00006A5E]
--> backtrace[10] windowsService.dll[0x00002BCE]
--> backtrace[11] windowsService.dll[0x000020DD]
--> backtrace[12] sechost.dll[0x000081D5]
--> backtrace[13] KERNEL32.DLL[0x0000168D]
--> backtrace[14] ntdll.dll[0x00074629]
--> [backtrace end]

Run a Modify on the SRM instance from Add/Remove programs and provide the new database password during this process and the service will start up just fine. 

Hope this helps.

Thursday, 11 January 2018

vSphere Replication Sync Fails With Exception: com.vmware.hms.replication.sync.DeltaAbortedException

There are a few instances where a vSphere Replication sync (RPO-based or manual) fails with a DeltaAbortedException. This in turn also affects a test or planned migration performed with Site Recovery Manager.

In the hms.log located under /opt/vmware/hms/logs on the vSphere Replication Server, you will notice something like:

2018-01-10 14:35:59.950 ERROR com.vmware.hms.replication.sync.ReplicationSyncManager [hms-sync-progress-thread-0] (..replication.sync.ReplicationSyncManager) operationID=fd66efca-f070-429c-bc89-f2164e9dbb7a-HMS-23613 | Completing sync operation because of error: {OnlineSyncOperation, OpId=fd66efca-f070-429c-bc89-f2164e9dbb7a-HMS-23613, GroupMoId=GID-2319507d-e668-4eea-aea9-4d7d241dd886, ExpInstSeqNr=48694, TaskMoId=HTID-56fd57dd-408b-4861-a124-70d8c53a1194, InstanceId=2f900595-2822-4f2b-987d-4361f703505c, OpState=started, VcVmMoid=vm-28686, createInstanceRetryCount=2, fullSyncOngoing=false, operationId=null}
com.vmware.hms.replication.sync.DeltaAbortedException
        at com.vmware.hms.replication.sync.SyncOperation.checkHealth(SyncOperation.java:911)
        at com.vmware.hms.replication.sync.SyncOperation$4.run(SyncOperation.java:735)
        at com.vmware.hms.util.executor.LoggerOpIdConfigurator$RunnableWithDiagnosticContext.run(LoggerOpIdConfigurator.java:133)
        at com.vmware.hms.util.executor.LoggerOpIdConfigurator$2.run(LoggerOpIdConfigurator.java:100)
        at com.vmware.jvsl.sessions.net.impl.TlsPreservingWrapper$2.run(TlsPreservingWrapper.java:47)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

This occurs when the outgoingeventlogentity and incomingeventlogentity tables in the vR database contain a large number of entries.

Apply the following fix at your own risk. Have a snapshot and/or a backup of the vR server before performing the change.

1. Navigate to the VRMS database's bin directory:
# cd /opt/vmware/vpostgres/9.x/bin
The postgres version varies depending on the replication server release.

2. Backup the replication database using the below command:
# ./pg_dump -U vrmsdb -Fp -c > /tmp/DBBackup.bak

3. Connect to the vR database using:
# ./psql -U vrmsdb

4. Run the below queries to extract the number of events for the logentity tables:

select count(*) from outgoingeventlogentity; 
select count(*) from incomingeventlogentity;

In my case, the output on the production site vR was:

vrmsdb=# select count(*) from incomingeventlogentity;
 count
-------
 21099
(1 row)

vrmsdb=# select count(*) from outgoingeventlogentity;
 count
-------
   146
(1 row)

And on the recovery site, the outgoingeventlogentity table had 21k+ events.

5. First, change the max event age limit to 10800 in the hms-configuration.xml file located at:
# vi /opt/vmware/hms/conf/hms-configuration.xml

This should be the output after the edit:
<hms-eventlog-maxage>10800</hms-eventlog-maxage>

6. Next, purge the event logs from the above-mentioned tables. The tables have a lot of columns, as you will see if you run select * from <table-name>;
The one column we need is the "timestamp" column.

The timestamp column holds a value like this: 1515479242006
This is epoch time in milliseconds. To convert it to a human-readable date:

> Remove the last 3 digits from the above value.
So 1515479242006 becomes 1515479242. Then convert this epoch time (in seconds) to a regular date using any epoch converter.
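
If you do not have an online converter handy, the same conversion can be done in a couple of lines of Python. This is only a sketch and the function name event_time is my own. It drops the last three digits exactly as described above and interprets the rest as epoch seconds in UTC.

```python
from datetime import datetime, timezone


def event_time(timestamp_ms):
    """Convert a value from the "timestamp" column (epoch ms) to a UTC datetime."""
    # Integer division by 1000 is the same as removing the last 3 digits.
    return datetime.fromtimestamp(timestamp_ms // 1000, tz=timezone.utc)


print(event_time(1515479242006).isoformat())
```

For the sample value 1515479242006, this works out to 9 January 2018, 06:27:22 UTC.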

Now, pick a cutoff timestamp such that everything older than it is purged from the database. In the purge query, use the complete value (all 13 digits) in the same format as the timestamp column. Then, run the below queries:

DELETE from incomingeventlogentity WHERE timestamp < 1515479242006;
DELETE from outgoingeventlogentity WHERE timestamp < 1515479242006;
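
To go the other way, from a date you have chosen to the 13-digit value the query needs, here is a small sketch. The helper names purge_cutoff_ms and purge_statements are hypothetical. It only builds the two DELETE statements shown above as text and does not touch the database.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def purge_cutoff_ms(cutoff):
    """Return epoch milliseconds for a timezone-aware cutoff datetime."""
    # Integer arithmetic on timedeltas avoids float rounding.
    return (cutoff - EPOCH) // timedelta(milliseconds=1)


def purge_statements(cutoff):
    """Build the DELETE statements for both event log tables."""
    ms = purge_cutoff_ms(cutoff)
    tables = ("incomingeventlogentity", "outgoingeventlogentity")
    return ["DELETE from %s WHERE timestamp < %d;" % (t, ms) for t in tables]


for statement in purge_statements(datetime(2018, 1, 9, tzinfo=timezone.utc)):
    print(statement)
```

Paste the printed statements into the psql session from step 3 once you are happy with the cutoff.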

7. Then restart the hms service using:
# systemctl stop hms
# systemctl start hms

The above is applicable from 6.1.2 vR onward. For lower versions:
# service hms restart

8. Re-pair the sites and then perform a sync now operation and we should be good to go. 

Hope this helps!

Wednesday, 6 December 2017

SRM Service Crashes After A Failed Recovery With "abrRecoveryEngine" Backtrace

In some instances, when you are running array-based replication with SRM, a failed planned migration might cause the SRM service to crash. In the vmware-dr.log found on the SRM machine, you will notice the following backtrace:

2017-12-06T09:55:38.620-05:00 panic vmware-dr[06076] [Originator@6876 sub=Default] 
--> 
--> Panic: Assert Failed: "ok (Dr::Providers::Abr::AbrRecoveryEngine::AbrRecoveryEngineImpl::LoadFromDb: Unable to insert post failover info object 212337205 for group vm-protection-group-121101624 array pair array-pair-7065)" @ d:/build/ob/bora-6014840/srm/src/providers/abr/common/abrRecoveryEngine/abrRecoveryEngine.cpp:244
--> Backtrace:
--> [backtrace begin] product: VMware vCenter Site Recovery Manager, version: 6.5.1, build: build-6014840, tag: vmware-dr, cpu: x86_64, os: windows, buildType: release
--> backtrace[00] vmacore.dll[0x001F29FA]
--> backtrace[01] vmacore.dll[0x00067D60]
--> backtrace[02] vmacore.dll[0x0006A20E]
--> backtrace[03] vmacore.dll[0x002245A7]
--> backtrace[04] vmacore.dll[0x00224771]
--> backtrace[05] vmacore.dll[0x00059C0D]
--> backtrace[06] dr-abr-recoveryEngine.dll[0x00028A91]
--> backtrace[07] dr-abr-recoveryEngine.dll[0x00015199]
--> backtrace[08] dr-abr-recoveryEngine.dll[0x002DB368]
--> backtrace[09] dr-abr-recoveryEngine.dll[0x002DB913]
--> backtrace[10] vmacore.dll[0x001D6ACC]
--> backtrace[11] vmacore.dll[0x001865AB]
--> backtrace[12] vmacore.dll[0x0018759C]
--> backtrace[13] vmacore.dll[0x002202E9]
--> backtrace[14] MSVCR120.dll[0x00024F7F]
--> backtrace[15] MSVCR120.dll[0x00025126]
--> backtrace[16] KERNEL32.DLL[0x000013D2]
--> backtrace[17] ntdll.dll[0x000154E4]
--> [backtrace end]

This is seen when there are issues unmounting or demoting the source datastore.

Disclaimer: Modifying database tables is normally done by VMware support. Do this at your own risk.

The fix is:

1. Make sure SRM service is stopped on both sites
2. Backup the SRM databases on both sites
3. Log in to the database using either pgAdmin or SQL Server Management Studio, depending on the type of database used
4. Open the table "pda_grouppostfailoverinfo"
5. Here we need to remove the row whose db_id matches the object ID shown in the backtrace. In my case it is: 212337205
6. Once this is done, start the SRM service. If it crashes again, the log usually reports another object ID; repeat the process for that ID.
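
Since the service may crash more than once with a different object ID each time, it can help to pull the ID out of vmware-dr.log with a script. This is a rough sketch: the function names are my own, the regular expression is based only on the panic line shown above, and the SQL is my reading of step 5 (delete the row by db_id), so run the SELECT first and check the row before deleting anything.

```python
import re

# Matches the assert message from the panic line in vmware-dr.log.
PANIC_RE = re.compile(
    r"Unable to insert post failover info object (\d+) for group (\S+)"
)


def failed_object(log_text):
    """Return (object_id, group_id) from the panic message, or None."""
    match = PANIC_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def cleanup_sql(object_id):
    """Build an inspection query and the delete for the given db_id."""
    return [
        "SELECT * FROM pda_grouppostfailoverinfo WHERE db_id = %d;" % object_id,
        "DELETE FROM pda_grouppostfailoverinfo WHERE db_id = %d;" % object_id,
    ]
```

Feed the contents of vmware-dr.log to failed_object and pass the returned ID to cleanup_sql to get the statements for that run.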

And that should be it.

Thursday, 30 November 2017

Unable To Protect a VM In SRM: "Object not found"

So there's a rare instance where you will be unable to protect a VM and the error it throws out is:
Internal error: class Vmacore::NotFoundException "Object not found"

Under Protection Groups > Related Objects > Virtual Machines, you will see the VM coming up as Not Configured.


And when you right-click the VM and select Configure Protection, you will notice that the Device Status comes up as Non-replicated.



And if you browse the recovery location and provide the path of the replicated VMDK, you will run into this error.

In the web client logs, you will see:

[2017-11-28T09:27:50.156-06:00] [ERROR] srm-client-thread-1253 70015389 101315 201173 com.vmware.srm.client.infraservice.tasks.FakeTaskImpl [DrVmodlFakeTask:srm-fake-task-11:fake-server-guid]: com.vmware.vim.binding.dr.fault.DrRuntimeFault: Task Failed
at com.vmware.srm.client.infraservice.util.ExceptionUtil.newRuntimeFault(ExceptionUtil.java:92)
at com.vmware.srm.client.infraservice.util.ExceptionUtil.newRuntimeFault(ExceptionUtil.java:68)
at com.vmware.srm.client.infraservice.tasks.MultiTaskProgressUpdaterImpl.getSingleError(MultiTaskProgressUpdaterImpl.java:89)
at com.vmware.srm.client.infraservice.tasks.MultiTaskProgressUpdaterImpl.updateProgress(MultiTaskProgressUpdaterImpl.java:222)
at com.vmware.srm.client.infraservice.tasks.MultiTaskProgressUpdaterImpl$3.run(MultiTaskProgressUpdaterImpl.java:431)
at $java.lang.Runnable$$FastClassByCGLIB$$36fc6471.invoke(<generated>)
at net.sf.cglib.proxy.MethodProxy.invoke(MethodProxy.java:149)
at com.vmware.srm.client.topology.impl.osgi.aop.HttpRequestContextAdvice$CallInterceptor.intercept(HttpRequestContextAdvice.java:53)
at com.vmware.srm.client.topology.impl.osgi.aop.HttpRequestContextAdvice$Base$$EnhancerByCGLIB$$b6ab80b4.run(<generated>)
at com.vmware.srm.client.infraservice.tasks.MultiTaskProgressUpdaterImpl$4.run(MultiTaskProgressUpdaterImpl.java:442)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.vmware.vim.binding.dr.fault.InternalError: Internal error: class Vmacore::NotFoundException "Object not found"
[context]zKq7AVMEAQAAAHjHWwAUdm13YXJlLWRyAACoLwpkci1yZXBsaWNhdGlvbi5kbGwAAGEbCgASaT8AAy5BAOv/QACT9EABuSMCY29ubmVjdGlvbi1iYXNlLmRsbAABx3QCAccrAgGg8AABPUMBAccrAgGSLgMBdwgDARb3AgHHKwIBuSMCAXcIAwEW9wIBxysC[/context].
at sun.reflect.GeneratedConstructorAccessor614.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)



One possible reason is that the source VMX file has some corrupt or incorrect entries.
So let's have a look at the VM's vmx file.

I will be looking for lines in this file that have a datastore path reference like:

vmx.log.filename = "/vmfs/volumes/58780b1d-045e1100-0efa-0025b5e01a45/Test-1/vmware.log"
sched.swap.derivedName = "/vmfs/volumes/59a30e4d-647fd9f2-2e66-000c295e9f61/Test-1/Test-1-932448b9.vswp"

I have two UUIDs here, 58780b1d-045e1100-0efa-0025b5e01a45 and 59a30e4d-647fd9f2-2e66-000c295e9f61

But, when I run:

[root@Wendy:/vmfs/volumes/59a30e4d-647fd9f2-2e66-000c295e9f61/Test-1] esxcfg-scsidevs -m
mpx.vmhba1:C0:T0:L0:3                                            /vmfs/devices/disks/mpx.vmhba1:C0:T0:L0:3 599ffcb3-d9ece508-7576-000c295e9f61  0  Wendy-Local
mpx.vmhba1:C0:T1:L0:1                                            /vmfs/devices/disks/mpx.vmhba1:C0:T1:L0:1 59a30e4d-647fd9f2-2e66-000c295e9f61  0  VDP-Storage

The host only has these two datastores mounted, and the UUID 58780b1d-045e1100-0efa-0025b5e01a45 from the VMX file is not one of them. This incorrect reference is what causes the device status to show as Non-replicated, which in turn causes issues with VM protection.
You might have one or more such entries in the VMX file. 
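
If the VMX file is long, the comparison can be scripted. Below is a minimal sketch in Python. The function names mounted_uuids and stale_vmx_entries are my own, and the UUID pattern is an assumption based on the 8-8-4-12 format seen in the output above. It takes the text of the VMX file and the output of esxcfg-scsidevs -m and lists the VMX keys that point to a datastore UUID the host does not have.

```python
import re

# VMFS datastore UUID as seen above: 8-8-4-12 lowercase hex digits.
UUID = r"[0-9a-f]{8}-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{12}"


def mounted_uuids(scsidevs_output):
    """Collect datastore UUIDs from the output of esxcfg-scsidevs -m."""
    return set(re.findall(r"\b" + UUID + r"\b", scsidevs_output))


def stale_vmx_entries(vmx_text, mounted):
    """Return (key, uuid) for VMX lines that reference an unknown datastore."""
    stale = []
    for line in vmx_text.splitlines():
        match = re.search(r"/vmfs/volumes/(" + UUID + r")/", line)
        if match and match.group(1) not in mounted:
            key = line.split("=", 1)[0].strip()
            stale.append((key, match.group(1)))
    return stale
```

With the Test-1 example above, only the vmx.log.filename entry is reported, since the swap file path already points to VDP-Storage.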

Power off the virtual machine on source and then backup the VMX file and edit it to provide the UUID of the datastore where the VM resides / the appropriate UUID where the respective files should reside. In my case the Test-1 VM runs on VDP-Storage, which is 59a30e4d-647fd9f2-2e66-000c295e9f61

So the new VMX entries look like:

vmx.log.filename = "/vmfs/volumes/59a30e4d-647fd9f2-2e66-000c295e9f61/Test-1/vmware.log"
sched.swap.derivedName = "/vmfs/volumes/59a30e4d-647fd9f2-2e66-000c295e9f61/Test-1/Test-1-932448b9.vswp"

Reload the VMX using:

# vim-cmd vmsvc/reload <vm-id>

The vm-id can be obtained from

# vim-cmd vmsvc/getallvms

Then power on the VM, right-click it in the protection group and configure recovery. This time the hard drive status will be displayed as Replicated.


And that's pretty much it. This is usually seen when the vmware.log file is configured to be written to a different datastore and that particular datastore is no longer available.

Hope this helps.

Tuesday, 25 April 2017

Unable To Reinstall SRM With Existing Database "The selected vCenter server does not match the one used in the previous installation"

There are scenarios where you will have to reinstall a vCenter site. When a vCenter site is reinstalled, the SRM solution connected to it has to be reinstalled as well. The reinstall is usually done with an existing database because you would like to keep your protection groups and recovery plans without having to recreate them all. However, this reinstall with an existing database can fail with the following error:

The selected vCenter server does not match the one used in the previous installation

This is because every vCenter has an instance UUID associated with it. In this case, the vCenter being used was 6.0 and SRM was 6.1.

Look at the instance.cfg file, which is located under:
Appliance:
/etc/vmware-vpx

Windows:
Installation directory/VMware/vCenterServer/vmware-vpx/cfg

In this instance.cfg you will have the vCenter UUID.
instanceUuid=c4c2202e-4acd-4b35-a2da-0947a3429658

When SRM is registered to vCenter, this instanceUuid is saved in the SRM database. Now, when the vCenter is reinstalled, you will have a new instanceUuid which does not match the one stored in the SRM database, and the installation fails.

So, you will have to manually update the SRM database with this instanceUuid prior to installation.

The table you need to look at in the SRM database is pd_localsite. If you connect to your DB and view this table, you will see something like the below. I am using Embedded Postgres for SRM, so the query would be:

select * from pd_localsite;

The output:
 db_id |    mo_id    | ref_count |            name             | vcaddress | vcport | vccertthumbprint |                 uuid                 |                        siteurl                        |            vcinstanceuuid            |    domain    |                                  sucertificatepath
-------+-------------+-----------+-----------------------------+-----------+--------+------------------+--------------------------------------+-------------------------------------------------------+--------------------------------------+--------------+--------------------------------------------------------------------------------------
     1 | DrLocalSite |         1 | vcenter-prod.happycow.local |           |      0 |                  | 0a063f18-a09c-4126-9556-db5bd30c37dd | https://psc-prod.happycow.local:443/lookupservice/sdk | c4c2202e-4acd-4b35-a2da-0947a3429658 | vmware.local | C:\Program Files\VMware\VMware vCenter Site Recovery Manager\bin\10.109.10.164su.p12
(1 row)

So the vcinstanceuuid value in this table needs to match the instanceUuid in instance.cfg.
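
The post does not spell out the update itself, so here is a hedged sketch of how it could be prepared. The helper names are hypothetical, and the UPDATE statement is my assumption based on the table and column names above, with db_id = 1 taken from the sample output. Back up the SRM database and check the row with the select query first.

```python
import uuid


def read_instance_uuid(cfg_text):
    """Return the instanceUuid value from the text of instance.cfg, or None."""
    for line in cfg_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "instanceUuid":
            # uuid.UUID raises ValueError if the value is not a valid UUID.
            return str(uuid.UUID(value.strip()))
    return None


def build_update(instance_uuid, db_id=1):
    """Build an UPDATE for pd_localsite (assumed statement, not from VMware)."""
    checked = str(uuid.UUID(instance_uuid))
    return (
        "UPDATE pd_localsite SET vcinstanceuuid = '%s' WHERE db_id = %d;"
        % (checked, int(db_id))
    )
```

Compare the value returned by read_instance_uuid with the vcinstanceuuid column. If they differ, the statement from build_update is what I would expect to run against the SRM database.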

Once this is done, proceed with the SRM installation and it will work just fine. 

Sunday, 8 January 2017

Part 5: Creating Recovery Plans In SRM 6.1

Part 4: Creating protection groups for virtual machines in SRM 6.1

Once you create a protection group, it's time to create a recovery plan. When you want to perform a DR test or a test recovery, it is the recovery plan that you will execute. A recovery plan runs a set of steps in a particular order to fail over the VMs, or to test the failover, to the recovery site. You cannot change the workflow of the recovery plan; however, you can customize it by adding your required checks and tasks in between.

Select the production site in SRM inventory and under Summary Tab select Create a recovery plan.


Provide a name for the recovery plan and an optional description and click Next.


Select the recovery site where you want the VMs to failover to and click Next.


The Group type will be VM protection groups. Select the required protection groups to be added to this recovery plan. Only the VMs in the protection groups added to the recovery plan will be failed over in the event of a disaster. Click Next.


SRM also offers a Test Recovery. A test recovery does a test failover of the protected VMs to the recovery site without impacting the production VMs or their network identity. A test network or bubble network (a network with no uplinks) is created on the recovery site, and the VMs are placed there and brought up to verify that the recovery plan works. Keep the default auto create settings and click Next.


Review your recovery plan settings and click Finish to complete the create recovery plan wizard.


If you select the protected site and go to Related Objects > Recovery Plans, you will see this recovery plan listed.


If you select Recovery Plans in the Site Recovery inventory, you will see the status of the plan and its related details.


Before you test your recovery, you will have to configure this recovery plan. Browse to Recovery Plans > Related Objects > Virtual Machines. The VMs available under this recovery plan will be listed. Right-click the virtual machine and select Configure Recovery.


There are two options here, Recovery properties and IP customization.

The recovery properties cover the order of VM startup, VM dependencies and additional steps that have to be carried out during and after Power On.

Since I just have one virtual machine in this recovery plan, the priority and the dependencies do not really matter. Set these options according to your requirements.


In the IP Customization option, you will provide the network details for the virtual machine in the Primary and the Recovery Site.


Select Configure Protection and you will be asked to configure the IP settings of the VM in the protected site. If you have VMware Tools running on this machine (recommended), click Retrieve and it will auto-populate the IP settings. Click the DNS option and enter the DNS IP and the domain name manually. Click OK to complete. The same steps have to be performed for the recovery site under Configure Recovery; however, all the IP details have to be entered manually (if DHCP is not used) since there is no powered-on VM with VMware Tools on the recovery site.


Once both are complete, you should see the below information in the IP Customization section. Click OK to finish configuring VM recovery.


Once this is performed for all the virtual machines in the recovery plan, the plan customization is complete and ready to be tested. You can also use the DR IP Customization tool to configure VM recovery settings.

In the next article, we will have a look at testing a recovery plan.