
Sunday, 8 July 2018

Unable To Pair SRM Sites: "Server certificate chain not verified"

So first things first: as of this post, I have moved on from VMware and ventured further into the backup and recovery domain. I currently work as a solutions engineer at Rubrik.

There are a lot of instances where you are unable to manage anything in Site Recovery Manager, regardless of the version (this is also applicable to vSphere Replication), and the common error that pops up at the bottom right of the web client is "Server certificate chain not verified":

Failed to connect to vCenter Server at vCenter_FQDN:443/sdk. Reason:
com.vmware.vim.vmomi.core.exception CertificateValidationException: Server certificate chain not verified.

This article briefly covers only the embedded Platform Services Controller deployment model. Similar logic needs to be extrapolated to external deployments.

These issues are typically seen when:
> The PSC is migrated from embedded to external
> Certificates are replaced on the vCenter

I will be simplifying this KB article here for reference. Before proceeding, take a powered-off snapshot of the PSC and vCenter nodes involved.

So, for embedded deployment of VCSA:

1. SSH into the VCSA and run the below command:
# /usr/lib/vmidentity/tools/scripts/lstool.py list --url https://localhost/lookupservice/sdk --no-check-cert --ep-type com.vmware.cis.cs.identity.sso 2>/dev/null

This command will give you the SSL trust that is currently stored in your PSC's lookup service. Now, consider that you are using an embedded PSC deployment in production and another embedded deployment in DR (no Enhanced Linked Mode). In this case, when you run the above command, you are expected to see just one entry, where the URL is the FQDN of your current PSC node and associated with it is its current SSL trust.

URL: https://current-psc.vmware.local/sts/STSService/vsphere.local
SSL trust: MIIDWDCCAkCgAwIBAgIJANr+++MJ5+WxMA0GCSqGSIb3DQEBCwUAMGgxCzAJBgNV
...Reducing output...
LqSKWg/apP1XlBV1VyC5LPZrH/rrq8+Naoj7i/P6HAzTwAAL+O10ggClaP8=

If this is your case, proceed to step (2); if not, jump to step (3).

2. Run the next command:
# echo | openssl s_client -connect localhost:443

This shows the certificate that is actually in use by your deployment after the certificate replacement. Look at the part of the output that contains the server certificate chain:

Server certificate
-----BEGIN CERTIFICATE-----
MIIDWDCCyAHikleBAgIJANr+++MJ5+WxMA0GCSqGSIb3DQEBCwUAMGgxCzAJBgNV
...
LqSKWg/apP1XlBV1VyC5LPZrH/rrq8+Naoj7i/P6HAzTwAAL+O10fGhhDDqm=
-----END CERTIFICATE-----

So here, the chain obtained from the first command (the SSL trust recorded in the PSC) does not match the chain from the second command (the SSL trust actually in use). Due to this mismatch, you see the "chain not verified" message in the UI.
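
If you prefer comparing thumbprints to eyeballing base64 blobs, you can pipe the live certificate from step 2 straight into openssl x509 to print its SHA1 fingerprint; a quick sketch, to be compared against the thumbprint you will compute from the step 1 SSL trust later on:

# echo | openssl s_client -connect localhost:443 2>/dev/null | openssl x509 -noout -fingerprint -sha1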

To fix this, the logic is: find all the services registered with the thumbprint of the old SSL trust (step 1) and update them with the thumbprint from step 2. The steps in the KB article are a bit confusing, so this is what I follow to fix it.

A) Copy the SSL trust you obtained from the first command into Notepad++, that is, everything starting from MIID... in my case (no need to include the SSL trust label itself).

B) Each line of the chain should contain 64 characters. In Notepad++, place the cursor after a character and check the Col indicator at the bottom; hit Enter at the point where it reads Col: 65 (that is, after 64 characters).
Format the complete chain this way (the last line may have fewer than 64 characters, which is okay).

C) Append -----BEGIN CERTIFICATE----- and -----END CERTIFICATE----- before and after the chain (5 hyphens are used before and after)

D) Save the Notepad++ document with a .cer extension.

E) Open the certificate file that you saved and navigate to Details > Thumbprint. You will see a hexadecimal string with a space after every 2 characters. Copy it into Notepad++ and insert a : after every 2 characters, so you end up with a thumbprint similar to: 13:1E:60:93:E4:E6:59:31:55:EB:74:51:67:2A:99:F8:3F:04:83:88 (a shell-only alternative to steps A through E is sketched at the end of this embedded-deployment section).

F) Next, we will export the current certificate using the below command
# /usr/lib/vmware-vmafd/bin/vecs-cli entry getcert --store MACHINE_SSL_CERT --alias __MACHINE_CERT --output /certificates/new_machine.crt

This will export the current certificate to the /certificates directory. 

G) Run the update thumbprint option using the below commands:
# cd /usr/lib/vmidentity/tools/scripts/
# python ls_update_certs.py --url https://FQDN_of_Platform_Services_Controller/lookupservice/sdk --fingerprint Old_Certificate_Fingerprint --certfile New_Certificate_Path_from_/Certificates --user Administrator@vsphere.local --password 'Password'

So a sample command would be:
# python ls_update_certs.py --url https://vcsa.vmware.local/lookupservice/sdk --fingerprint 13:1E:60:93:E4:E6:59:31:55:EB:74:51:67:2A:99:F8:3F:04:83:88 --certfile /certificates/new_machine.crt --user Administrator@vsphere.local --password 'Password'

This will basically look for all the services registered with the old thumbprint (the one ending in 04:83:88 in our example) and update them with the current thumbprint from new_machine.crt.

Note: once you paste the thumbprint into the SSH session, remove any extra spaces at its beginning and end. I have seen the update fail because of this, as the paste sometimes picks up special characters that you would not see in the terminal. So remove the space after --fingerprint and re-add it, and do the same before the --certfile switch.

Re-run the commands in steps 1 and 2 and the SSL trusts should now match. If they do, log back into the web client and you should be good to go.
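
As referenced in step (E), the Notepad++ work in steps (A) through (E) can also be done entirely from the appliance shell with fold and openssl. A minimal sketch, assuming you saved the single-line SSL trust from step 1 into /tmp/ssl_trust.txt (the file name and paths are just examples):

# Wrap the single-line SSL trust from step 1 into PEM format (steps A through D)
mkdir -p /certificates
{ echo "-----BEGIN CERTIFICATE-----"; fold -w 64 /tmp/ssl_trust.txt; echo "-----END CERTIFICATE-----"; } > /certificates/old_trust.cer

# Print the old thumbprint in the colon-separated form that ls_update_certs.py expects (step E)
openssl x509 -in /certificates/old_trust.cer -noout -fingerprint -sha1 | cut -d'=' -f2

The last command prints the colon-separated SHA1 thumbprint directly, which is the value the --fingerprint switch in step (G) expects.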

--------------------------------------------------------------------------------------------------------------------------
3. In this case, you will see two PSC entries in the output of the step 1 command. Both entries have the same URL, and they may or may not have the same SSL trust.

Case 1:
Multiple entries with different SSL trusts.

This is the easy one. In the output from step 1, you will see two entries with the same PSC URL but different SSL trusts. One of the SSL trusts will match the current certificate from step 2, which means the one that does not match is stale and can be removed from the lookup service registrations.

You can remove it from the CLI; however, I stick to the JXplorer tool to remove it from a GUI. You can connect to the PSC from JXplorer using the KB article here.

Once connected, navigate to Configuration > Sites > LookupService > Service Registrations. 
One of the fields in the output of the step 1 command is the service ID, which looks similar to:
Service ID: 04608398-1493-4482-881b-b35961bf5141

Locate the matching service ID under the service registrations and remove that entry.

Case 2:
Multiple entries with the same SSL trust.

In this case, the output from step 1 shows two entries with the same PSC URL and the same SSL trust, and these may or may not match the output from step 2.

This is how I fix it:

Note down both service IDs from the output of step 1. Connect with JXplorer as mentioned above. Select a service ID, switch to the Table Editor view on the right, and click Submit; this shows the last modified date of that service registration. The service ID with the older last-modified date is the stale registration and can be removed via JXplorer. Now, when you run the command from step 1 again, it should return a single entry. If its SSL trust matches the certificate from step 2, great! If not, the additional step of updating the thumbprint (steps A through G above) needs to be performed.

In the event of an external PSC deployment, say one PSC in the production site and one in the recovery site in Enhanced Linked Mode, the command from step 1 is expected to return two entries with two different URLs (the production and DR PSCs), since they replicate with each other. This of course changes when there are multiple PSCs replicating with or without a load balancer. That process is too complex to explain in text, so in that event it is best to involve VMware Support for assistance.

Hope this helps!

Tuesday, 15 May 2018

Bad Exit Code: 1 During Upgrade Of vSphere Replication To 8.1

With the release of vSphere Replication 8.1 comes a ton of new upgrade and deployment issues. One common issue is the Bad Exit Code: 1 error during the upgrade phase. This applies to upgrades from 6.1.2 or 6.5.x to 8.1.

The first thing you will notice in the GUI is the following error message.


If you Retry, the upgrade will still fail; if you Ignore, the upgrade proceeds but then fails during the configuration section.


Only after a "successful" failed upgrade can we access the logs to see what the issue is.

There is a log called hms-boot.log which records all of this information; it can be found under /opt/vmware/hms/logs.

Here, the first error was this:

----------------------------------------------------
# Upgrade Services
Stopping hms service ... OK
Stopping vcta service ... OK
Stopping hbr service ... OK
Downloading file [/opt/vmware/hms/conf/hms-configuration.xml] to [/opt/vmware/upgrade/oldvr] ...Failure during upgrade procedure at Upgrade Services phase: java.io.IOException: inputstream is closed

com.jcraft.jsch.JSchException: java.io.IOException: inputstream is closed
        at com.jcraft.jsch.ChannelSftp.start(ChannelSftp.java:315)
        at com.jcraft.jsch.Channel.connect(Channel.java:152)
        at com.jcraft.jsch.Channel.connect(Channel.java:145)
        at com.vmware.hms.apps.util.upgrade.SshUtil.getSftpChannel(SshUtil.java:66)
        at com.vmware.hms.apps.util.upgrade.SshUtil.downloadFile(SshUtil.java:88)
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.downloadConfigFiles(Vr81MigrationUpgradeWorkflow.java:578)
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.lambda$compileUpgrade$3(Vr81MigrationUpgradeWorkflow.java:1222)
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.run(Vr81MigrationUpgradeWorkflow.java:519)
        at com.vmware.jvsl.run.VlsiRunnable$1$1.run(VlsiRunnable.java:111)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.jvsl.run.VlsiRunnable$1.run(VlsiRunnable.java:104)
        at com.vmware.jvsl.run.ExecutorRunnable.withExecutor(ExecutorRunnable.java:17)
        at com.vmware.jvsl.run.VlsiRunnable.withClient(VlsiRunnable.java:98)
        at com.vmware.jvsl.run.VcRunnable.withVc(VcRunnable.java:139)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.launchMigrationUpgrade(Vr81MigrationUpgrade.java:62)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.access$100(Vr81MigrationUpgrade.java:21)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade$1.run(Vr81MigrationUpgrade.java:51)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.run(Vr81MigrationUpgrade.java:46)
        at com.vmware.hms.apps.util.App.run(App.java:89)
        at com.vmware.hms.apps.util.App$1.run(App.java:122)
        at com.vmware.jvsl.run.ExceptionHandlerRunnable$1.run(ExceptionHandlerRunnable.java:47)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.jvsl.run.ExceptionHandlerRunnable.withExceptionHandler(ExceptionHandlerRunnable.java:43)
        at com.vmware.hms.apps.util.App.main(App.java:118)
Caused by: java.io.IOException: inputstream is closed
        at com.jcraft.jsch.ChannelSftp.fill(ChannelSftp.java:2911)
        at com.jcraft.jsch.ChannelSftp.header(ChannelSftp.java:2935)
        at com.jcraft.jsch.ChannelSftp.start(ChannelSftp.java:262)
        ... 24 more

Then when I proceeded with an ignore, the error was this:

# Reconfigure VR
Failure during upgrade procedure at Reconfigure VR phase: null

java.lang.NullPointerException
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.vrReconfig(Vr81MigrationUpgradeWorkflow.java:1031)
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.lambda$compileUpgrade$5(Vr81MigrationUpgradeWorkflow.java:1253)
        at com.vmware.hms.apps.util.upgrade.Vr81MigrationUpgradeWorkflow.run(Vr81MigrationUpgradeWorkflow.java:519)
        at com.vmware.jvsl.run.VlsiRunnable$1$1.run(VlsiRunnable.java:111)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.jvsl.run.VlsiRunnable$1.run(VlsiRunnable.java:104)
        at com.vmware.jvsl.run.ExecutorRunnable.withExecutor(ExecutorRunnable.java:17)
        at com.vmware.jvsl.run.VlsiRunnable.withClient(VlsiRunnable.java:98)
        at com.vmware.jvsl.run.VcRunnable.withVc(VcRunnable.java:139)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.launchMigrationUpgrade(Vr81MigrationUpgrade.java:62)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.access$100(Vr81MigrationUpgrade.java:21)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade$1.run(Vr81MigrationUpgrade.java:51)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.hms.apps.util.Vr81MigrationUpgrade.run(Vr81MigrationUpgrade.java:46)
        at com.vmware.hms.apps.util.App.run(App.java:89)
        at com.vmware.hms.apps.util.App$1.run(App.java:122)
        at com.vmware.jvsl.run.ExceptionHandlerRunnable$1.run(ExceptionHandlerRunnable.java:47)
        at com.vmware.jvsl.run.CheckedRunnable.withoutChecked(CheckedRunnable.java:19)
        at com.vmware.jvsl.run.ExceptionHandlerRunnable.withExceptionHandler(ExceptionHandlerRunnable.java:43)
        at com.vmware.hms.apps.util.App.main(App.java:118)

When we proceeded with Ignore once again, the final stack was this:

Initialization error: Bad exit code: 1
Traceback (most recent call last):
  File "/opt/vmware/share/htdocs/service/hms/cgi/boot.py", line 178, in main
    __ROUTINES__[name]()
  File "/opt/vmware/share/htdocs/service/hms/cgi/boot.py", line 86, in func
    return fn(*args)
  File "/opt/vmware/share/htdocs/service/hms/cgi/boot.py", line 86, in func
    return fn(*args)
  File "/opt/vmware/share/htdocs/service/hms/cgi/boot.py", line 714, in get_default_sitename
    ovf.hms_cache_sitename()
  File "/opt/vmware/share/htdocs/service/hms/cgi/ovf.py", line 686, in hms_cache_sitename
    cache_f.write(hms_get_sitename(ext_key, jks, passwd, alias))
  File "/opt/vmware/share/htdocs/service/hms/cgi/ovf.py", line 679, in hms_get_sitename
    ext_key, jks, passwd, alias
  File "/opt/vmware/share/htdocs/service/hms/cgi/ovf.py", line 412, in get_sitename
    output = commands.execute(cmd, None, __HMS_HOME__)[0]
  File "/opt/vmware/share/htdocs/service/hms/cgi/commands.py", line 324, in execute
    raise Exception('Bad exit code: %d' % proc.returncode)
Exception: Bad exit code: 1

So it looks like there is an issue with copying files from the old vR server to the new one over SFTP. In the sshd_config file under /etc/ssh/ on the old vR server, the following entry was present:

Subsystem sftp /usr/lib64/ssh/sftp-server

Edit this line so that it reads:
Subsystem sftp /usr/lib/ssh/sftp-server
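
If you prefer making this change from the shell of the old vR appliance, here is a minimal sketch; it assumes the exact Subsystem line shown above and that /usr/lib/ssh/sftp-server is the path that actually exists on your appliance (use systemctl restart sshd instead if your appliance runs systemd):

# Check which sftp-server binary actually exists on the old appliance
ls -l /usr/lib/ssh/sftp-server /usr/lib64/ssh/sftp-server 2>/dev/null

# Back up sshd_config, point the Subsystem entry at the existing binary, and restart sshd
cp /etc/ssh/sshd_config /etc/ssh/sshd_config.bak
sed -i 's|Subsystem sftp /usr/lib64/ssh/sftp-server|Subsystem sftp /usr/lib/ssh/sftp-server|' /etc/ssh/sshd_config
service sshd restart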

Then retry the upgrade by deploying a fresh 8.1 appliance and going through the "upgrade" process again; this time it should complete successfully.

Hope this helps!

Upgrading vSphere Replication From 6.5 To 8.1

With the release of vSphere Replication 8.1, the upgrade path is not what it was earlier. The 8.1 vR appliance is now based on Photon OS, and the upgrade is similar to a vCenter migration: you deploy a new 8.1 vR appliance from the OVF template with a temporary IP and then follow a series of upgrade / migrate steps to transfer the data from the old vR server to the new one.

1. Proceed with the regular deployment of the vSphere Replication appliance: download the 8.1 ISO, mount it on a Windows server, and choose the support.vmdk, system.vmdk, certificate, manifest, and OVF files for deployment. A temporary IP is needed for the appliance to be on the network.

2. Once the deployment is done, power on the 8.1 appliance and open a VM console. During the boot you will be presented with the below options.


192.168.1.110 is my 6.5 vSphere Replication appliance, which was already registered to the vCenter Server. Select option 3 to proceed with the upgrade.

NOTE: For Bad Exit Code 1 error during upgrade, refer this article here.

3. Provide the root password of the old replication server to proceed.


4. The upgrade process begins to install the necessary RPMs. This might take about 10 minutes to complete.


5. You will then be prompted to enter the SSO user name and password of the vCenter Server this vR is registered to.


6. After a few configuration steps progress in the window, the upgrade completes and you are presented with the 8.1 banner page.


That should be it. Hope this helps!

Tuesday, 27 March 2018

Embedded Replication Server Disconnected In vSphere Replication 5.8

A vSphere Replication appliance comes with an embedded replication server to manage the replication traffic and vR queries, in addition to the option of deploying add-on replication servers. In 5.8 and older vSphere Replication releases, there are scenarios where this embedded replication server is displayed as disconnected. While the embedded server is disconnected, the replications sit in an RPO violation state because the replication traffic cannot be managed.

In the hbrsrv.log on the vSphere replication appliance, located under /var/log/vmware, we see the below:

repl:/var/log/vmware # grep -i "link-local" hbrsrv*

hbrsrv-402.log:2018-03-23T11:25:24.914Z [7F70AC62E720 info 'HostCreds' opID=hs-init-1d08f1ab] Ignoring link-local address for host-50: "fe80::be30:5bff:fed9:7c52"

hbrsrv.log:2018-03-23T11:25:24.914Z [7F70AC62E720 info 'HostCreds' opID=hs-init-1d08f1ab] Ignoring link-local address for host-50: "fe80::be30:5bff:fed9:7c52"

This is seen when the VMs being replicated are on an ESXi host that has an IPv6 link-local address enabled while the host is actually using IPv4 addressing.

The logs here refer to the host by its MoID, so you can find the host name from the vCenter MOB page, https://<vcenter-ip>/mob

To navigate to the host MoID section:

Content > group-d1 (Datacenters) > (Your datacenter) under childEntity > group-xx under hostFolder > domain-xx (Under childEntity) > locate the host ID

Then, using this host name, disable IPv6 on the referenced ESXi host (an esxcli alternative is sketched after this list):
> Select the ESXi 
> Select Configuration
> Select Networking
> Edit Settings for vmk0 (Management) port group
> IP Address, Un-check IPv6
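
Alternatively, IPv6 can be disabled host-wide from the ESXi shell with esxcli. This is only a sketch, so verify the command against your ESXi version first; a reboot is still required either way:

# Disable IPv6 on the host (takes effect only after a reboot)
esxcli network ip set --ipv6-enabled false
# Confirm the setting
esxcli network ip get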

Then reboot that ESXi host. Repeat the steps for the remaining hosts and finally reboot the vSphere Replication appliance.

Now there should no longer be any link-local logging in hbrsrv.log, and the embedded server should show as connected, allowing the RPO syncs to resume.

Hope this helps!

Friday, 9 February 2018

vSphere Replication Jobs Fail Due to NFC_NO_MEMORY

There might be instances where replications run into an Error state or RPO violation state with NFC errors. When you click the vCenter object in the web client and navigate to the Summary tab, you can view the list of issues; highlighting the vSphere Replication issues shows the NFC errors.

You will notice the below in the logs.
Note: The GID and other values will be different for each environment.

On the source ESXi host where the affected virtual machine runs, you will notice the below in vmkernel.log:

2018-02-09T12:07:02.728Z cpu2:3055234)Hbr: 2998: Command: INIT_SESSION: error result=Failed gen=-1: Error for (datastoreUUID: "4723769b-f34bce3e"), (diskId: "RDID-0aaaa0e1-66e1-447f-97f5-19072c00d01e"), (hostId: "host-575"), (pathname: "Test-VM/hbrdis$
2018-02-09T12:07:02.728Z cpu2:3055234)WARNING: Hbr: 3007: Command INIT_SESSION failed (result=Failed) (isFatal=FALSE) (Id=0) (GroupID=GID-e62e7093-bca9-4f51-9e87-75f17c80bdf6)
2018-02-09T12:07:02.728Z cpu2:3055234)WARNING: Hbr: 4570: Failed to establish connection to [10.254.2.37]:31031(groupID=GID-e62e7093-bca9-4f51-9e87-75f17c80bdf6): Failure

In the hbrsrv.log under /var/log/vmware you will notice:

2018-02-09T13:12:17.024+01:00 warning hbrsrv[7FF152B01700] [Originator@6876 sub=Libs] [NFC ERROR] NfcFssrvrClientOpen: received unexpected message 4 from server
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-525.
2018-02-09T13:12:17.024+01:00 verbose hbrsrv[7FF152B01700] [Originator@6876 sub=HostPicker] AffinityHostPicker forgetting host affinity for context '[] /vmfs/volumes/4723769b-f34bce3e/Test-VM2'
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main] HbrError for (datastoreUUID: "4723769b-f34bce3e"), (hostId: "host-525"), (pathname: "Test-VM2/Tes-VM2.vmdk"), (flags: retriable, pick-new-host) stack:
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [0] Class: NFC Code: 8
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [1] NFC error: NFC_SESSION_ERROR
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [2] Code set to: Host unable to process request.
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [3] Set error flag: retriable
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [4] Set error flag: pick-new-host
2018-02-09T13:12:17.024+01:00 info hbrsrv[7FF152B01700] [Originator@6876 sub=Main]    [5] Can't open remote disk /vmfs/volumes/4723769b-f34bce3e/Test-VM2/Test-VM2.vmdk

Now, you can run the below command to check if there is one affected host or multiple:
# grep -i "Destroying NFC connection" /var/log/vmware/hbrsrv.log | awk '{ $1="";print}' | sort -u

This will give you the list of host MoIDs, neatly sorted, something like this:

 info hbrsrv[7FF152A7F700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-433.
 info hbrsrv[7FF152A7F700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-660.
 info hbrsrv[7FF1531E6700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-433.
 info hbrsrv[7FF153227700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-352.
 info hbrsrv[7FF153227700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-390.
 info hbrsrv[7FF153227700] [Originator@6876 sub=StorageManager] Destroying NFC connection to host-487.
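
If you only want the host MoIDs themselves, a small variation of the same grep (a sketch) trims the output further:

# grep -i "Destroying NFC connection" /var/log/vmware/hbrsrv.log | grep -oE 'host-[0-9]+' | sort -u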

Then use each host ID to find the corresponding host name on the vCenter MOB page.

Then on that affected host, you will see this in the hostd.log

2018-02-09T12:17:21.339Z info hostd[4D4C1B70] [Originator@6876 sub=Libs] NfcServerProcessClientMsg: Authenticity of the NFC client verified.
2018-02-09T12:17:21.399Z info hostd[4B040B70] [Originator@6876 sub=Nfcsvc] PROXY connection to NFC(useSSL=0): found session ticket:[N9VimShared15NfcSystemTicketE:0x4c224d24]
2018-02-09T12:17:21.399Z info hostd[4D4C1B70] [Originator@6876 sub=Nfcsvc] Successfully initialized nfc callback for a  write to the socket to be invoked on a separate thread
2018-02-09T12:17:21.399Z info hostd[4D4C1B70] [Originator@6876 sub=Nfcsvc] Plugin started
2018-02-09T12:17:21.399Z info hostd[4D4C1B70] [Originator@6876 sub=Libs] NfcServerProcessClientMsg: Authenticity of the NFC client verified.
2018-02-09T12:17:21.448Z warning hostd[4D4C1B70] [Originator@6876 sub=Libs] [NFC ERROR] NfcCheckAndReserveMem: Cannot allocate any more memory as NFC is already using 50331560 and allocating 119 will make it more than the maximum allocated: 50331648. Please close some sessions and try again
2018-02-09T12:17:21.448Z warning hostd[4D4C1B70] [Originator@6876 sub=Libs] [NFC ERROR] NfcProcessStreamMsg: fssrvr failed with NFC error code = 5
2018-02-09T12:17:21.448Z error hostd[4D4C1B70] [Originator@6876 sub=Nfcsvc] Read error from the nfcLib: NFC_NO_MEMORY (done=yep)

To fix this, you will need to increase the hostd NFC memory on the target affected ESX host.

1. SSH to the host and navigate to the below location:
# /etc/vmware/hostd/config.xml

You will want the following snippet (back up the file before editing):

<nfcsvc>
    <path>libnfcsvc.so</path>
    <enabled>true</enabled>
    <maxMemory>50331648</maxMemory>
    <maxStreamMemory>10485760</maxStreamMemory>
</nfcsvc>

Here, change the value of maxMemory to 62914560 (60 MB, up from the default of 48 MB).

So after edit:

<nfcsvc>
    <path>libnfcsvc.so</path>
    <enabled>true</enabled>
    <maxMemory>62914560</maxMemory>
    <maxStreamMemory>10485760</maxStreamMemory>
</nfcsvc>

2. Restart the hostd service using:
# /etc/init.d/hostd restart
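
If you would rather script steps 1 and 2, here is a minimal sketch; it assumes the default maxMemory value of 50331648 shown above is still present in config.xml:

# Back up config.xml, bump maxMemory to 60 MB, verify the change, and restart hostd
cp /etc/vmware/hostd/config.xml /etc/vmware/hostd/config.xml.bak
sed -i 's|<maxMemory>50331648</maxMemory>|<maxMemory>62914560</maxMemory>|' /etc/vmware/hostd/config.xml
grep maxMemory /etc/vmware/hostd/config.xml
/etc/init.d/hostd restart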

3. Then initiate a force sync on the replication and it should resume successfully.

Hope this helps!

Monday, 15 January 2018

vSphere Replication 6.5.1 With vRealize Orchestrator 7.3

Here we will be looking into how to configure and use vSphere replication with vRealize Orchestrator. The version of my setup is:

vCenter Appliance 6.5 U1
vSphere Replication 6.5.1
vRealize Orchestrator 7.3

In brief, deploy the vRealize Orchestrator OVA template. Then navigate to "https://<vro-fqdn>:8283/vco-controlcenter/" to begin the configuration.

I have a standalone Orchestrator deployment with vSphere Authentication mode.


SSO user name and password is required to complete the registration. A restart of vRO would be needed to complete the configuration.


Next, download the vSphere Replication vmoapp file from this link here.

To install this file, click on the Manage Plugins tab in the Orchestrator control center and browse for the downloaded vmoapp file.


Then accept the EULA to Install the Plugin.


If prompted, click Save Changes and this should show the vR plugin is available and enabled in the plugin page.


Next, register the vCenter Site for the replication server using the below "Register VC Site" Workflow. All the next tasks are done from the Orchestrator client.


Once done, you can verify the vSphere Replication site is now visible under Administer mode of vRO.


Next, we will configure replication for one virtual machine. With the Run mode execute the "Configure Replication" workflow.

The Site (source) will be selected first.


Selecting virtual machine will be the next task.


Target site vR selection will be next. I am replicating within the same vCenter, so the source and target vR site is the same machine.


Next, we will select the target datastore where the replicated files should reside.


Lastly, we will choose the RPO and other required parameters to complete the replication task and click Submit.


Finally, you can see the VM under Outgoing Replication tab for vCenter.


That's pretty much it!

Thursday, 11 January 2018

vSphere Replication Sync Fails With Exception: com.vmware.hms.replication.sync.DeltaAbortedException

There are a few instances where a vSphere Replication sync (RPO-based or manual) fails with a DeltaAbortedException. This in turn also affects a test or planned migration when performed with Site Recovery Manager.

In the hms.log located under /opt/vmware/hms/logs on the vSphere Replication Server, you will notice something like:

2018-01-10 14:35:59.950 ERROR com.vmware.hms.replication.sync.ReplicationSyncManager [hms-sync-progress-thread-0] (..replication.sync.ReplicationSyncManager) operationID=fd66efca-f070-429c-bc89-f2164e9dbb7a-HMS-23613 | Completing sync operation because of error: {OnlineSyncOperation, OpId=fd66efca-f070-429c-bc89-f2164e9dbb7a-HMS-23613, GroupMoId=GID-2319507d-e668-4eea-aea9-4d7d241dd886, ExpInstSeqNr=48694, TaskMoId=HTID-56fd57dd-408b-4861-a124-70d8c53a1194, InstanceId=2f900595-2822-4f2b-987d-4361f7035
05c, OpState=started, VcVmMoid=vm-28686, createInstanceRetryCount=2, fullSyncOngoing=false, operationId=null}
com.vmware.hms.replication.sync.DeltaAbortedException
        at com.vmware.hms.replication.sync.SyncOperation.checkHealth(SyncOperation.java:911)
        at com.vmware.hms.replication.sync.SyncOperation$4.run(SyncOperation.java:735)
        at com.vmware.hms.util.executor.LoggerOpIdConfigurator$RunnableWithDiagnosticContext.run(LoggerOpIdConfigurator.java:133)
        at com.vmware.hms.util.executor.LoggerOpIdConfigurator$2.run(LoggerOpIdConfigurator.java:100)
        at com.vmware.jvsl.sessions.net.impl.TlsPreservingWrapper$2.run(TlsPreservingWrapper.java:47)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

This occurs when the outgoingeventlogentity and incomingeventlogentity tables in the vR database contain a large number of entries.

The following fix should be applied at your own risk. Take a snapshot and/or a backup of the vR server before performing the change.

1. Navigate to the VRMS database's bin directory:
# cd /opt/vmware/vpostgres/9.x/bin
The postgres version varies depending on the replication server release.

2. Backup the replication database using the below command:
# ./pg_dump -U vrmsdb -Fp -c > /tmp/DBBackup.bak

3. Connect to the vR database using:
# ./psql -U vrmsdb

4. Run the below queries to extract the number of events for the logentity tables:

select count(*) from outgoingeventlogentity; 
select count(*) from incomingeventlogentity;

In my case, the output on the production site vR was:

vrmsdb=# select count(*) from incomingeventlogentity;
 count
-------
 21099
(1 row)

vrmsdb=# select count(*) from outgoingeventlogentity;
 count
-------
   146
(1 row)

And on the recovery site, the outgoingeventlogentity table had 21k+ events.

5. First, you can change the maximum event age to 10800 in the hms-configuration.xml file located at:
# /opt/vmware/hms/conf/hms-configuration.xml

This should be the entry after the edit:
<hms-eventlog-maxage>10800</hms-eventlog-maxage>

6. Next, we have to purge the event logs from the above-mentioned tables. There are a lot of fields in these tables if you run select * from <table-name>;
The one column we need is the "timestamp" column.

The timestamp column holds values like 1515479242006, which is epoch time in milliseconds. To convert this to a human-readable date, you have to:

> Remove the last 3 digits (the milliseconds) from the value.
So 1515479242006 becomes 1515479242. Then convert this epoch time to a normal date using the link here.
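
The same conversion can also be done on the appliance itself with the date command (a quick sketch using the example value after the last three digits are dropped):

# date -u -d @1515479242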

Now, pick a timestamp such that anything older than it will be purged from the database. During the purge, the timestamp has to be the complete millisecond value as it appears in the timestamp column. Then run the below queries:

DELETE from incomingeventlogentity WHERE timestamp < 1515479242006;
DELETE from outgoingeventlogentity WHERE timestamp < 1515479242006;
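
Before running the deletes above, you can sanity-check how many rows the chosen cutoff will remove, either inside the interactive psql session from step 3 or non-interactively from the bin directory (the cutoff value here is just the example from above):

# ./psql -U vrmsdb -c "select count(*) from incomingeventlogentity where timestamp < 1515479242006;"
# ./psql -U vrmsdb -c "select count(*) from outgoingeventlogentity where timestamp < 1515479242006;"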

7. Then restart the hms service using:
# systemctl stop hms
# systemctl start hms

The above is applicable from 6.1.2 vR onward. For lower versions:
# service hms restart

8. Re-pair the sites and then perform a sync now operation and we should be good to go. 

Hope this helps!

Monday, 28 August 2017

Bash Script To Extract vSphere Replication Job Information

Below is a bash script that extracts information about the configured replications. It displays the name of each virtual machine, whether network compression and Guest OS quiescing are enabled, the RPO in minutes (since "bc" is not available on the vR SUSE appliance to do floating-point hour calculations), and the datastore MoRef ID.

The complete updated script can be accessed from my GitHub Repo:
https://github.com/happycow92/shellscripts/blob/master/vR-jobs.sh

As and when I add more or reformat the information the script in the link will be updated.

#!/bin/bash
clear
echo -e " -----------------------------------------------------------------------------------------------------------"
echo -e "| Virtual Machine | Network Compression | Quiesce | RPO | Datastore MoRef ID |"
echo -e " -----------------------------------------------------------------------------------------------------------"
# Dump the relevant columns from the VRMS database into a temporary file
cd /opt/vmware/vpostgres/9.3/bin
./psql -U vrmsdb << EOF
\o /tmp/info.txt
select name from groupentity;
select networkcompressionenabled from groupentity;
select rpo from groupentity;
select quiesceguestenabled from groupentity;
select configfilesdatastoremoid from virtualmachineentity;
EOF
cd /tmp
# Each awk call grabs the rows between a query's column header and its "(N rows)" footer in the psql output
name_array=($(awk '/name/{i=1;next}/^\([0-9]+ row/{i=0}{if (i==1){i++;next}}i' info.txt))
compression_array=($(awk '/networkcompressionenabled/{i=1;next}/^\([0-9]+ row/{i=0}{if (i==1){i++;next}}i' info.txt))
quiesce_array=($(awk '/quiesceguestenabled/{i=1;next}/^\([0-9]+ row/{i=0}{if (i==1){i++;next}}i' info.txt))
rpo_array=($(awk '/rpo/{i=1;next}/^\([0-9]+ row/{i=0}{if (i==1){i++;next}}i' info.txt))
datastore_array=($(awk '/configfilesdatastoremoid/{i=1;next}/^\([0-9]+ row/{i=0}{if (i==1){i++;next}}i' info.txt))
length=${#name_array[@]}
# Print one table row per replicated VM, matching the column order of the header above
for ((i=0;i<$length;i++));
do
printf "| %-32s | %-23s | %-10s | %-10s| %-20s|\n" "${name_array[$i]}" "${compression_array[$i]}" "${quiesce_array[$i]}" "${rpo_array[$i]}" "${datastore_array[$i]}"
done
rm -f info.txt
echo && echo

For any questions, do let me know. Hope this helps. Thanks.

Wednesday, 14 June 2017

Automating vSphere Replication Deployment Using Bash Script

If you ever run into issues with the vSphere Replication deployment in 6.5 or any other version, you can use OVF Tool to perform the deployment instead. OVF Tool might look a little complex for this, as the commands are quite long.

For the Windows version, refer to the below link:
http://www.virtuallypeculiar.com/2017/06/deploying-vsphere-replication-using-ovf.html

If you have a Linux-based environment, you can use this shell script I wrote to automate the process (user intervention is still needed to enter environment details, duh!).

To download the script and the ReadMe (read it before running anything), refer to:
https://github.com/happycow92/vsphere-replication-deploy/

I will be making a few changes to this in the coming days, but the base functionality will remain more or less the same.

ReadMe along with Change Log for 2017-06-16

1. Download the script
2. Place it in the /root directory of the Linux machine
3. Provide execute permissions for the script
chmod a+x vr_deploy.sh
4. Mount the VR ISO to the Linux VM using vSphere / Web Client
5. Execute the script
./vr_deploy.sh

The script tests for network connectivity.
If successful, it downloads version 4.2 of OVF Tool, installs it, and then prompts the user for the environment details.

If unsuccessful, the script exits; OVF Tool then has to be installed manually on the Linux machine, or you can use the Windows method to deploy.

Changelog:
If you do not have OVF Tool on the Linux machine and you also do not have internet access from it, download OVF Tool 4.2 manually from the VMware website (the .bundle file for Linux).
Put this file in the /root directory of the Linux machine.
Install OVF Tool using:
# sudo /bin/sh VMware-ovftool-4.2.0-4586971-lin.x86_64.bundle

Once it is installed, run the script. The script checks whether OVF Tool is present and, if so, continues.
This was added because some Linux boxes will not have internet access.
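
For reference, a manual OVF Tool invocation for the vR appliance looks roughly like the sketch below. Treat it as an outline only: the --prop keys, file names, datastore, network, and the vi:// inventory path are placeholders that vary by vR version and environment, so probe the OVF first to list the exact property names, and see the Windows post linked above (or the script itself) for the full set of options used:

# Probe the OVF first to list its networks and --prop keys (the property names below are placeholders)
ovftool vSphere_Replication_OVF10.ovf

# Deploy to a vCenter target; ovftool prompts for the vCenter credentials
ovftool --acceptAllEulas --powerOn \
  --name=vSphere_Replication \
  --datastore=Datastore1 \
  --network="VM Network" \
  --prop:vami.ip0.vSphere_Replication_Appliance=192.168.1.50 \
  --prop:varoot-password='VMware123!' \
  vSphere_Replication_OVF10.ovf \
  'vi://vcenter.vmware.local/Datacenter/host/Cluster/'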

Hope this helps.