
Sunday, 8 July 2018

Unable To Pair SRM Sites: "Server certificate chain not verified"

So first things first: as you might know, as of this post I have moved on from VMware and ventured further into the backup and recovery solutions domain. Currently, I work as a solutions engineer at Rubrik.

There are a lot of instances where you are unable to manage anything in Site Recovery Manager, regardless of the version (this also applies to vSphere Replication), and the common error that pops up at the bottom right of the web client is "Server certificate chain not verified":

Failed to connect to vCenter Server at vCenter_FQDN:443/sdk. Reason:
com.vmware.vim.vmomi.core.exception.CertificateValidationException: Server certificate chain not verified.

This article covers only the embedded Platform Services Controller deployment model; the same logic can be extrapolated to external deployments.

These issues are typically seen when:
> PSC is migrated from embedded to external
> Certificates are replaced for the vCenter

I will simplify the relevant KB article here for reference. Before proceeding, take powered-off snapshots of the PSC and vCenter Servers involved.

So, for embedded deployment of VCSA:

1. SSH into the VCSA and run the below command:
# /usr/lib/vmidentity/tools/scripts/lstool.py list --url https://localhost/lookupservice/sdk --no-check-cert --ep-type com.vmware.cis.cs.identity.sso 2>/dev/null

This command returns the SSL trust currently stored on your PSC. Now, consider that you are using an embedded PSC deployment in production and another embedded deployment in DR (no Enhanced Linked Mode). In this case, when you run the above command, you should see a single entry whose URL section is the FQDN of your current PSC node, with its current SSL trust associated with it.

URL: https://current-psc.vmware.local/sts/STSService/vsphere.local
SSL trust: MIIDWDCCAkCgAwIBAgIJANr+++MJ5+WxMA0GCSqGSIb3DQEBCwUAMGgxCzAJBgNV
...Reducing output...
LqSKWg/apP1XlBV1VyC5LPZrH/rrq8+Naoj7i/P6HAzTwAAL+O10ggClaP8=

If this is your case, proceed to step (2), if not jump to step (3)
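Either way, if the lstool.py output is long, a filtered view makes it easier to see how many registrations exist, along with their URLs and Service IDs. This is simply the same command piped through grep (note that grep shows only the first line of each SSL trust, which is enough for counting entries):

# /usr/lib/vmidentity/tools/scripts/lstool.py list --url https://localhost/lookupservice/sdk --no-check-cert --ep-type com.vmware.cis.cs.identity.sso 2>/dev/null | grep -E "URL:|SSL trust:|Service ID:"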

2. Run the next command:
# echo | openssl s_client -connect localhost:443

This is the SSL trust actually in use by your deployment after the certificate replacement. Look at the extract that shows the server certificate chain.

Server certificate
-----BEGIN CERTIFICATE-----
MIIDWDCCyAHikleBAgIJANr+++MJ5+WxMA0GCSqGSIb3DQEBCwUAMGgxCzAJBgNV
...
LqSKWg/apP1XlBV1VyC5LPZrH/rrq8+Naoj7i/P6HAzTwAAL+O10fGhhDDqm=
-----END CERTIFICATE-----

So here, the chain obtained from the first command (the SSL trust stored in the PSC) does not match the chain from the second command (the SSL trust actually in use). Because of this mismatch, you see the "chain not verified" message in the UI.
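If you prefer not to compare the base64 blobs by eye, a quick way to see the thumbprint of the certificate the endpoint is actually presenting (the step 2 side of the comparison) is to pipe the s_client output straight into openssl x509. Both tools are already on the appliance, so this is only a convenience step:

# echo | openssl s_client -connect localhost:443 2>/dev/null | openssl x509 -noout -fingerprint -sha1

This prints something like SHA1 Fingerprint=13:1E:60:..., which you can later compare with the thumbprint computed from the stored SSL trust in the steps below.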

To fix this, the logic is: find all the services using the thumbprint of the old SSL trust (step 1) and update them with the thumbprint from step 2. The steps in the KB article are a bit confusing, so this is what I follow to fix it.

A) Copy the SSL trust obtained from the first command into Notepad++; everything starting from MIID... in my case (do not include the "SSL trust:" label).

B) The chain should be wrapped at 65 characters per line. In Notepad++, place the cursor after a character and check the Col indicator at the bottom; press Enter at the point where it reads Col: 65.
Format the complete chain this way (the last line may have fewer than 65 characters, which is fine).

C) Add -----BEGIN CERTIFICATE----- before the chain and -----END CERTIFICATE----- after it (five hyphens on each side).

D) Save the Notepad++ document with a .cer extension.

E) Open the certificate file that you saved and navigate to Details > Thumbprint. You will see a string of hexadecimal characters with a space after every 2 characters. Copy this into Notepad++ and insert a : after every 2 characters, so you end up with a thumbprint similar to: 13:1E:60:93:E4:E6:59:31:55:EB:74:51:67:2A:99:F8:3F:04:83:88
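Alternatively, if you prefer to do steps B through E on the appliance itself instead of Notepad++, the same result can be produced with fold and openssl. This is just a sketch; /tmp/old_trust_b64.txt is assumed to hold the raw base64 blob you copied from the SSL trust field in step 1:

# { echo "-----BEGIN CERTIFICATE-----"; fold -w 65 /tmp/old_trust_b64.txt; echo "-----END CERTIFICATE-----"; } > /tmp/old_trust.cer
# openssl x509 -in /tmp/old_trust.cer -noout -fingerprint -sha1

The second command prints the SHA1 thumbprint already colon-separated, in the same form used by ls_update_certs.py further down.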

F) Next, we will export the current certificate using the below command
# /usr/lib/vmware-vmafd/bin/vecs-cli entry getcert --store MACHINE_SSL_CERT --alias __MACHINE_CERT --output /certificates/new_machine.crt

This will export the current certificate to the /certificates directory. 
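If you want to sanity-check the export, you can print the thumbprint of the exported file; it should match the certificate currently served on port 443 (step 2), not the stale one computed in step E:

# openssl x509 -in /certificates/new_machine.crt -noout -fingerprint -sha1

Note that the /certificates directory is assumed to exist already; if it does not, create it first (mkdir /certificates) or point --output at a directory that does.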

G) Run the update thumbprint option using the below command:
# cd /usr/lib/vmidentity/tools/scripts/
# python ls_update_certs.py --url https://FQDN_of_Platform_Services_Controller/lookupservice/sdk --fingerprint Old_Certificate_Fingerprint --certfile New_Certificate_Path_from_/Certificates --user Administrator@vsphere.local --password 'Password'

So a sample command would be:
# python ls_update_certs.py --url https://vcsa.vmware.local/lookupservice/sdk --fingerprint 13:1E:60:93:E4:E6:59:31:55:EB:74:51:67:2A:99:F8:3F:04:83:88 --certfile /certificates/new_machine.crt --user Administrator@vsphere.local --password 'Password'

This will basically look for all the services using the old thumbprint (ending in 04:83:88) and update them with the new thumbprint from new_machine.crt.

Note: After pasting the thumbprint into the SSH session, remove any extra spaces at the beginning and end of the thumbprint. I have seen the update fail because of this, as in some cases a hidden special character is picked up (one you would not see in the terminal). So delete the space after the --fingerprint value and re-add it, and do the same before the --certfile switch.

Re-run the commands from steps 1 and 2 and the SSL trusts should now match. If they do, log back in to the web client and you should be good to go.

--------------------------------------------------------------------------------------------------------------------------
3. In this case, you might see two PSC entries in the output of the step 1 command. Both entries have the same PSC URL, and they may or may not have the same SSL trust.

Case 1:
Multiple entries with the same URL but different SSL trusts.

This is the easy one. In the output from step 1, you will see two entries with the same PSC URL but different SSL trusts, and one of these SSL trusts will match the current certificate from step 2. The one that does not match is the stale entry and can be removed from the STS store.

You can remove it from the CLI; however, I stick to using the JXplorer tool to remove it from the GUI. You can connect to the PSC from JXplorer using the KB article here.

Once connected, navigate to Configuration > Sites > LookupService > Service Registrations. 
One of the fields in the output of the step 1 command is the Service ID, which looks something like:
Service ID: 04608398-1493-4482-881b-b35961bf5141

Locate the matching service ID among the service registrations and you should be good to remove it.

Case 2:
Multiple entries with the same URL and the same SSL trust.

In this case, the output from step 1 shows two identical PSC URLs with the same SSL trust. These may or may not match the output from step 2.

The first step of this fix is:

Note down both service IDs from the output of step 1. Connect with JXplorer as mentioned above. Select a service ID, switch to the Table Editor view on the right, and click Submit; you can then see the last modified date of that service registration. The registration with the older last-modified date is the stale one and can be removed via JXplorer. Now, when you run the command from step 1 again, it should return a single entry. If its SSL trust matches the thumbprint from step 2, great! If not, perform the additional step of updating the thumbprint as described above.

In the event of an external PSC deployment, say one PSC in the production site and one in the recovery site in ELM, the command from step 1 is expected to return two entries with two different URLs (the production and DR PSCs), since they replicate with each other. This of course changes if there are multiple PSCs replicating with or without a load balancer. That process is too complex to explain in text, so in that case it is best to involve VMware Support for assistance.

Hope this helps!

Friday, 20 April 2018

Upgrading vCenter Appliance From 6.5 to 6.7

So as you know, vSphere 6.7 is now GA, and this article will walk through upgrading an embedded PSC deployment of the 6.5 vCenter appliance to 6.7. Once you download the 6.7 VCSA ISO installer,
mount the ISO on a local Windows machine and use the UI installer for Windows to begin the upgrade.

You will be presented with the below choices:


We will be going with the Upgrade option. The upgrade follows the familiar path: the process deploys a new 6.7 VCSA, performs a data and configuration migration from the older 6.5 appliance, and then powers down the old appliance once the upgrade is successful.


Accept the EULA to proceed further.


In the next step we connect to the source appliance, so provide the IP/FQDN of the source 6.5 vCenter Server.


Once Connect To Source goes through, you will be asked to enter the SSO details and the details of the ESXi host where the 6.5 VCSA is running.


The next step is to provide information about the target appliance, the 6.7 appliance. Select the ESXi host where the target appliance should be deployed.


Then provide the inventory display name for the target vCenter 6.7 along with a root password.


Select the appliance deployment size for the target server. Make sure this matches or is larger than the source 6.5 appliance.


Then select the datastore where the target appliance should reside.


Next, we will provide a set of temporary network details for the 6.7 appliance. The appliance will inherit the old 6.5 network configuration after a successful migration.


Review the details and click Finish to begin the Stage 1 deployment process.


Once Stage 1 is done, you can click Continue to proceed with Stage 2.



In Stage 2 we perform a data copy from the source vCenter appliance to the target deployed in Stage 1.


Provide the details to connect to the source vCenter Server.


Select the type of data to be copied over to the destination vCenter server. In my case, I just want to migrate the configuration data.


Join the CEIP and proceed further


Review the details and Finish to begin the data copy.


The source vCenter will be shut down after the data copy.


The data migration takes a while to complete and runs in three stages.


And that's it. If all goes well, the migration is complete and you can access your new vCenter from the URL.

Hope this helps.

Saturday, 19 November 2016

VDP 6.1.3 First Look

So over the last week, multiple products from VMware have gone live, and the most awaited one was vSphere 6.5.
With vSphere 6.5, the add-on VMware products have to be on their compatible versions. This post is specifically dedicated to vSphere Data Protection 6.1.3. I will try to keep it short and mention the changes I have seen after deploying this in my lab. There are already release notes for this, which cover most of the fixed and known issues in the 6.1.3 release.

This article covers only the issues that are not included in the release notes.

Issue 1: 

While using the internal proxy for this appliance, the vSphere Data Protection configure page comes up blank for the name of the proxy, the ESXi host where the proxy is deployed, and the datastore where the proxy resides.

The vdr-configure log does not have any information on this, and a reconfigure or a disable and re-enable of the internal proxy does not fix it.



Workaround:
A reboot of the appliance populates the right information back in the configure page.

Issue 2:

This is an intermittent issue, seen only during fresh deployments of the 6.1.3 appliance: the vSphere Data Protection plugin is not available in the web client. The VDP plugin version for this release is com.vmware.vdp2-6.1.3.70, and this folder is not created under the vsphere-client-serenity folder on the vCenter Server. The fix is similar to the one in the link here.

Issue 3:

vDS Port-Groups are not listed during deployment of external proxies. The drop-down shows only standard switch port-groups.

Workaround:
Deploy the external proxy on a standard switch and then migrate it to the distributed switch. The standard switch port group you create does not need any uplinks, as the proxy will be migrated to the vDS soon after the deployment completes.

Issue 4:

A VM that is added to a backup job shows up as unprotected after the VM is renamed. VDP does not automatically sync its naming with the vCenter inventory.

Workaround:
Force sync the vCenter - Avamar Names using the proxycp.jar utility. The steps can be found in this article here.

Issue 5:

Viewing logs for a failed backup from the Job Failure tab does not return anything. The below message is seen while trying to view the logs:


This was seen in all 6.x versions even when none of the mentioned reasons are true. EMC has acknowledged this bug; however, there is no fix for it currently.

Workaround:
View the logs from the Task Failure tab or the command line, or gather logs from the vdp-configure page.

Issue 6:
Backups fail when the environment is running ESXi 5.1 with VDP 6.1.3.
The GUI error is somewhat similar to: Failed To Attach Disk

The jobs fail with:

2016-11-29T16:03:04.600-02:00 avvcbimage Info <16041>: VDDK:2016-11-29T16:03:04.600+01:00 error -[7FF94BA98700] [Originator@6876 sub=Default] No path to device LVID:56bc8e94-884c9a71-4dee-f01fafd0a2c8/56bc8e94-61c7c985-1e2d-f01fafd0a2c8/1 found.
2016-11-29T16:03:04.600-02:00 avvcbimage Info <16041>: VDDK:2016-11-29T16:03:04.600+01:00 error -[7FF94BA98700] [Originator@6876 sub=Default] Failed to open new LUN LVID:56bc8e94-884c9a71-4dee-f01fafd0a2c8/56bc8e94-61c7c985-1e2d-f01fafd0a2c8/1.
2016-11-29T16:03:04.600-02:00 avvcbimage Info <16041>: VDDK:-->
2016-11-29T16:03:05.096-02:00 avvcbimage Info <16041>: VDDK:VixDiskLib: Unable to locate appropriate transport mode to open disk. Error 13 (You do not have access rights to this file) at 4843.

There is no fix for this. I suggest upgrading your ESXi hosts to 5.5, as there are compatibility issues with 5.1.


Fixed Issue: Controlling concurrent backup jobs. (Not contained in Release Notes)

In VDP releases after 6.0.x and prior to 6.1.3, the vdp-configure page provided an option to control how many backup jobs should run at a time; the throttling was set under the "Manage Proxy Throughput" section. However, this never worked for most deployments.

This is fixed in 6.1.3

The test:

Created a backup job with 5 VMs.
Set the throughput to 1: 5 iterations of backup were executed - Check Passed
Set the throughput to 2: 3 iterations of backup were executed - Check Passed

Set 2 external proxies and throughput to 1.
The effective throughput would be 2 x 1 = 2.
3 iterations of backups were executed - Check Passed.
(In each case the iteration count works out to the number of VMs divided by the effective concurrency, rounded up.)

Re-registered the appliance. Same test - Check Passed
vMotioned the appliance. Same test - Check Passed
Rebooted the VDP. Same test - Check Passed


I will update this article as and when I come across new issues or fixes that are NOT included in the vSphere Data Protection 6.1.3 release notes.

Thursday, 17 November 2016

vSphere 6.5: Installing vCenter Server Appliance Via Command Line

Along with the GUI method of deploying the vCenter appliance, there is a command line path as well, which I would say is quite fun and easy to follow. A set of pre-defined templates is available on the vCenter 6.5 appliance ISO. Download and mount the VCSA 6.5 ISO and browse to

CD Drive:\vcsa-cli-installer\templates\install

You will see the following list of templates:

You can choose the required template for your deployment from here. I will be going with an embedded VCSA deployed on an ESXi host, so my template will be embedded_vCSA_on_ESXi.json

Open the required template in a text editor. The template has a list of details that you need to fill in; in my case these were the ESXi host details, appliance details, networking details for the appliance, and Single Sign-On details. The file looks similar to the image below after the edit:


Save the file with a .json extension to your desktop. Next, you will call this file using the vcsa-deploy executable.
On the CD drive, browse to vcsa-cli-installer\win32, lin64, or mac depending on the OS you are running this from, and run the below command (from PowerShell on Windows):

vcsa-deploy.exe install --no-esx-ssl-verify --accept-eula --acknowledge-ceip "Path to the json file"
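For reference, the equivalent invocation from a Linux machine uses the vcsa-deploy binary under the lin64 folder (or mac for Mac OS); the mount point and JSON path below are just placeholders for your own:

cd /mnt/vcsa-cli-installer/lin64
./vcsa-deploy install --no-esx-ssl-verify --accept-eula --acknowledge-ceip /home/user/embedded_vCSA_on_ESXi.json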

If there are errors in the edited file, the installer will display what the error is and which line of the file contains it. The first step is template verification, and if the template verification completes successfully, you should see the below message:


The next step, which starts automatically, is the appliance deployment; you will see the below task in progress:


Once the appliance is deployed, the final stage is configuring the services. At this point you will see the below task in progress:


That's pretty much it; you can go ahead and log in to the web client (Flash / HTML5).

Hope this helps!

Wednesday, 16 November 2016

vSphere 6.5: Login To Web Client Fails With Invalid Credentials

So, today I was making certain changes to my password policies on vSphere 6.5 and I ran into an interesting issue. I had created a user in the SSO domain (vmware.local) called happycow@vmware.local, and I tried to log in to the web client with this user. However, the login failed with the error: Invalid Credentials.


In the vmware-sts-idmd.log located at C:\ProgramData\VMware\vCenterServer\logs\sso, the following was noticed:

[2016-11-16T12:51:22.541-08:00 vmware.local         6772f8c3-7a11-479e-a224-e03175cc1b1a ERROR] [IdentityManager] Failed to authenticate principal [happycow@vmware.local]. User password expired. 
[2016-11-16T12:51:22.542-08:00 vmware.local         6772f8c3-7a11-479e-a224-e03175cc1b1a INFO ] [IdentityManager] Authentication failed for user [happycow@vmware.local] in tenant [vmware.local] in [115] milliseconds with provider [vmware.local] of type [com.vmware.identity.idm.server.provider.vmwdirectory.VMwareDirectoryProvider] 
[2016-11-16T12:51:22.542-08:00 vmware.local         6772f8c3-7a11-479e-a224-e03175cc1b1a ERROR] [ServerUtils] Exception 'com.vmware.identity.idm.PasswordExpiredException: User account expired: {Name: happycow, Domain: vmware.local}' 
com.vmware.identity.idm.PasswordExpiredException: User account expired: {Name: happycow, Domain: vmware.local}
at com.vmware.identity.idm.server.provider.vmwdirectory.VMwareDirectoryProvider.checkUserAccountFlags(VMwareDirectoryProvider.java:1378) ~[vmware-identity-idm-server.jar:?]
at com.vmware.identity.idm.server.IdentityManager.authenticate(IdentityManager.java:3042) ~[vmware-identity-idm-server.jar:?]
at com.vmware.identity.idm.server.IdentityManager.authenticate(IdentityManager.java:9805) ~[vmware-identity-idm-server.jar:?]
at sun.reflect.GeneratedMethodAccessor29.invoke(Unknown Source) ~[?:?]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_77]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_77]
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:323) ~[?:1.8.0_77]
at sun.rmi.transport.Transport$1.run(Transport.java:200) ~[?:1.8.0_77]
at sun.rmi.transport.Transport$1.run(Transport.java:197) ~[?:1.8.0_77]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_77]
at sun.rmi.transport.Transport.serviceCall(Transport.java:196) ~[?:1.8.0_77]
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:568) ~[?:1.8.0_77]
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:826) ~[?:1.8.0_77]
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:683) ~[?:1.8.0_77]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_77]
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:682) [?:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_77]

The account and its password were coming up as expired. I was able to log in to the Web Client with the default SSO administrator account without issues.

This issue occurs when the SSO password expiration lifetime has a larger value than the maximum value permitted.

Under Administration > Configuration > Policies, the password expiration was set to 36500 days. KB 2125495, which describes a similar issue, mentions that this value should be less than 999999.

Changing this value to 3650 days (10 years) allowed the other SSO users to log in. The same is seen on 6.0 as well, with a different error: Authentication Failure.

Tuesday, 15 November 2016

vSphere 6.5: Installing vCenter Appliance 6.5

With the release of vSphere 6.5, the installation of the vCenter appliance just got a whole lot easier. Earlier, we required the Client Integration Plugin to be available, and the deployment was done through a browser; as we know, the Client Integration Plugin had multiple compatibility issues. Well, the Client Integration Plugin is no longer used. The deployment is now done via an ISO that contains an installation wizard which can be executed on Windows, Mac or Linux.

The vCenter Server Appliance installation consists of a 2-stage deployment.
1. Deploying VCSA 6.5
2. Configuring VCSA 6.5

Deploying VCSA 6.5

Download the vCenter Server appliance from this link here. Once the download is complete, mount the ISO onto any machine and run the installer. You should be seeing the below screen.


We will be choosing the Install option as this is a fresh deployment. The description then shows that there are two steps involved in the installation. The first step will deploy a vCenter Server appliance and the second step will be configuring this deployed appliance.


Accept the EULA


Choose the type of deployment that is required. I will be going with an embedded Platform Services Controller deployment.


Next, choose the ESXi host where you would like to have this vCenter appliance deployed and provide the root credentials of the host for authentication.


Then, provide a name for the vCenter appliance VM that is going to be deployed and set the root password for the appliance.


Based upon your environment size, select the sizing of the vCenter appliance.


Select the datastore where the vCenter appliance files need to reside.


Configure the networking of the vCenter appliance. Make sure you have a valid IP that resolves both forward and reverse before this step, to prevent any failures during installation.


Review and finish the deployment, and the progress for stage 1 begins.


Upon completion, you can click Continue to proceed with configuring the appliance. If you close this window, you will need to log in to the VCSA appliance management page at https://vcenter-IP:5480 to continue with the configuration. In this scenario, I will choose the Continue option to proceed further.

Configuring VCSA 6.5



The stage 2 wizard begins at this point. The first section is to configure NTP for the appliance and enable Shell access for the same.


Here, we will specify the SSO domain name, the SSO password and the site name for the appliance.
In the next step, if you would like to enable the Customer Experience Improvement Program, you can do so; otherwise, skip it and proceed to completion.


Once the configuration wizard is completed the progress for Stage 2 begins.


Once the deployment is complete, you can log in to the web client (https://vCenter-IP:9443/vsphere-client) or the HTML5 client (https://vCenter-IP/ui). The HTML5 web client is available only with the vCenter Server Appliance.

vSphere 6.5: What is vCenter High Availability

In 6.0 we had the option to provide high availability for the Platform Services Controller by deploying redundant PSC nodes in the same SSO domain and using a manual repoint command or a load balancer to switch to a new PSC if the current one went down. However, for the vCenter node there was no such option available, and the only way to have HA for it was to either configure Fault Tolerance or place the vCenter virtual machine in an HA-enabled cluster.

Now, with the release of vSphere 6.5, a much awaited feature has been added to provide redundancy or high availability for your vCenter node too: VCHA, or vCenter High Availability.

The design of VCHA is somewhat similar to a regular clustering mechanism. Before we get to how it works, here are a few prerequisites for VCHA:

1. Applicable to the vCenter Server Appliance only. A VCSA with an embedded PSC is currently not supported.
2. Three unique ESXi hosts. One for each node (Active, Passive and Witness)
3. Three unique datastores to contain each of these nodes.
4. Same Single Sign On Domain for Active and Passive nodes
5. One public IP to access and use vCenter
6. Three private IPs in a subnet different from that of the public IP. These will be used for internal communication to check node state.

vCenter High Availability (VCHA) Deployment:
There are three nodes deployed once your vCenter is configured for high availability: the Active node, the Passive node and the Witness (Quorum) node. The Active node is the one with the public IP vNIC in the up state. This public IP is used to access and connect to your vSphere Web Client for management purposes.

The second node is the Passive node, which is an exact clone of the Active node: it has the same memory, CPU and disk configuration. On this node the public IP vNIC is down and the vNIC used for the private IP is up. The private network between Active and Passive is used for cluster operations: the Active node's database and files are updated regularly and have to be synced to the Passive node, and this information is replicated over the private network.

The third node, also called the Quorum node, acts as a witness. This node is introduced to avoid the split-brain scenario that arises from a network partition: in a network partition we cannot have two active nodes up and running, so the quorum node decides which node is active and which has to be passive.

vPostgres replication is used for database replication between the Active and Passive nodes, and this replication is synchronous. The vCenter files are replicated using native Linux rsync, which is asynchronous.

 What happens during a failover?

When the Active node goes down, the Passive node becomes active and assumes the public IP address. The VCHA cluster enters a degraded state since one of the nodes is down. The failover is not transparent, and there will be an RTO of roughly 5 minutes.

Your cluster can also enter a degraded state when the Active node is still running and healthy but either the Passive or the Witness node is down. In short, if any one node in the cluster is down, VCHA is in a degraded state. More about VCHA states and deployment will follow in a later article.

Hope this was helpful.