Time Drift between Cluster Nodes in vROps and the Consequences

I had an interesting issue in one of my labs recently where the vROps cluster suddenly stopped collecting and showed No Data against the vCenter.

After logging into the Admin UI, I found that the Master Node in my 2-node cluster was in the Waiting for Analytics state.

[Screenshot: vROps Admin UI error]

The best way to troubleshoot this was to check the logs, but because the Master Node was in the Waiting for Analytics state this was not possible from the UI. Log Insight was not present in this particular lab either, so I had to fire up an SSH (PuTTY) session.

In general, the vROps logs for troubleshooting are located in /storage/log/vcops/log. The particular log I was interested in was the most recent analytics log (analytics-16982cf5-b677-4b75-8281-97d1321e376c.log).
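If there are several analytics logs in that directory, picking the newest one can be scripted. The following is only a minimal sketch: it assumes Python 3 is available wherever the logs are read (on the node itself or on a copy of the log directory), and the helper names newest_analytics_log and tail are my own, not part of vROps.

```python
from pathlib import Path
from typing import List, Optional

# Log directory as given in the article; adjust if your appliance differs.
LOG_DIR = Path("/storage/log/vcops/log")


def newest_analytics_log(log_dir: Path = LOG_DIR) -> Optional[Path]:
    """Return the most recently modified analytics-*.log file, or None."""
    candidates = list(log_dir.glob("analytics-*.log"))
    if not candidates:
        return None
    # The file with the latest modification time is the one being written to.
    return max(candidates, key=lambda p: p.stat().st_mtime)


def tail(path: Path, lines: int = 40) -> List[str]:
    """Return the last `lines` lines of a text file (reads the whole file)."""
    with path.open(errors="replace") as handle:
        return handle.readlines()[-lines:]
```

Calling newest_analytics_log() and then printing "".join(tail(log)) gives roughly the same view as opening the log and scrolling to the bottom.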

What caused this issue?

After scrolling to the bottom of the log I could see the error messages preventing Analytics from starting up:

INFO [Analytics Main Thread ] com.vmware.vcops.platform.gemfire.GemfireFunctionHandler.registerHandler – Register ControllerManagementServer for interface ControllerManagementInterface.

INFO [Analytics Main Thread ] com.vmware.vcops.platform.gemfire.GemfireFunctionHandler.registerHandler – Register ControllerManagementServer for interface EntityManagementInterface.

ERROR [Analytics Main Thread ] com.integrien.alive.controller.Controller.verifyTimeDifferenceBetweenServers – Time difference between servers is:46305 ms. It is greater than 30000 ms. Unable to operate, terminating…

ERROR [Analytics Main Thread ] com.integrien.analytics.AnalyticsMain.run – AnalyticsMain.run failed with error: IllegalStateException: Time difference between servers is:46305 ms. It is greater than 30000 ms. Unable to operate, terminating…
java.lang.IllegalStateException: Time difference between servers is:46305 ms. It is greater than 30000 ms. Unable to operate, terminating…
at com.integrien.alive.controller.Controller.verifyTimeDifferenceBetweenServers(Controller.java:1395)
at com.integrien.alive.controller.Controller.doRun(Controller.java:705)
at com.integrien.analytics.AnalyticsMain.doRun(AnalyticsMain.java:378)
at com.integrien.analytics.AnalyticsMain.run(AnalyticsMain.java:1548)

Here we can see that vROps detected a time difference of 46305 ms between the Master and Data nodes, which is greater than the 30000 ms (30 second) limit. As a general rule, Analytics will not start if the time difference between cluster nodes is more than 30 seconds, so the cluster never runs with clocks that have drifted that far apart.
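To spot this condition without reading the whole log by eye, the message can be matched with a regular expression. This is a rough sketch: the pattern is based only on the message text shown in the excerpt above, and the function name find_time_skew is hypothetical.

```python
import re
from typing import Optional, Tuple

# Pattern derived from the error text in the log excerpt; other vROps
# versions may word the message differently.
SKEW_PATTERN = re.compile(
    r"Time difference between servers is:\s*(\d+)\s*ms\.\s*"
    r"It is greater than\s*(\d+)\s*ms"
)


def find_time_skew(log_text: str) -> Optional[Tuple[int, int]]:
    """Return (reported_skew_ms, limit_ms) from the last match, or None."""
    matches = SKEW_PATTERN.findall(log_text)
    if not matches:
        return None
    skew_ms, limit_ms = matches[-1]
    return int(skew_ms), int(limit_ms)


sample = (
    "ERROR [Analytics Main Thread ] "
    "com.integrien.alive.controller.Controller.verifyTimeDifferenceBetweenServers "
    "– Time difference between servers is:46305 ms. "
    "It is greater than 30000 ms. Unable to operate, terminating…"
)
print(find_time_skew(sample))  # (46305, 30000)
```

A result of None means the message was not found, which would suggest that Analytics is failing for a different reason.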

To verify this, I fired up two additional PuTTY sessions and ran the date command in a watch loop. Sure enough, I could see a time difference between the two nodes.

[Screenshot: date output over SSH, vrops66]

[Screenshot: date output over SSH, vrops66-2]
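The same comparison can be done with numbers instead of two terminal windows. The sketch below is an illustration only: it assumes key-based SSH access as root and a date command that supports +%s on each node, and the function names are my own. Because the nodes are queried one after another, the SSH round trip adds a small error, which is negligible against a 30 second limit.

```python
import subprocess
from typing import Dict

LIMIT_SECONDS = 30  # 30000 ms, the limit reported in the analytics log


def read_epoch_seconds(host: str) -> int:
    """Run `date +%s` on a host over SSH and return its epoch seconds.

    Assumes key-based SSH access as root; not part of any vROps tooling.
    """
    result = subprocess.run(
        ["ssh", f"root@{host}", "date", "+%s"],
        capture_output=True, text=True, check=True, timeout=10,
    )
    return int(result.stdout.strip())


def max_skew_seconds(epochs: Dict[str, int]) -> int:
    """Largest clock difference, in seconds, between any two nodes."""
    if len(epochs) < 2:
        return 0
    return max(epochs.values()) - min(epochs.values())


def exceeds_limit(epochs: Dict[str, int], limit: int = LIMIT_SECONDS) -> bool:
    """True if the skew is strictly greater than the limit, as in the log."""
    return max_skew_seconds(epochs) > limit


# Example with made-up readings that mirror the roughly 46 second gap:
readings = {"master": 1_700_000_000, "data": 1_700_000_046}
print(max_skew_seconds(readings), exceeds_limit(readings))  # 46 True
```

In practice the readings dictionary would be filled by calling read_epoch_seconds for each node name in your cluster.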

How was the issue resolved?

To resolve this issue, I reset the date on the Data node to be within 30 seconds of the Master Node. Once I did this, I was able to restart Analytics successfully on the Master Node, and after a few more collection cycles vROps was working as expected and data was again being collected from the vCenter.

[Screenshot: vROps date over SSH]

[Screenshot: vROps solutions working]

Conclusion:

  • Ideally, use an NTP server to keep your cluster nodes in sync and avoid time drift.
  • Ensure that the clocks on all cluster nodes are within 30 seconds of each other; otherwise Analytics will not start and the cluster will stop collecting.

 

Thanks for reading, and please feel free to leave a comment or message me on Twitter (@lukaswinn) if you found this article useful.

— Lukas
