A recent customer question about alerting on Uptime values from the REST API got me digging into the Health and Wellness Policies for a better solution.
The request was to alert when the uptime value for specific device families was reset indicating that something had occured with the service and reset the uptime value. Repeated resets of the uptime value could indicate an issue with the service that needed attention (core files created as a result of decoder service crashes was the root of this request).
Here is my solution:
Now you have a policy that alerts when the uptime is within the first 60 seconds of restarting (.. is two digits so up to 60 seconds) and recovers once the uptime doesnt match the pattern (when 60 seconds switches to minute and seconds (61 seconds +)
Alarm
Recovery
Details on the pattern developed:
number of seconds followed by a comma then the friendly time breakdown of the seconds in years, months, weeks, days, hours, minutes and seconds.
.. = looked for 2 digits for the seconds (between 10-59 seconds after service restarted)
, .. = looked for the same seconds value after the comma
seconds.* = the word seconds and the trailing space in the value
when this pattern is matched (between 10-59 seconds after restart) there will be an alarm, then it will clear when that pattern is not matched (60 seconds +)
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.