Monitoring & Automation - is your app 100% available?
Description
Read this if you need to make your mission-critical business application or technology 100% available (excluding upgrade-related downtime and issues that cannot be worked around by restarting one or more software components). You can use the same approach that we use for Germain's own cloud or, if you need this to work in a self-contained, single on-premise deployment, run it in a watchdog service instead.
Germain's cloud has become 100% available (with the same exclusions) owing to the monitoring and auto-restart mechanisms that we recently configured in Germain. These mechanisms have been part of the automation framework since 2014, yet we had never really used them for ourselves! They currently run on a dedicated Germain cloud instance, which is in charge of monitoring all other Germain cloud instances (hosted on AWS).
Configuration
The relevant configuration in Germain:
- SLA for the query that gets the status of services from the cloud-mgmt system:
- Action to kick off the restart script:
- cloud-service-restart.sh
Notes:
- The concept of “order” doesn’t apply: any one service is restarted on its own if it becomes unresponsive.
- The only condition that is handled is a service failing to respond over the network on 5 consecutive checks (Germain cloud-mgmt executes 1 such check per minute for each service it manages).
- All of these are examples and can be configured to behave differently. Feel free to ask us if needed.
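
The restart rule in the notes above can be sketched in a few lines of Python. This is only an illustration of the "5 consecutive failed checks" logic, not Germain's actual implementation: the `Watchdog` class and the `check` and `restart` callables are hypothetical names, standing in for a network probe and for whatever action launches the restart script.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

# Threshold taken from the notes above; everything else here is an assumption.
FAILURE_THRESHOLD = 5


@dataclass
class Watchdog:
    check: Callable[[str], bool]    # hypothetical probe: True if the service responds
    restart: Callable[[str], None]  # hypothetical action, e.g. launching a restart script
    threshold: int = FAILURE_THRESHOLD
    failures: Dict[str, int] = field(default_factory=dict)

    def poll(self, services: Iterable[str]) -> List[str]:
        """Run one check per service and return the services restarted this round."""
        restarted = []
        for name in services:
            if self.check(name):
                self.failures[name] = 0  # any successful check resets the streak
                continue
            self.failures[name] = self.failures.get(name, 0) + 1
            if self.failures[name] >= self.threshold:
                self.restart(name)
                self.failures[name] = 0  # count afresh after a restart
                restarted.append(name)
        return restarted
```

Calling `poll` once per minute would mirror the check frequency described above. Each service keeps its own counter, so no restart order is implied, and a single successful response clears the streak.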