Self-healing
Overview
The main goal of self-healing mechanisms is to recover services from incidents and failures and make them available again as fast as possible. To achieve this goal, JLupin automatically reacts when something unexpected happens to microservices. We introduced three self-healing processes:
- Process Keeper
- Technical Process Keeper
- Memory Process Keeper
They are used by the core component called Process Manager. All incidents are also logged in the event log, which is available to users and simplifies monitoring incidents in the environment.
Process Manager
Process Manager is the part of the platform responsible for managing all microservices' processes (there are actually two instances of the Process Manager: one manages microservices and one manages technical microservices) and for communicating with them. For example, it starts or restarts microservices, communicates over JLRMC to execute services, and so on. It also contains the self-healing components that react to system events.
Process Keeper
Process Keeper is a part of Process Manager (running inside Main Server) that monitors the existence of microservices' processes at the OS level.
If something (or somebody :)) kills such a process or it crashes (there are some conditions under which a JVM terminates automatically by itself), Process Keeper detects such incidents and performs the start procedure of the microservice immediately.
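The core of such a keeper is a liveness check on the monitored PID. A minimal sketch of this idea using the standard Java `ProcessHandle` API is shown below; the class and method names are illustrative and do not reflect JLupin's actual implementation:

```java
import java.util.Optional;

public class ProcessKeeperSketch {

    // Hypothetical liveness check: returns true only if an OS process
    // with the given PID currently exists and is alive.
    static boolean isAlive(long pid) {
        return ProcessHandle.of(pid)
                .map(ProcessHandle::isAlive)
                .orElse(false);
    }

    public static void main(String[] args) {
        // The current JVM's own PID is certainly alive.
        long selfPid = ProcessHandle.current().pid();
        System.out.println(isAlive(selfPid)); // prints: true
    }
}
```

A real keeper would run such a check periodically in a loop and trigger the microservice start procedure whenever the check fails.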
Produced event logs
Event Type | Description |
---|---|
START_MICROSERVICES_STOPPED_OUTSIDE_JLUPIN_BEFORE | The event description contains information about which processes are going to be started due to being stopped abnormally (not by the Process Manager itself). This event is created before starting these microservices. |
START_MICROSERVICES_STOPPED_OUTSIDE_JLUPIN_AFTER | This event is created after abnormally stopped microservices were started. It contains information about the start process and their statuses after start (the start may not be executed correctly). |
Technical Process Keeper
Technical Process Keeper is a part of Process Manager (running inside Main Server) that monitors the existence of technical microservices' processes at the OS level. In JLupin Platform 1.5 we've introduced one technical microservice: Edge Balancer (nginx), which is very critical from the service accessibility point of view.
If something (or somebody :)) kills such a process or it crashes (os.exit() in the wrong place ;)), Technical Process Keeper detects such incidents and performs the start procedure of the technical microservice immediately.
Produced event logs
Event Type | Description |
---|---|
PROCESS_PID_IS_NOT_ALIVE | This event is created when JLupin discovers that a technical process's PID is no longer alive. |
PROCESS_PID_IS_NOT_ALIVE_ADD_TO_RESTART_LIST | This event is created when a process that is not running is added to the restart list because it should be running. |
START_TECHNICAL_PROCESSES_STOPPED_OUTSIDE_JLUPIN_BEFORE | The event description contains information about which processes are going to be started due to being stopped abnormally (not by the Process Manager itself). This event is created before starting these technical microservices. |
START_TECHNICAL_PROCESSES_STOPPED_OUTSIDE_JLUPIN_AFTER | This event is created after abnormally stopped technical microservices were started. It contains information about the start process and their statuses after start (the start may not be executed correctly). |
Memory Process Keeper
Memory Process Keeper is a part of Process Manager (running inside Main Server) that monitors whether one of the following exceptions occurs at the level of a microservice's JVM:
- Out Of Memory
- StackOverflow
If one of these exceptions is detected, Memory Process Keeper starts a new instance of the microservice with 15% (by default) more resources for the JVM, while diagnostic data is gathered from the previous instance and archived (heap dump, the class in which the exception occurred, a log from the new instance startup). The point is to keep the service available as long as possible, increase capacity to prevent (or postpone) another occurrence of this incident, and collect the data necessary to find the root cause.
The early detection of the above exceptions makes it possible to provide a new instance of the microservice using zero downtime deployment mechanisms and to protect the service against timeouts or even unavailability.
After dumping is complete and the new instance is active, the previous one is destroyed. In that way the environment goes back to the original state with one small difference: one of the microservices has more capacity for processing requests.
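The resource increase described above is a simple percentage bump of the JVM memory allocation. A minimal sketch of that calculation is shown below; the class name, method name, and the use of megabytes as the unit are assumptions for illustration, not JLupin's actual configuration model:

```java
public class MemoryBumpSketch {

    // Hypothetical helper: compute the new max heap size (in MB) after
    // increasing the current allocation by the given percentage
    // (15% is the documented default).
    static long increasedHeapMb(long currentMb, int percent) {
        return currentMb + (currentMb * percent) / 100;
    }

    public static void main(String[] args) {
        // A 1024 MB heap bumped by the default 15% becomes 1177 MB.
        System.out.println(increasedHeapMb(1024, 15)); // prints: 1177
    }
}
```

The new value would then be passed to the replacement instance (e.g. via the `-Xmx` JVM option) when it is started under the zero downtime deployment mechanism.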
You can configure the paths where dumps and logs are created - read more about it here. By default they are located in directories under platform/logs/memory_manager/.
Produced event logs
Event Type | Description |
---|---|
MEMORY_ERROR_WITH_RESTART_BEFORE | The event description contains information about which process is going to be restarted due to an out of memory or stack overflow error. This event is created before restarting the microservice and contains additional information about the new memory allocation. |
MEMORY_ERROR_WITHOUT_RESTART | This event is generated when a microservice won't be restarted because the maximum number of restarts has been reached. |
MEMORY_ERROR_WITH_RESTART_AFTER | This event is created after restarting microservices. It contains information about the restart process and their statuses after it (the restart may not be executed correctly). |