Self-healing

Overview

The main goal of the self-healing mechanisms is to recover services from incidents and failures and make them available again as quickly as possible. To achieve this, JLupin reacts automatically when something unexpected happens to a microservice. There are three self-healing processes:

  • Process Keeper
  • Technical Process Keeper
  • Memory Process Keeper

They are used by a core component called Process Manager. All incidents are also recorded in the event log, which is available to users and simplifies monitoring incidents in the environment.

Process Manager

Process Manager is the part of the platform responsible for managing all microservice processes and for communicating with them. There are actually two instances of Process Manager: one manages microservices and the other manages technical microservices. For example, it starts and restarts microservices and communicates with them over JLRMC to execute services. It also contains the self-healing components, which react to system events.

Process Keeper

Process Keeper is a part of Process Manager (running inside Main Server) that monitors the existence of microservice processes at the OS level.

If something (or somebody :)) kills such a process, or it crashes (there are conditions under which the JVM terminates itself), Process Keeper detects the incident and immediately performs the start procedure of the microservice.

Figure 1. Self-healing - Process Keeper.
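
The liveness check described above can be illustrated with a minimal sketch. It is not JLupin's implementation: the class and method names are hypothetical, and it assumes that "existence at the OS level" can be approximated with the standard Java ProcessHandle API (Java 9 or later).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from the JLupin code base.
class ProcessKeeperSketch {

    // Microservice name -> PID of the process that is expected to be running.
    private final Map<String, Long> expectedPids = new LinkedHashMap<>();

    void register(String microserviceName, long pid) {
        expectedPids.put(microserviceName, pid);
    }

    // True if the operating system still reports a live process for this PID.
    static boolean isAlive(long pid) {
        return ProcessHandle.of(pid).map(ProcessHandle::isAlive).orElse(false);
    }

    // Names of registered microservices whose process has disappeared.
    // A real keeper would pass this list to the start procedure.
    List<String> findStopped() {
        List<String> stopped = new ArrayList<>();
        for (Map.Entry<String, Long> entry : expectedPids.entrySet()) {
            if (!isAlive(entry.getValue())) {
                stopped.add(entry.getKey());
            }
        }
        return stopped;
    }

    public static void main(String[] args) {
        ProcessKeeperSketch keeper = new ProcessKeeperSketch();
        // Register the current JVM so the example has a known live process.
        keeper.register("thisJvm", ProcessHandle.current().pid());
        System.out.println("Stopped microservices: " + keeper.findStopped());
    }
}
```

In this sketch the keeper only reports which processes are missing; how often the check runs and how the start procedure is invoked are left out because the source does not describe them.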

Produced event logs

Event Type Description
START_MICROSERVICES_STOPPED_OUTSIDE_JLUPIN_BEFORE The event description lists the processes that are going to be started because they were stopped abnormally (not by Process Manager itself). This event is created before these microservices are started.
START_MICROSERVICES_STOPPED_OUTSIDE_JLUPIN_AFTER This event is created after the abnormally stopped microservices have been started. It contains information about the start process and the status of each microservice after the start (the start may not have been executed correctly).

Technical Process Keeper

Technical Process Keeper is a part of Process Manager (running inside Main Server) that monitors the existence of technical microservice processes at the OS level. JLupin Platform 1.5 introduced one technical microservice, Edge Balancer (nginx), which is critical from the point of view of service accessibility.

If something (or somebody :)) kills such a process, or it crashes (os.exit() in the wrong place ;) ), Technical Process Keeper detects the incident and immediately performs the start procedure of the technical microservice.

Figure 2. Self-healing - Technical Process Keeper.

Produced event logs

Event Type Description
PROCESS_PID_IS_NOT_ALIVE This event is created when JLupin discovers that a technical process's PID is no longer alive.
PROCESS_PID_IS_NOT_ALIVE_ADD_TO_RESTART_LIST This event is created when a process that is not running is added to the restart list because it should be running.
START_TECHNICAL_PROCESSES_STOPPED_OUTSIDE_JLUPIN_BEFORE The event description lists the processes that are going to be started because they were stopped abnormally (not by Process Manager itself). This event is created before these technical microservices are started.
START_TECHNICAL_PROCESSES_STOPPED_OUTSIDE_JLUPIN_AFTER This event is created after the abnormally stopped technical microservices have been started. It contains information about the start process and the status of each technical microservice after the start (the start may not have been executed correctly).
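
To make the relationship between these four events easier to follow, here is a small sketch of one monitoring pass over a single technical process. The event names are taken from the table above; everything else is an assumption made for illustration: the class, the method, the ordering of the events, and the rule that a dead process which is not expected to be running produces only the first event.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongPredicate;

// Hypothetical sketch; the ordering of events is inferred from their descriptions.
class TechnicalKeeperEventsSketch {

    enum EventType {
        PROCESS_PID_IS_NOT_ALIVE,
        PROCESS_PID_IS_NOT_ALIVE_ADD_TO_RESTART_LIST,
        START_TECHNICAL_PROCESSES_STOPPED_OUTSIDE_JLUPIN_BEFORE,
        START_TECHNICAL_PROCESSES_STOPPED_OUTSIDE_JLUPIN_AFTER
    }

    // Returns the events one monitoring pass would produce for one process.
    // isAlive is injected so the example does not depend on real PIDs.
    static List<EventType> check(long pid, boolean shouldBeRunning, LongPredicate isAlive) {
        List<EventType> events = new ArrayList<>();
        if (isAlive.test(pid)) {
            return events; // process is healthy, nothing to report
        }
        events.add(EventType.PROCESS_PID_IS_NOT_ALIVE);
        if (!shouldBeRunning) {
            return events; // assumed: a process stopped on purpose is not restarted
        }
        events.add(EventType.PROCESS_PID_IS_NOT_ALIVE_ADD_TO_RESTART_LIST);
        events.add(EventType.START_TECHNICAL_PROCESSES_STOPPED_OUTSIDE_JLUPIN_BEFORE);
        // The actual start procedure would run here; it may or may not succeed.
        events.add(EventType.START_TECHNICAL_PROCESSES_STOPPED_OUTSIDE_JLUPIN_AFTER);
        return events;
    }

    public static void main(String[] args) {
        // Simulate a dead process that should be running.
        System.out.println(check(1234L, true, pid -> false));
    }
}
```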

Memory Process Keeper

Memory Process Keeper is a part of Process Manager (running inside Main Server) that watches for the following exceptions at the level of a microservice's JVM:

  • Out Of Memory
  • StackOverflow

If one of these exceptions is detected, Memory Process Keeper starts a new instance of the microservice with 15% (by default) more resources for the JVM, while diagnostic data is gathered from the previous instance and archived (a heap dump, the class in which the exception occurred, and a log from the new instance's startup). The point is to keep the service available for as long as possible, to increase capacity in order to prevent (or postpone) another occurrence of the incident, and to collect the data needed to find the root cause.

Figure 3. Self-healing - Memory Process Keeper (1).
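
As a worked example of the default 15% increase, the sketch below computes the new allocation for a given current value. It assumes, purely for illustration, that the resource in question is the maximum heap size in megabytes passed to the JVM through the standard -Xmx option and that the result is rounded up; the source does not state which JVM settings JLupin actually adjusts, and the class and method names are hypothetical.

```java
// Hypothetical sketch of the "15% more resources" arithmetic.
class MemoryGrowthSketch {

    // Default increase taken from the text above.
    static final int DEFAULT_INCREASE_PERCENT = 15;

    // New heap size in MB after the increase, rounded up to a whole megabyte.
    static long increasedHeapMb(long currentHeapMb, int increasePercent) {
        return (currentHeapMb * (100 + increasePercent) + 99) / 100;
    }

    // Formats the value as a standard JVM maximum heap option.
    static String xmxOption(long heapMb) {
        return "-Xmx" + heapMb + "m";
    }

    public static void main(String[] args) {
        long current = 256;
        long next = increasedHeapMb(current, DEFAULT_INCREASE_PERCENT);
        // 256 MB * 1.15 = 294.4 MB, rounded up to 295 MB.
        System.out.println(xmxOption(current) + " becomes " + xmxOption(next));
    }
}
```

Because the increase is applied to the current value, repeated incidents compound: under these assumptions, two consecutive restarts would take a 256 MB heap to 295 MB and then to 340 MB.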

Early detection of these exceptions makes it possible to provide a new instance of the microservice using the zero-downtime deployment mechanisms and protects the service against timeouts or even unavailability.

After dumping is complete and the new instance is active, the previous one is destroyed. In this way the environment returns to its original state with one small difference: one of the microservices has more capacity for processing requests.

Figure 4. Self-healing - Memory Process Keeper (4).

You can configure the paths where dumps and logs are created - read more about it here. By default they are located in directories under platform/logs/memory_manager/.

Produced event logs

Event Type Description
MEMORY_ERROR_WITH_RESTART_BEFORE The event description identifies the process that is going to be restarted due to an out-of-memory or stack overflow error. This event is created before the microservice is restarted and contains additional information about the new memory allocation.
MEMORY_ERROR_WITHOUT_RESTART This event is generated when the microservice will not be restarted because the maximum number of restarts has been reached.
MEMORY_ERROR_WITH_RESTART_AFTER This event is created after the microservice has been restarted. It contains information about the restart process and the resulting status (the restart may not have been executed correctly).
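
The choice between restarting and not restarting can be sketched as a simple counter checked against a limit. The event names come from the table above; the class, the constructor parameter, and the idea that the limit is counted per microservice are assumptions for illustration, since the source only says that a maximum count of restarts exists.

```java
import java.util.List;

// Hypothetical sketch of the restart limit for one microservice.
class MemoryRestartPolicySketch {

    enum EventType {
        MEMORY_ERROR_WITH_RESTART_BEFORE,
        MEMORY_ERROR_WITHOUT_RESTART,
        MEMORY_ERROR_WITH_RESTART_AFTER
    }

    private final int maxRestarts; // assumed to be a configurable limit
    private int restartsDone = 0;

    MemoryRestartPolicySketch(int maxRestarts) {
        this.maxRestarts = maxRestarts;
    }

    // Called when an out-of-memory or stack overflow error is detected.
    // Returns the events that would be logged for this incident.
    List<EventType> onMemoryError() {
        if (restartsDone >= maxRestarts) {
            return List.of(EventType.MEMORY_ERROR_WITHOUT_RESTART);
        }
        restartsDone++;
        // The restart with the increased allocation would happen between these two events.
        return List.of(EventType.MEMORY_ERROR_WITH_RESTART_BEFORE,
                       EventType.MEMORY_ERROR_WITH_RESTART_AFTER);
    }

    public static void main(String[] args) {
        MemoryRestartPolicySketch policy = new MemoryRestartPolicySketch(2);
        for (int incident = 1; incident <= 3; incident++) {
            System.out.println("Incident " + incident + ": " + policy.onMemoryError());
        }
    }
}
```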