Self-healing

Overview

The main goal of the self-healing mechanisms is to recover services from incidents and failures and make them available again as fast as possible. To achieve this, JLupin reacts automatically when something unexpected happens to a microservice. There are three self-healing processes:

  • Process Keeper
  • Technical Process Keeper
  • Memory Process Keeper

Process Keeper

Process Keeper is a part of Process Manager (running inside Main Server) that monitors the existence of microservices' processes at the OS level.

If something (or somebody :)) kills such a process or it crashes (there are some conditions under which the JVM terminates itself automatically), Process Keeper detects the incident and immediately performs the start procedure of the microservice.

Figure 1. Self-healing - Process Keeper.
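
The detect-and-restart behaviour described above can be illustrated with a minimal sketch. It is not JLupin's implementation: the class name ProcessKeeperSketch, the check() method and the two callbacks are hypothetical names chosen for this example. The liveness check is passed in as a function so the logic stays independent of the OS; in a real setup it could be backed by something like Process.isAlive().

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of a keeper loop body; all names are illustrative.
final class ProcessKeeperSketch {
    private final BooleanSupplier processAlive; // e.g. backed by Process.isAlive()
    private final Runnable startProcedure;      // starts the microservice again
    private int restarts;

    ProcessKeeperSketch(BooleanSupplier processAlive, Runnable startProcedure) {
        this.processAlive = processAlive;
        this.startProcedure = startProcedure;
    }

    /** One monitoring pass. Returns true if a restart was triggered. */
    boolean check() {
        if (processAlive.getAsBoolean()) {
            return false; // process exists, nothing to do
        }
        // The process was killed or crashed: run the start procedure immediately.
        startProcedure.run();
        restarts++;
        return true;
    }

    int restarts() {
        return restarts;
    }
}
```

A monitoring thread would call check() periodically; how often the real Process Keeper checks is not stated here, so the interval is left to the caller.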

Technical Process Keeper

Technical Process Keeper is a part of Process Manager (running inside Main Server) that monitors the existence of technical microservices' processes at the OS level. JLupin Platform 1.5 introduced one technical microservice, Edge Balancer (nginx), which is critical from the service accessibility point of view.

If something (or somebody :)) kills such a process or it crashes (os.exit() in the wrong place ;)), Technical Process Keeper detects the incident and immediately performs the start procedure of the microservice.

Figure 2. Self-healing - Technical Process Keeper.

Memory Process Keeper

Memory Process Keeper is a part of Process Manager (running inside Main Server) that monitors whether one of the following exceptions occurs at the level of the microservice's JVM:

  • Out Of Memory
  • StackOverflow

If one of these exceptions is detected, Memory Process Keeper starts a new instance of the microservice with 15% (by default) more resources for the JVM, while diagnostic data is gathered from the previous instance and archived (heap dump, the class in which the exception occurred, a log from the new instance startup). The point is to keep the service available as long as possible, increase capacity to prevent (or postpone) another occurrence of the incident, and collect the data needed to find the root cause.

Figure 3. Self-healing - Memory Process Keeper (1).
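
As a rough illustration of the 15% increase, the sketch below computes a larger memory setting for the replacement instance. It is an assumption-laden example, not the platform's code: the class MemoryBoostSketch and its methods are hypothetical, and the text does not say which JVM settings are scaled, so the standard -Xmx heap option is used purely as an example. The two monitored conditions correspond to java.lang.OutOfMemoryError and java.lang.StackOverflowError.

```java
// Hypothetical sketch; names and the choice of -Xmx are assumptions.
final class MemoryBoostSketch {
    static final int DEFAULT_INCREASE_PERCENT = 15;

    /** True for the two JVM conditions the keeper reacts to. */
    static boolean isMemoryIncident(Throwable t) {
        return t instanceof OutOfMemoryError || t instanceof StackOverflowError;
    }

    /** Current size in megabytes plus the given percentage, rounded up. */
    static long increasedMegabytes(long currentMb, int percent) {
        long extra = (currentMb * percent + 99) / 100; // integer ceiling of the increase
        return currentMb + extra;
    }

    /** Formats a heap limit as a standard JVM option, e.g. -Xmx295m. */
    static String xmxOption(long megabytes) {
        return "-Xmx" + megabytes + "m";
    }
}
```

For example, an instance started with a 256 MB heap would be replaced by one started with xmxOption(increasedMegabytes(256, DEFAULT_INCREASE_PERCENT)), which evaluates to -Xmx295m under these assumptions.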

Early detection of the above exceptions makes it possible to provide a new instance of the microservice using zero downtime deployment mechanisms and to protect the service against timeouts or even unavailability.

After the dump is complete and the new instance is active, the previous one is destroyed. In this way the environment goes back to its original state with one small difference - one of the microservices has more capacity for processing requests.

Figure 4. Self-healing - Memory Process Keeper (4).
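
The ordering rule described above (destroy the previous instance only after both the dump is complete and the new instance is active) can be sketched with two futures. This is an illustrative sketch only: ReplacementSketch, destroyWhenSafe and the parameter names are hypothetical and do not reflect how Process Manager is actually implemented.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of the "destroy only when both conditions hold" rule.
final class ReplacementSketch {
    /**
     * Runs destroyPrevious once both futures have completed, in either order.
     * The returned future completes after the previous instance is destroyed.
     */
    static CompletableFuture<Void> destroyWhenSafe(
            CompletableFuture<Void> newInstanceActive,
            CompletableFuture<Void> dumpComplete,
            Runnable destroyPrevious) {
        return CompletableFuture
                .allOf(newInstanceActive, dumpComplete) // wait for both conditions
                .thenRun(destroyPrevious);              // then remove the old instance
    }
}
```

Because the previous instance stays up until both conditions hold, requests can keep being served while diagnostic data is still being collected.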