Watchdog Mechanism
Each watchdog performs the following actions:
- Checks its own Tomcat service and database, and writes the results to its log.
- Communicates with the other watchdogs via heartbeats and notifies them if a Tomcat service is down.
- Logs the connection time to each instance.
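As a rough illustration of the heartbeat exchange and connection-time logging described above, the Python sketch below shows one watchdog recording when it last heard from each peer. The `Watchdog` class, its method names, and the idea that each heartbeat carries the sender's Tomcat up/down flag are assumptions for illustration, not the product's implementation.

```python
from __future__ import annotations

import logging
import time
from dataclasses import dataclass, field

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(name)s %(message)s")


@dataclass
class Watchdog:
    """Hypothetical watchdog that logs heartbeats and remembers when each peer last answered."""

    name: str
    last_seen: dict[str, float] = field(default_factory=dict)

    def __post_init__(self) -> None:
        self.log = logging.getLogger(f"watchdog.{self.name}")

    def receive_heartbeat(self, peer: str, tomcat_up: bool, now: float | None = None) -> None:
        # Record the connection time for this peer and log it.
        now = time.time() if now is None else now
        self.last_seen[peer] = now
        self.log.info("heartbeat from %s at %.3f (tomcat_up=%s)", peer, now, tomcat_up)
        # Assumption: the heartbeat carries the peer's Tomcat status.
        if not tomcat_up:
            self.log.warning("Tomcat on %s reported down", peer)

    def silent_peers(self, interval: float, now: float | None = None) -> list[str]:
        # Peers that have not sent a heartbeat within one interval.
        now = time.time() if now is None else now
        return [peer for peer, seen in self.last_seen.items() if now - seen > interval]
```

The `interval` argument plays the role of the heartbeat interval that the parameter service reads from the database; `now` is injectable so the logic can be exercised without waiting.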
If the master watchdog is down:
- The next member takes on the mastership after checking the down status three times (the number of checks is configurable).
If the Tomcat service on the master instance is down:
- The master watchdog notifies the other watchdogs.
- The next member takes on the mastership directly, without repeating the down-status check.
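The two takeover rules above can be expressed as one decision function. This is a minimal Python sketch, assuming the member records each attempt to reach the master watchdog as answered or not, and that the failed checks must be consecutive; `Failure`, `should_take_mastership`, and `retry_threshold` are hypothetical names, with the default of 3 mirroring the three checks described for this scenario.

```python
from enum import Enum
from typing import Iterable


class Failure(Enum):
    """Hypothetical labels for the two takeover triggers."""

    MASTER_WATCHDOG_DOWN = "master watchdog down"
    MASTER_TOMCAT_DOWN = "master Tomcat down"


def should_take_mastership(
    failure: Failure, checks: Iterable[bool], retry_threshold: int = 3
) -> bool:
    """Decide whether the next member should become master.

    `checks` holds successive attempts to reach the master watchdog
    (True = it answered). `retry_threshold` stands in for the configurable
    check count stored with the other watchdog parameters.
    """
    if failure is Failure.MASTER_TOMCAT_DOWN:
        # The master watchdog already reported its Tomcat down: take over directly.
        return True
    # Master watchdog silent: require `retry_threshold` consecutive failed checks.
    consecutive_failures = 0
    for answered in checks:
        consecutive_failures = 0 if answered else consecutive_failures + 1
        if consecutive_failures >= retry_threshold:
            return True
    return False
```

Requiring consecutive failures means a single late heartbeat resets the count, which avoids a takeover caused by one transient network delay.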
The following table details each scenario:
Service | Master | Member | Action |
---|---|---|---|
Tomcat | Up | Down | Watchdog takes no extra action. |
Tomcat | Down | Up | 1- The master watchdog sends a failure notification to the member watchdog. 2- The member watchdog initiates the failover scenario. |
Tomcat | Down | Down | Watchdog takes no extra action. |
Watchdog | Up | Down | Watchdog takes no extra action. |
Watchdog | Down | Up | 1- The member watchdog takes on the mastership after checking the master's down status three times (configurable). 2- The member watchdog initiates the failover scenario. |
Watchdog | Down | Down | Watchdog takes no extra action. |
Watchdog and Tomcat (happy path) | Up | Up | The master and member watchdogs exchange heartbeats and acknowledgments and log all communication. |
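One way to encode this table is as a lookup keyed by service and the master/member states. The Python sketch below is illustrative only; the constant names, the action strings, and the `action_for` helper are hypothetical summaries of the table rows, and the happy path is simplified to "both sides up".

```python
NO_ACTION = "No extra action."
TOMCAT_FAILOVER = "Master watchdog notifies the member; member starts failover."
WATCHDOG_FAILOVER = (
    "Member confirms the master is down (configurable checks), "
    "takes mastership, starts failover."
)
HAPPY_PATH = "Exchange heartbeats and acknowledgments; log all communication."

# (service, master_up, member_up) -> action, one entry per non-happy-path row.
ACTIONS = {
    ("tomcat", True, False): NO_ACTION,
    ("tomcat", False, True): TOMCAT_FAILOVER,
    ("tomcat", False, False): NO_ACTION,
    ("watchdog", True, False): NO_ACTION,
    ("watchdog", False, True): WATCHDOG_FAILOVER,
    ("watchdog", False, False): NO_ACTION,
}


def action_for(service: str, master_up: bool, member_up: bool) -> str:
    """Look up the watchdog action for one row of the scenario table."""
    if master_up and member_up:
        return HAPPY_PATH
    return ACTIONS[(service, master_up, member_up)]
```

Keeping the rules in a data table rather than nested conditionals makes it easy to check the code row by row against the documented scenarios.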
Three services support the watchdog mechanism:
- One service checks the Tomcat status: it sends a request to the database and returns OK if it receives a response (that is, if the connection works).
- A second service provides the instance information and threshold values used by the watchdog mechanism:
  - Instance name, IP address, and priority order, read from the t_instance table.
  - The number of times a watchdog rechecks when it receives no answer from another watchdog, and the interval at which the watchdogs communicate. Both values are stored in the t_system_parameter table, and their names start with "watchdog".
- A third service updates the master instance information.
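The three services can be approximated against an in-memory SQLite database. This sketch assumes a hypothetical schema: t_instance with `name`, `ip`, `priority`, and `is_master` columns, and t_system_parameter holding name/value pairs under the hypothetical keys `watchdog.retry_threshold` and `watchdog.heartbeat_interval_s`. The real tables, column names, and database engine may differ, `SELECT 1` stands in for whatever request the status service actually sends, and wrapping around to the first instance when the last one fails is also an assumption.

```python
from __future__ import annotations

import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical schema; the real tables may have different columns.
    conn.executescript(
        """
        CREATE TABLE t_instance (
            name TEXT PRIMARY KEY, ip TEXT, priority INTEGER, is_master INTEGER DEFAULT 0
        );
        CREATE TABLE t_system_parameter (name TEXT PRIMARY KEY, value TEXT);
        """
    )


def database_is_up(conn: sqlite3.Connection) -> bool:
    """Service 1: return True ("OK") if the database answers a trivial request."""
    try:
        conn.execute("SELECT 1").fetchone()
        return True
    except sqlite3.Error:
        return False


def load_parameters(conn: sqlite3.Connection) -> tuple[list[tuple], int, int]:
    """Service 2: instances in priority order plus the two watchdog thresholds."""
    instances = conn.execute(
        "SELECT name, ip, priority FROM t_instance ORDER BY priority"
    ).fetchall()
    params = dict(
        conn.execute(
            "SELECT name, value FROM t_system_parameter WHERE name LIKE 'watchdog.%'"
        ).fetchall()
    )
    # Hypothetical parameter keys.
    retries = int(params["watchdog.retry_threshold"])
    interval = int(params["watchdog.heartbeat_interval_s"])
    return instances, retries, interval


def update_master(conn: sqlite3.Connection, failed_master: str) -> str:
    """Service 3: promote the next instance in priority order (wrapping around)."""
    names = [row[0] for row in conn.execute("SELECT name FROM t_instance ORDER BY priority")]
    new_master = names[(names.index(failed_master) + 1) % len(names)]
    # Set is_master to 1 for the new master and 0 for every other instance.
    conn.execute("UPDATE t_instance SET is_master = (name = ?)", (new_master,))
    conn.commit()
    return new_master
```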
Watchdogs periodically check replication between the Kron PAM instances. They can also check SAPM jobs and stop them if necessary, taking the replication status into account when deciding whether to disable SAPM jobs.
The following table outlines the scenarios in which SAPM jobs are disabled:
Service | Master | Member | Action |
---|---|---|---|
Tomcat & Database | Up | Down | Each watchdog stops the SAPM jobs on its own instance. |
Replication Status | Up | Down | Each watchdog stops the SAPM jobs on its own instance. |
Watchdog | Up | Down | Each active watchdog stops the SAPM jobs on its own instance. |
Tomcat & Database | Down | Up | Each watchdog stops the SAPM jobs on its own instance. |
Replication Status | Down | Up | Each watchdog stops the SAPM jobs on its own instance. |
Watchdog | Down | Up | Each active watchdog stops the SAPM jobs on its own instance. |
Tomcat & Database and Replication Status (happy path) | Up | Up | The master and member watchdogs keep monitoring the services and log all communication. |
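Read row by row, the SAPM table reduces to one rule: if any monitored service is down on either instance, every running watchdog stops the SAPM jobs on its own instance. The Python sketch below assumes that reading; `InstanceHealth` and `sapm_stop_targets` are hypothetical, and the case where a service is down on both instances (not listed in the table) is treated here as a stop, which is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class InstanceHealth:
    """Hypothetical snapshot of one instance's monitored services."""

    tomcat_and_db_up: bool
    replication_ok: bool
    watchdog_up: bool

    def all_up(self) -> bool:
        return self.tomcat_and_db_up and self.replication_ok and self.watchdog_up


def sapm_stop_targets(instances: dict[str, InstanceHealth]) -> list[str]:
    """Return the instances on which SAPM jobs get stopped."""
    if all(h.all_up() for h in instances.values()):
        return []  # Happy path: keep watching and logging.
    # Any service down on any instance disables SAPM jobs, but only a running
    # watchdog can act, and only on its own instance.
    return [name for name, h in instances.items() if h.watchdog_up]
```

For example, if only the master's watchdog is down, the sketch returns `["member"]`, matching the "active watchdogs" wording in the Watchdog rows.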