While the system is operational, exactly one watchdog process runs on each application server (henceforth referred to as “node”). Technically, this is a process in which the TradeDesign Runtime System (e.g. td2soci) executes the transaction SYSWDR. This process can either be started at regular intervals by an operating system facility (e.g. crontab) in installations with a daily maintenance window, or automatically when the node starts up (24/7 operation).
The watchdog process carries out the following tasks:
There is exactly one watchdog (i.e. one node) for which the 'Primary' flag (stored in SYN) is set. Only this watchdog maintains the database-related topics (reports about WFE, TRN, SMH, SPT…) and updates the overall error condition of the system in SYA with the result of the check run. Other reports (e.g. searching for dumps on the file system) are executed on each node.
On the client side, load balancing is activated with the -u switch on the command line of the Unicode Windows Client (td2uclnt.exe, version 2.0.0.99 or higher).
If the client is started with the -u switch, the procedure is as follows:
Based on host and port parameters, the key created for the Windows Registry is:
[HKEY_CURRENT_USER\Software\Surecomp DOS GmbH\TD2 UClient\<host>\<port>]
If an entry is found under this key, the list of possible servers is read from there; otherwise the server and port are read from the command line (this usually serves to initialise the list).
The client randomly selects an entry from the list and attempts to connect to this host/port.
If a connection cannot be established successfully, the next attempt is started, using another randomly selected entry from the list.
Attempts to establish a connection can fail for the following reasons in particular:
If the list of hosts contains only one element (i.e. this is the first connection established from this computer), then the list of servers is extracted from the messages from td2srun, written to the Registry, and used immediately. This permits a first-time log-in even on a locked system.
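The client-side selection described above can be sketched as follows. This is an illustrative Python model, not the client's actual code; the Registry key layout is taken from the text, while `try_connect` merely stands in for the real TCP/SSL connection attempt.

```python
import random

def build_registry_key(host: str, port: int) -> str:
    """Mirror of the client's Registry key layout (illustrative only)."""
    return rf"HKEY_CURRENT_USER\Software\Surecomp DOS GmbH\TD2 UClient\{host}\{port}"

def pick_server(servers, try_connect):
    """Randomly try list entries until one connection succeeds.

    `servers` is the host/port list read from the Registry (or, on a
    first-time connection, the single entry from the command line);
    `try_connect` stands in for the real connection attempt.
    """
    candidates = list(servers)
    random.shuffle(candidates)          # random order, as the client does
    for host, port in candidates:
        if try_connect(host, port):     # first successful attempt wins
            return host, port
    return None                         # every entry failed
```

For example, if only one of three configured servers accepts connections, `pick_server` will still find it, regardless of the random order in which the entries are tried.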
Each time a session starts, the list of host names stored in SYSWDR.INI, section [LoadBalancing], entry 'LBINIT', is transmitted once from the server to the client by the application via SetContext("CLIENTPROPERTY", "LBINIT") and stored in the Registry.
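A minimal sketch of the server-side read of that INI entry, using Python's standard `configparser`. The section and entry names are from the text; the comma-separated value format is an assumption, since the actual separator used in SYSWDR.INI is not specified here.

```python
import configparser

def read_lbinit(ini_text: str) -> list:
    """Read the LBINIT host list from the [LoadBalancing] section.

    Assumes a comma-separated list of host names; the real SYSWDR.INI
    may use a different separator.
    """
    cfg = configparser.ConfigParser()
    cfg.read_string(ini_text)
    raw = cfg.get("LoadBalancing", "LBINIT", fallback="")
    return [h.strip() for h in raw.split(",") if h.strip()]
```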
The startup script bin/unix/td2srun is started by inetd in order to launch the server process of the application. td2srun first checks whether the file ini/td2soff.txt exists and, if it does, whether it contains the node's own IP address. If the file exists and does not contain the own IP address, its text is issued via the socket (stdout of td2srun) and the connection is terminated. If td2soff.txt does not exist, or if it (as an exception) contains the own IP address, the runtime system is started, which first of all converts the socket connection into an SSL connection.
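The decision td2srun makes can be condensed into a small predicate. This is a hedged sketch of the logic described above, not the actual shell script; the function name and the plain substring check are assumptions.

```python
import os

def server_may_start(td2soff_path: str, own_ip: str) -> bool:
    """Decide whether td2srun may start the runtime system.

    Start is allowed if ini/td2soff.txt does not exist, or if it
    (exceptionally) contains the node's own IP address. Otherwise the
    file's text is sent to the client and the connection is closed.
    """
    if not os.path.exists(td2soff_path):
        return True
    with open(td2soff_path) as f:
        return own_ip in f.read()
```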
The watchdog SYSWDR uses the following tables:
SYA System Status Record
This table stores the error condition of the overall system in the Severity flag of the record with subtype 'S' (red or green notebook in Office). Only the node where PRIFLG is set writes the SYA record with subtype 'S'; all other nodes write SYA records with subtype 'N' for their own node.
The following tables are only used if 'Automatic Load Balancing' is configured in the watchdog:
SYN Nodes (one record per watchdog instance)
SYM Running Managers (one record per running Manager process)
SYK Server Key Figures (historicised 'Last Statistic')
The tables SYN, SYM and SYK are each maintained exclusively by the SYSWDR on the same node.
Exceptions:
The list of the required Managers is defined in an embedded codetable in module SYSMGRM and compiled in SYSWDR.tto and SYSMGR.tto.
Communication between the different nodes is exclusively via the tables SYN and SYM.
The communication between a watchdog process and the Managers started by that watchdog is via IPC (from watchdog to Managers) and by writing a status file in the file system by the Managers.
In addition, SYSWDR uses SSN entries and the Unix process list to determine information about the Managers running on the node.
If the communication from the watchdog to a Manager via IPC fails (i.e. the Manager no longer runs 'normally' and can no longer be controlled by the watchdog), the Manager is, if required, terminated via its process number using 'kill' or, failing that, 'kill -9'.
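The escalation from 'kill' to 'kill -9' can be sketched as follows, assuming POSIX signals; the function name and the grace period are illustrative, and the process number would come from the SSN entries / Unix process list as described above.

```python
import os
import signal
import time

def terminate_manager(pid: int, grace_seconds: float = 5.0) -> None:
    """Escalating termination of a Manager that no longer answers via IPC.

    First a regular 'kill' (SIGTERM); if the process still exists after
    the grace period, a 'kill -9' (SIGKILL).
    """
    try:
        os.kill(pid, signal.SIGTERM)          # polite request to exit
    except ProcessLookupError:
        return                                # already gone
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)                   # signal 0: existence check only
        except ProcessLookupError:
            return                            # exited within the grace period
        time.sleep(0.1)
    try:
        os.kill(pid, signal.SIGKILL)          # forced termination
    except ProcessLookupError:
        pass
```

Note that for a child of the calling process, the existence check keeps succeeding until the child is reaped; the sketch therefore matches the watchdog's situation, where the Managers are independent processes.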
At regular intervals, the watchdogs write the 'Key Figures' of their nodes (RAM, CPU load, swapping, time stamp, Lock flag, ServerID, etc.; tbd) as a 'Server Status Record' to the table SYK in the database.
The 'Key Figures' are determined by calling the shell script “getloadfigures”. This script implements the determination of the installation-specific key figures and outputs them via stdout. On each Load Balancing evaluation run, the watchdog calls this shell script and writes the result of each call to a record in the table SYK, e.g. for later analysis of the load pattern. The first of these key figures is regarded as the 'overall load' of the node and is checked against the configured threshold value of the load distribution algorithm. The node is considered overloaded exactly when the first key figure reaches or exceeds the threshold value.
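A sketch of how the watchdog might consume the script's output and apply the threshold rule. The script name and the threshold semantics are from the text; the assumed output format (one numeric figure per whitespace-separated token, the first being the overall load) is an assumption, since the real format is installation-specific.

```python
import subprocess

def read_key_figures(script="getloadfigures"):
    """Call the installation-specific shell script and parse its stdout.

    Assumes whitespace-separated numeric key figures, the first one
    being the 'overall load' of the node.
    """
    out = subprocess.run([script], capture_output=True,
                         text=True, check=True).stdout
    return [float(tok) for tok in out.split()]

def node_overloaded(key_figures, threshold):
    """Overloaded exactly when the first key figure reaches or
    exceeds the configured threshold value."""
    return key_figures[0] >= threshold
```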
Key figures (suggestion):
When SYSWDR is started, the following actions are carried out:
After the start-up phase, the following three actions run periodically and independently of each other in the watchdog process SYSWDR:
Configuration: configuration of the TSKLIST module (upper half of the screen)
Executes the topics marked as 'Periodic' in the 'Configuration' panel.
Only the node where the 'Primary' flag is set carries out all the checks (i.e. including those that evaluate the database). Nodes without 'Primary' only execute topics marked as relevant for all nodes via 'RegisterTopicAllNodes' (e.g. the search for dump files on the server file system).
Configuration of the operating window in the lower half of the screen
Executes a check of the operating time window and the operation of the Managers every 60 seconds. Locks the local system for interactive sessions outside the operating time window and removes the lock within the operating time window.
Stops all locally running Managers when the end of the operating time window is reached.
Within the operating time window, starts all Managers not identified as running on this or any other node.
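The 60-second operating-window check above reduces to a time-of-day predicate. The sketch below assumes the window is given as a start and end time of day and that a window crossing midnight wraps; the real configuration lives in the lower half of the SYSWDR screen and may be richer (e.g. per-weekday windows).

```python
from datetime import datetime, time as dtime

def within_operating_window(now: datetime, start: dtime, end: dtime) -> bool:
    """True iff `now` falls inside the operating time window.

    If start <= end the window lies within one day; otherwise it
    crosses midnight (e.g. 22:00 to 06:00).
    """
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end      # window crosses midnight
```

Outside the window the watchdog would lock the node for interactive sessions and stop the local Managers; inside it, remove the lock and start any missing Managers.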
The check to determine whether all Managers are available, or which Managers need to be started (GetMgrInfo), is carried out in step 1 if the topic 'Running Manager' is activated, and additionally in step 2, as follows:
Reading the SSN records in the database determines all active sessions (on any application server) that are controlled via IPC and belong to the Managers (from SYSMGRM).
A check is carried out to determine whether these Managers are still active by sending a status request via IPC and waiting for the .sta file that the Manager generates in reply. If a Manager is determined to be active by means of the status file, the corresponding SYM record is normally updated ('R'unning and VALDATTIM = now + 3 * period duration). Only if the requested status in the SYM record for this Manager is 'D'own is 'D'own sent to the Manager via IPC and the SYM record set to 'C'anceled.
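The SYM bookkeeping of that step can be sketched as follows. The dictionary stands in for the Manager's SYM record; the field names STATUS, REQSTA and VALDATTIM mimic those in the text but are assumptions, as is the returned action string (for illustration only).

```python
from datetime import datetime, timedelta

def update_sym(sym: dict, status_file_seen: bool,
               period: timedelta, now: datetime) -> str:
    """Sketch of SYM maintenance after a status request via IPC.

    `sym` models the Manager's SYM record; returns the action taken.
    """
    if not status_file_seen:
        return "no .sta file - Manager not confirmed active"
    if sym.get("REQSTA") == "D":           # requested status is 'D'own
        sym["STATUS"] = "C"                # 'C'anceled; 'D'own sent via IPC
        return "sent Down via IPC"
    sym["STATUS"] = "R"                    # 'R'unning
    sym["VALDATTIM"] = now + 3 * period    # valid for three period durations
    return "refreshed SYM record"
```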
Configuration: last line
Execution:
SYSIXL (Control Manager)
Locking/unlocking can optionally be carried out for one specific node or for several nodes (application servers). On the 'own' server, the lock files are written immediately; for all other application servers SYN.REQSTA is set, and the watchdog on the affected system then performs the lock there.
SYSMGR (Login Control) 'D'own for the watchdog (SYSWDR) can optionally be sent to one node or to all nodes. IPC is used on the 'own' server. For all other nodes SYN.REQSTA is set, which prompts the watchdog on the relevant system to stop the Manager.
All other actions in SYSMGR only concern the own node (accordingly, status and log files, for example, can only be displayed from a SYSMGR process running on the same node.)
How is a node deactivated?
“Do not start watchdog (outside the application)”
How is an individual Manager restarted?
“SYSMGR, button [Down]. Stops the Manager on the same node via IPC, on other nodes via SYM.REQSTA.”
How is the overall system stopped?
“(SYN.REQSTA.is("B") for locking interactive use and shutting down the Managers is implemented in SYSWDR, but cannot be set interactively.)”