Load Balancing

Watchdog Functions

While the system is operational, exactly one watchdog process runs on each application server (henceforth referred to as “node”). Technically, this is a process in which the TradeDesign Runtime System (e.g. td2soci) executes the transaction SYSWDR. This process can either be started at regular intervals by an operating system function (e.g. crontab) (operation with a daily maintenance window) or automatically when the node is started up (24/7 operation).

The watchdog process carries out the following tasks:

  1. Periodic checking of selected topics and, if applicable, providing information about an error condition
  2. Monitoring of the operating time window of the application, with start and stop of interactive processes and Managers
  3. Starting missing Managers on its own node
  4. Controlling the (other) background processes (Managers)
  5. Determining and registering the load condition of its own node
  6. Blocking or releasing its own node for interactive sessions in order to distribute the load
  7. Detecting watchdog processes on other nodes that are not running properly and resetting data that is no longer updated by those watchdogs for the relevant node
  8. Integrating the error condition from the periodic checks of all nodes into the overall status of the system (SYA.SEVFLG)
  9. Updating the list of all hosts in the configuration information for the clients (thanks to this mechanism, manual configuration or distribution of the host list is not required)

There is exactly one watchdog (i.e. one node) for which the 'Primary' flag (stored in SYN) is set. Only this watchdog maintains the database-related topics (reports about WFE, TRN, SMH, SPT…) and updates the overall error condition of the system in SYA with the result of the check run. Other reports (e.g. searching for dumps on the server's file system) are executed on each node.

Starting an Interactive Session

On the client side, load balancing is activated using the -u switch on the command line of the Unicode Windows Client (td2uclnt.exe, version 2.0.0.99 or higher).

If the client is started with the -u switch, the procedure is as follows:

Based on the host and port parameters, the following key is derived for the Windows Registry:

[HKEY_CURRENT_USER\Software\Surecomp DOS GmbH\TD2 UClient\<host>\<port>]

If an entry is found under this key, the list of possible servers is read from there; otherwise the server and port are taken from the command line (usually used to initialise the list).

The client randomly selects an entry from the list and attempts to connect to this host/port.

If a connection cannot be established successfully, the next attempt is started, using another randomly selected entry from the list.

Attempts to establish a connection can fail for the following reasons in particular:

  1. Host name cannot be resolved (DNS)
  2. Socket connection to the indicated port cannot be established (server is not running, or inetd is not configured accordingly)
  3. The connection is established, but a text containing 'Load Balancing' is received at the socket, and the connection is terminated again by the other party (td2srun).

If the list of hosts contains only one element (as is the case for the first connection established from this computer), the list of servers is extracted from the messages sent by td2srun, written to the Registry, and used immediately. This permits a first-time login even on a locked system.
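
The selection logic described above can be sketched as follows (a minimal, illustrative Python sketch: the registry key matches the one shown above, but the value name, the storage format of the list and the helper names are assumptions, not the actual behaviour of td2uclnt.exe):

    import random
    import socket
    import winreg

    def read_host_list(host: str, port: int) -> list[tuple[str, int]]:
        """Read the server list from the Registry; fall back to the command-line host/port."""
        key_path = rf"Software\Surecomp DOS GmbH\TD2 UClient\{host}\{port}"
        try:
            with winreg.OpenKey(winreg.HKEY_CURRENT_USER, key_path) as key:
                # Assumption: the list is stored as one "host:port;host:port;..." string.
                raw, _ = winreg.QueryValueEx(key, "LBINIT")
                return [(h, int(p)) for h, p in (e.split(":") for e in raw.split(";") if e)]
        except OSError:
            return [(host, port)]           # list not initialised yet

    def connect_with_load_balancing(host: str, port: int) -> socket.socket:
        """Try randomly chosen servers until one accepts the session."""
        candidates = read_host_list(host, port)
        random.shuffle(candidates)
        for cand_host, cand_port in candidates:
            try:
                sock = socket.create_connection((cand_host, cand_port), timeout=5)
            except OSError:
                continue                    # DNS failure or no listener: try the next entry
            sock.settimeout(2)
            try:
                greeting = sock.recv(1024)
            except socket.timeout:
                greeting = b""              # no rejection text: assume the node accepted
            if b"Load Balancing" in greeting:
                sock.close()                # node is locked via td2soff.txt: try another one
                continue
            sock.settimeout(None)
            return sock
        raise ConnectionError("no application server accepted the connection")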

Each time a session starts, the list of host names stored in SYSWDR.INI, section [LoadBalancing], entry 'LBINIT', is transmitted once from the server to the client by the application via SetContext("CLIENTPROPERTY", "LBINIT") and stored in the Registry.

The startup script bin/unix/td2srun is started by inetd in order to start the server process of the application. td2srun first checks whether the file ini/td2soff.txt exists and, if it does, whether it contains the node's own IP address. If the file exists and does not contain the own IP address, the 'Load Balancing' text is issued via the socket (td2srun's stdout) and the connection is terminated. If td2soff.txt does not exist, or if (as an exception) it does contain the own IP address, the runtime system is started, which first of all converts the socket connection into an SSL connection.
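
The check performed by td2srun can be pictured roughly like this (a Python sketch under the assumptions that td2soff.txt lists one IP address per line and that the rejection text contains 'Load Balancing'; paths and the runtime invocation are placeholders only):

    import os
    import socket
    import sys

    TD2SOFF = "ini/td2soff.txt"

    def node_is_locked() -> bool:
        """Locked = td2soff.txt exists and does not contain this node's own IP address."""
        if not os.path.exists(TD2SOFF):
            return False
        own_ip = socket.gethostbyname(socket.gethostname())
        with open(TD2SOFF) as f:
            listed = {line.strip() for line in f if line.strip()}
        return own_ip not in listed

    if node_is_locked():
        # The rejection text goes to stdout, which inetd has connected to the socket.
        print("Load Balancing: this application server is currently locked")
        sys.exit(0)
    else:
        # Hand over to the runtime system, which first converts the socket into an SSL
        # connection. Path and arguments are placeholders, not the real invocation.
        os.execv("bin/td2soci", ["td2soci"])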

Data model and communication mechanisms in the application server

The watchdog SYSWDR uses the following tables:

SYA System Status Record

This table stores the error condition of the overall system in the Severity Flag of the record with subtype 'S' (red or green notebook in Office). Only the node where PRIFLG is set provides the SYA record with subtype 'S'. All other nodes write SYA records with subtype 'N' for the relevant node.

The following tables are only used if 'Automatic Load Balancing' is configured in the watchdog:

SYN Nodes (one record per watchdog instance)

SYM Running Managers (one record per running Manager process)

SYK Server Key Figures (historicised load statistics)

Tables SYM + SYN + SYK are maintained exclusively by SYSWDR on the same node.

Exceptions:

The list of the required Managers is defined in an embedded codetable in module SYSMGRM and compiled in SYSWDR.tto and SYSMGR.tto.

Communication between the different nodes is exclusively via the tables SYN and SYM.

The communication between a watchdog process and the Managers started by that watchdog takes place via IPC (from the watchdog to the Managers) and via a status file written to the file system by the Managers (from the Managers to the watchdog).

In addition, SYSWDR uses SSN entries and the Unix process list to determine information about the Managers running on the node.
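
As an illustration only, the records involved can be pictured roughly as follows (PRIFLG, REQSTA, SEVFLG, VALDATTIM, the subtypes and the status letters are taken from this document; everything else, including the exact field names, is hypothetical):

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class SYA:                     # system status record
        subtype: str               # 'S' = overall system, 'N' = one record per node
        sevflg: str                # severity flag (error condition)

    @dataclass
    class SYN:                     # one record per watchdog instance (node)
        hostname: str
        priflg: bool               # 'Primary' flag: this watchdog maintains the DB topics
        reqsta: str                # requested state, set by other nodes (e.g. lock request)

    @dataclass
    class SYM:                     # one record per running Manager process
        manager: str               # Manager name as defined in SYSMGRM
        hostname: str
        status: str                # 'R'unning / 'C'anceled
        reqsta: str                # requested status, e.g. 'D'own
        valdattim: datetime        # valid until now + 3 * period duration

    @dataclass
    class SYK:                     # one server status record per evaluation run
        hostname: str
        timestamp: datetime
        key_figures: list[float]   # first entry = overall load, compared with the threshold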

If the communication from the watchdog to a Manager via IPC fails (and the Manager therefore no longer runs 'normally' and can no longer be controlled by the watchdog), the Manager is, if required, terminated via its process number using 'kill' or 'kill -9'.
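
That escalation follows the usual SIGTERM/SIGKILL pattern; a minimal sketch, assuming the Manager's process number is already known (the grace period is an arbitrary example value):

    import os
    import signal
    import time

    def terminate_manager(pid: int, grace_seconds: float = 10.0) -> None:
        """Try a normal 'kill' first; escalate to 'kill -9' if the process survives."""
        try:
            os.kill(pid, signal.SIGTERM)       # equivalent of 'kill <pid>'
        except ProcessLookupError:
            return                             # process already gone
        deadline = time.time() + grace_seconds
        while time.time() < deadline:
            try:
                os.kill(pid, 0)                # signal 0 only checks existence
            except ProcessLookupError:
                return                         # terminated cleanly
            time.sleep(0.5)
        try:
            os.kill(pid, signal.SIGKILL)       # equivalent of 'kill -9 <pid>'
        except ProcessLookupError:
            pass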

Load Determination

At regular intervals, the watchdogs write the 'Key Figures' for their nodes (RAM, CPU load, swapping, time stamp, lock flag, ServerID, etc.; tbd) as a 'Server Status Record' to the table SYK in the database.

The 'Key Figures' are determined by calling the shell script “getloadfigures”. This script implements the determination of the installation-specific key figures and outputs them via stdout. Each time the load balancing evaluation is run, the watchdog calls this shell script and writes the result of each call to a record in the table SYK, e.g. for later analysis of the load pattern. The first of these key figures is considered the 'overall load' of the node and is checked against the configured threshold value of the load distribution algorithm. The node is considered overloaded precisely when the first key figure reaches or exceeds the threshold value.
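
A hedged sketch of this evaluation (the output format of getloadfigures is installation-specific; here it is assumed to be numeric key figures separated by whitespace on stdout, and writing the SYK record is only noted as a comment):

    import subprocess

    def read_key_figures(script: str = "getloadfigures") -> list[float]:
        """Run the installation-specific script and parse its stdout into key figures."""
        out = subprocess.run([script], capture_output=True, text=True, check=True).stdout
        return [float(token) for token in out.split()]

    def node_is_overloaded(threshold: float) -> bool:
        """The node counts as overloaded when the first key figure reaches the threshold."""
        figures = read_key_figures()
        # In the real watchdog, the complete list is also written to a SYK record here.
        return figures[0] >= threshold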

Key figures (suggestion):

Watchdog processes (SYSWDR) in detail

When SYSWDR is started, the following actions are carried out:

After the start-up phase, the following three actions run periodically and independently of each other in the watchdog process SYSWDR:

1. Periodic check

Configuration: in the TSKLIST module (upper half of the screen)

Executes the topics marked as 'Periodic' in the 'Configuration' panel.

Only the node where the 'Primary' flag is set carries out all the checks (i.e. including those that evaluate the database). Nodes without 'Primary' only execute topics marked as relevant for all nodes via 'RegisterTopicAllNodes' (e.g. the search for dump files on the server's file system).
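
This split can be sketched as follows (the topic names and the registration structure are purely illustrative; only the primary/all-nodes rule is taken from the text):

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Topic:
        name: str
        check: Callable[[], bool]      # True = OK, False = error condition
        all_nodes: bool = False        # registered via 'RegisterTopicAllNodes'

    def run_periodic_checks(topics: list[Topic], primary: bool) -> dict[str, bool]:
        """The primary node runs every topic; other nodes only run the all-nodes topics."""
        return {t.name: t.check() for t in topics if primary or t.all_nodes}

    # Example: the dump-file search runs on every node, a DB-related topic only on the primary.
    topics = [
        Topic("Dump files on server file system", lambda: True, all_nodes=True),
        Topic("WFE backlog", lambda: True),    # hypothetical database-related topic
    ]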

2. Control of interactive and background processes

Configuration: operating time window (lower half of the screen)

Executes a check of the operating time window and the operation of the Managers every 60 seconds. Locks the local system for interactive sessions outside the operating time window and removes the lock within the operating time window.

Stops all locally running Managers when the end of the operating time window is reached.

Within the operating time window, starts all Managers not identified as running on this or any other node.
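
A simplified sketch of this 60-second cycle, assuming hypothetical helpers for the lock handling and the Manager start/stop actions described above (the window times are arbitrary example values):

    from datetime import datetime, time as dtime

    WINDOW_START = dtime(6, 0)         # illustrative operating time window
    WINDOW_END = dtime(22, 0)

    # Hypothetical helpers standing in for the actions described above.
    def lock_interactive_sessions() -> None: ...
    def unlock_interactive_sessions() -> None: ...
    def stop_local_managers() -> None: ...
    def start_missing_managers() -> None: ...   # only Managers not running on any node

    def operating_window_cycle(now: datetime) -> None:
        """Executed every 60 seconds by the watchdog."""
        if WINDOW_START <= now.time() < WINDOW_END:
            unlock_interactive_sessions()
            start_missing_managers()
        else:
            lock_interactive_sessions()
            stop_local_managers()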

The check to determine whether all Managers are available, or which Managers need to be started (GetMgrInfo), is carried out in step 1 if the topic 'Running Manager' is activated, and additionally in step 2, as follows:

Reading the SSN records in the database determines all active sessions that are controlled via IPC (on any application server) and belong to the Managers (from SYSMGRM).

A check is carried out to determine whether these Managers are still active by sending a status request via IPC and waiting for the .sta file generated by the Manager in reply. If a Manager is determined to be active by means of the status file, the corresponding SYM record is normally updated (status 'R'unning and VALDATTIM set to now + 3 * period duration). Only if the requested status in the SYM record for this Manager is 'D'own is 'D'own sent to the Manager via IPC and the SYM record set to 'C'anceled.
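
A sketch of this status check, assuming hypothetical wrappers around the real IPC and .sta file mechanisms and an assumed check period:

    from datetime import datetime, timedelta

    PERIOD = timedelta(seconds=60)     # assumed period duration

    # Hypothetical wrappers around the real mechanisms.
    def send_ipc(manager: str, command: str) -> None: ...
    def sta_file_received(manager: str) -> bool: ...

    def check_manager(manager: str, sym_record: dict) -> None:
        """Refresh or shut down one Manager based on its .sta reply and its SYM record."""
        send_ipc(manager, "STATUS")
        if not sta_file_received(manager):
            return                                       # not active: handled elsewhere
        if sym_record.get("REQSTA") == "D":
            send_ipc(manager, "DOWN")                    # shutdown was requested
            sym_record["STATUS"] = "C"                   # 'C'anceled
        else:
            sym_record["STATUS"] = "R"                   # 'R'unning
            sym_record["VALDATTIM"] = datetime.now() + 3 * PERIOD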

3. Load Balancing Evaluation

Configuration: last line

Execution:

Interactive Control

SYSIXL (Control Manager)

Locking/unlocking can optionally be carried out for one specific node or for several nodes (application servers). On the 'own' server, the lock files are written immediately; for all other application servers, SYN.REQSTA is set and the watchdog on the affected system then performs the lock there.

SYSMGR (Login Control): 'D'own for the watchdog (SYSWDR) can optionally be sent to one node or to all nodes. On the 'own' server, IPC is used; for all other nodes, SYN.REQSTA is set, which prompts the watchdog on the relevant system to stop the Manager.

All other actions in SYSMGR only concern the own node (accordingly, status and log files, for example, can only be displayed from a SYSMGR process running on the same node).
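
The common pattern behind SYSIXL and SYSMGR (act directly on the own node, otherwise delegate to the remote watchdog via SYN.REQSTA) can be sketched as follows; the helper names and the node name are hypothetical, and the concrete REQSTA values are not spelled out in this section:

    # Hypothetical helpers around the mechanisms described above. The string passed to
    # request_via_reqsta only stands for the concrete SYN.REQSTA value.
    def write_local_lock_files() -> None: ...            # SYSIXL on the own node
    def send_local_ipc_down(manager: str) -> None: ...   # SYSMGR on the own node
    def request_via_reqsta(node: str, request: str) -> None: ...

    OWN_NODE = "app1"                                    # illustrative node name

    def lock_node(node: str) -> None:
        """SYSIXL: lock one application server for interactive sessions."""
        if node == OWN_NODE:
            write_local_lock_files()                     # takes effect immediately
        else:
            request_via_reqsta(node, "lock")             # the remote watchdog performs the lock

    def stop_watchdog(node: str) -> None:
        """SYSMGR: send 'D'own to the watchdog (SYSWDR) on one node."""
        if node == OWN_NODE:
            send_local_ipc_down("SYSWDR")
        else:
            request_via_reqsta(node, "down")             # the remote watchdog stops itself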

FAQ

How is a node deactivated?

“Do not start the watchdog (outside the application).”

How is an individual Manager restarted?

“SYSMGR, Button [Down]. Stops the Manager on the same node via IPC, on other nodes via SYM\REQSTA. Within the operating time window, the watchdog then starts the Manager again automatically.”

How is the overall system stopped?

  1. “SYSIXL, Lock for <all> application servers.
    Prompts the watchdog on all nodes via SYN\REQSTA to stop the system.”
  2. “SYSMGR, Manager SYSWDR, <all> Application Servers, Button [Down]. Stops the watchdog on the same node via IPC, on other nodes via SYN\REQSTA. A watchdog receiving a 'Down' stops all the Managers it controls.”

(SYN\REQSTA.is("B") for locking interactive use and shutting down the Managers is implemented in SYSWDR, but cannot be set interactively.)