Installing Fault Tolerance
The basic principle of fault tolerance is to keep your production schedule running continuously despite machine failures. If the machine managing your production schedule fails, fault tolerance ensures that another machine is available to assume control over the production schedule.
Note: Fault tolerance does not protect against database failures. This is best left to your database administrator, who can set up data mirroring based on the database type.
Overview of Fault Tolerance
This section describes fault tolerance components, configuration, operational modes, and network configuration.
Fault Tolerance Components
Fault tolerance consists of these main components:
- Client Manager – The Client Manager services requests from user-initiated activities, such as those made through the TA Web Client.
- Primary Master – The Primary Master controls production scheduling during normal system operations.
- Backup Master – The Backup Master operates in standby mode until it takes over for the Primary Master. In case of a failover, the Backup Master becomes active and clients reconnect to it.
- Fault Monitor – The Fault Monitor continuously monitors the status of the Primary and Backup Masters. It initiates the transfer of scheduling control from the Primary Master to the Backup Master. The TA Web Client provides an interface to the Fault Monitor service.
The Primary Master and the Backup Master are designed to communicate with a database. The database administrator is responsible for setting up and maintaining the database. TA does not offer fault tolerance for the database.
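This guide does not describe how the Fault Monitor decides that a Master has failed. As an illustration only, the sketch below assumes a heartbeat timeout scheme: each Master reports in periodically, and a Master that stays silent longer than a timeout is treated as down. The names HeartbeatTracker and should_fail_over, and the timeout value, are hypothetical and are not part of TA.

```python
import time


class HeartbeatTracker:
    """Records the last time each Master was heard from (illustrative only)."""

    def __init__(self, timeout_s=10.0, clock=time.monotonic):
        self.timeout_s = timeout_s  # assumed value, not a TA setting
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = {}

    def record(self, master):
        """Note that a heartbeat from this Master has just arrived."""
        self.last_seen[master] = self.clock()

    def is_alive(self, master):
        """A Master is alive if it reported within the timeout window."""
        seen = self.last_seen.get(master)
        return seen is not None and self.clock() - seen <= self.timeout_s


def should_fail_over(tracker, active, standby):
    """Transfer control only when the active Master is silent and the
    standby Master is still reporting."""
    return not tracker.is_alive(active) and tracker.is_alive(standby)
```

Passing the clock in as a parameter keeps the timeout logic deterministic under test. A real monitor would also need to handle network partitions, which this sketch ignores.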
Fault Tolerance Configuration
Fault tolerance has two configuration modes: auto mode and fixed mode.
Auto Mode
Auto mode is the default fault tolerance configuration. If the Primary Master fails, the Backup Master assumes the active role and takes control of the production schedule. When the failed Primary Master comes back online, it remains in standby mode. In auto mode, it does not matter whether the original Primary Master or the configured Backup Master is controlling the production schedule: regardless of the original configuration, the Masters are interchangeable and each can operate in either active or standby mode. See also Fault Tolerance Operational Modes.
Fixed Mode
In fixed mode, if the Primary Master fails, the Backup Master assumes control just as in auto mode. However, the Backup Master then continues the production schedule until control is manually switched back to the Primary Master. While the Backup Master controls the production schedule, fault tolerance is disabled because the Backup Master does not itself have a backup. Fault tolerance is enabled again when the Primary Master resumes control.
During a failover, the green light beside the Fault Monitor name (located in the first column of the Connections pane) turns red. The red light indicates that fault tolerance is not operating.
The status lights warn users that without Master redundancy, the network is vulnerable to failure. Returning the Primary Master to service and restoring your system to a normal fault-tolerant status should be the highest priority. Use the switchback procedure to return the Primary Master to service. See Primary Master Switchback.
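The difference between auto mode and fixed mode can be sketched as a small state model. This is a toy illustration of the behavior described in this section, not TA code: the class FaultToleranceModel and its methods are hypothetical names. The model also assumes that no automatic failover happens while fault tolerance is disabled.

```python
from dataclasses import dataclass, field

PRIMARY, BACKUP = "primary", "backup"


@dataclass
class FaultToleranceModel:
    """Toy model of the two configuration modes (illustrative only)."""

    config_mode: str = "auto"  # "auto" or "fixed"
    active: str = PRIMARY      # Master currently controlling the schedule
    up: dict = field(default_factory=lambda: {PRIMARY: True, BACKUP: True})

    def fault_tolerant(self) -> bool:
        """True when a standby Master is available to take over."""
        if not all(self.up.values()):
            return False
        if self.config_mode == "fixed":
            # Fixed mode: disabled while the Backup Master is in control.
            return self.active == PRIMARY
        return True

    def fail(self, master: str) -> None:
        """Mark a Master as down and fail over if that is possible."""
        could_fail_over = self.fault_tolerant()
        self.up[master] = False
        # Assumption: failover happens only if fault tolerance was enabled.
        if master == self.active and could_fail_over:
            self.active = BACKUP if master == PRIMARY else PRIMARY

    def recover(self, master: str) -> None:
        """A recovered Master comes back online in standby."""
        self.up[master] = True

    def switch_back(self) -> None:
        """Manually return control to the Primary Master."""
        if self.up[PRIMARY]:
            self.active = PRIMARY
```

In this model, a recovered Primary Master restores fault tolerance immediately in auto mode. In fixed mode, fault tolerance returns only after switch_back() hands control back to the Primary Master.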
In the Unix installation procedure, after you provide a directory location for the installation files, a screen asks whether you want to install a Primary Master or a Backup Master. Install the Primary Master first. Complete the Primary Master installation and then repeat the Master installation on a different machine, selecting the Backup Master option for the second installation.
Note: For more information on installing the Primary and Backup Masters for Unix, refer to Installing the TA Master for Unix.
Fault Tolerance Operational Modes
Whenever the Primary Master is running while the Backup Master remains available to assume control, the system is in standby mode. If the Primary Master is unable to run, control of the production schedule passes to the Backup Master, ensuring uninterrupted production. Whenever the Backup Master assumes control from the Primary Master, the system is in backup mode.
Normal (Standby) Mode
The diagram below shows normal operation, or standby mode. The Backup Master remains in the background until required, while maintaining constant communication with the Primary Master and the Fault Monitor.
Backup Mode
The diagram shows fault tolerance operation when the Primary Master goes down (backup mode). The Backup Master becomes active, assuming control of the production schedule while the Primary Master is out of service. Both figures show only the main components of fault tolerance.
Fault Tolerance Network Configuration
For fault tolerance to operate properly, the physical network connections between the components must be configured for reliable communication. Dedicated TCP/IP communication ports, configured during installation, are used to exchange messages between components and to verify whether the connections are up or down. The diagram shows the network connections and the communication ports.
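Before or after installation, it can be useful to confirm that each component's port is reachable from the other machines. The sketch below is a generic TCP reachability check built on the Python standard library. The host names and port numbers in ENDPOINTS are placeholders, not TA defaults; substitute the values chosen during your installation.

```python
import socket


def port_is_reachable(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


# Placeholder endpoints: replace with the hosts and ports of your own setup.
ENDPOINTS = {
    "Primary Master": ("primary.example.com", 50001),
    "Backup Master": ("backup.example.com", 50002),
    "Fault Monitor": ("monitor.example.com", 50003),
}


def check_endpoints(endpoints, timeout_s=3.0):
    """Map each component name to whether its port accepted a connection."""
    return {
        name: port_is_reachable(host, port, timeout_s)
        for name, (host, port) in endpoints.items()
    }
```

Calling check_endpoints(ENDPOINTS) from each machine in turn gives a quick picture of which connections are up. A successful connection shows only that the port is open, not that the service behind it is healthy.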