Continuity of Business In an MVS Environment
By: Mitchell H. Levine, CISA
Audit Serve, Inc.
The explosion at the World Trade Center on February 26, 1993 has once again reminded us of the need to have an offsite processing facility along with a proven plan which effectively restores the system within the desired business timeframes. However, events like these also necessitate the re-evaluation of contingency plans to ensure that all potential issues are covered. This is the basis of this article.
There are many options available for offsite processing which include:
hot sites provided by disaster recovery vendors (e.g., SunGard, Comdisco, and IBM) which offer offsite processing facilities on a first-come, first-served basis
a dedicated site, whereby a company has an alternate data center which contains a duplicate set of hardware which is not used for any daily business purposes
a cold site, which consists of an offsite facility containing an empty room that has a raised floor and air conditioning unit. In this situation, arrangements are made with hardware vendors to supply equipment within a specific timeframe upon the declaration of a disaster
divided operations, which consist of one processing center operation being divided into two separate data centers
The main focus of this article is the planning and execution of an offsite contingency plan which uses a disaster recovery vendor's hot site. This article is not intended to provide a checklist of required components of a contingency plan. It will provide a thorough understanding of what actually occurs at an offsite processing center and the steps that a company must perform to ensure a successful restoration of operations at a disaster recovery vendor's hot site.
Offsite Processing Environment Provided by Disaster Recovery Vendors
A disaster recovery vendor's purpose is to provide offsite processing services to facilitate the transfer of a company's operation from its home site to the hot site (i.e., offsite processing center) in the most time-efficient manner. At the same time, the vendor must also obtain maximum utilization of the equipment by servicing as many companies as possible at the same time within the same system.
To achieve the latter, disaster recovery vendors utilize the PR/SM (Processor Resource/Systems Manager) architecture, which allows LPARs (i.e., logical partitions) to be defined. This enables multiple companies to operate concurrently on one machine, while ensuring that each company is isolated from the others.
The other method used to allow multiple companies to operate within the same system is to use the VM operating system to define multiple MVS guest operating systems.
Disaster recovery vendors have made improvements in the last year in terms of providing facilities which reduce the time required to restore one's system at the hot site. One of these improvements was providing companies with a floor system which is IPLed with the minimum system software required to allow companies to restore their environment.
The floor systems typically include MVS, JES, TSO, and SDSF and are configured for every piece of equipment contained within the environment. However, the floor system is only a generic environment and cannot be used as the system configuration for a company's daily processing environment. Therefore, once the company's environment has been restored, the system is re-IPLed with the company's own MVS system, which is tailored for its processing requirements.
Prior to the availability of floor systems, the disaster recovery vendors provided a chunk of hardware with no operating system loaded. This required each company to bring its own system software and perform an initial IPL from a one pack MVS system in order to perform its restoration. The one pack MVS system is referred to as a rescue pack, which consists of one or two tape volumes containing the system datasets (e.g., PARMLIB, master catalog, JES spool and checkpoint datasets, and page datasets) required to perform an IPL, and a copy of FDR or DFDSS, which is used to perform the restoration. Since the system environment defined within the rescue pack does not reflect the configuration required to run the normal system, the system will need to be re-IPLed once it has been completely restored. It should be noted that rescue packs are also used for onsite contingency planning in the event of a failure of the System Resident Volume (SYSRES).
The use of the floor system saves a significant amount of time (i.e., approximately one hour) that would otherwise be required to load the tapes and perform a standalone IPL. It should also be noted that not all disaster recovery vendors provide floor systems.
Offsite Facility Compatibility
The probability that a disaster recovery vendor will provide an offsite facility that matches the exact equipment requirements of its customers is remote. Each company should analyze these differences to determine the impact on its offsite contingency strategy. A difference in the processors between a company's home site and a disaster recovery vendor's offsite processing facility would not be a major compatibility issue as long as the company's memory and channel path (CHPID) requirements are met.
However, differences in DASD can be a major problem if the offsite processing center's DASD uses different device types or if its DASD has a lower density than that of the company's home site. For instance, it would not be possible to restore a full volume backup of a triple density DASD volume at the company's home site to a single density volume at the hot site.
In addition, the hot site should have the amount of DASD that the company requires to perform its production processing. This includes work packs (i.e., packs used for sort work space and temporary datasets), since the workspace requirements of the batch processing cycle remain the same at the hot site. The DASD used to support software development would not need to be available at the hot site, since software development is not usually required by an offsite contingency plan.
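The DASD checks described in this section can be sketched as a short script. This is only an illustration under stated assumptions: the Volume record, the "purpose" labels, and the cylinder counts in the usage note are hypothetical, and a real review would take these values from each site's configuration listings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Volume:
    volser: str        # volume serial (hypothetical example values)
    device_type: str   # device type as a plain string, e.g. "3380"
    cylinders: int     # capacity; a higher density pack has more cylinders
    purpose: str       # "production", "work" or "development" (assumed labels)


def can_restore_full_volume(source: Volume, target: Volume) -> bool:
    """A full volume backup needs the same device type on the target
    and at least as much capacity as the source volume had."""
    return (source.device_type == target.device_type
            and target.cylinders >= source.cylinders)


def required_cylinders(home_volumes) -> int:
    """Capacity the hot site must supply: production and work packs,
    but not packs used only for software development."""
    return sum(v.cylinders for v in home_volumes
               if v.purpose in ("production", "work"))
```

For example, with a home site volume of 2655 cylinders and a hot site volume of 885 cylinders of the same device type, can_restore_full_volume returns False, which corresponds to the triple density to single density case described above.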
Planning for a disaster, which necessitates a transfer of operation from a company's home site to the hot site (i.e., offsite processing center), does not culminate with the development of a contingency plan. Components used in the restoration process are constantly changing. This section of the article will discuss such preparation requirements.
The success of the restoration operation at an offsite processing center is predicated on a complete backup, with integrity, of the home site's production environment. No matter how comprehensive the contingency plan or how many times it is tested, there is no method available for creating data for a missing backup tape or a corrupted backup. There are several approaches used to perform a system backup. The first method is a full volume backup, which requires a number of magnetic tapes (e.g., a backup of a triple density volume requires up to six tapes depending on the type of compression used by the controller).
Most installations do not perform full volume backups on a daily basis, either because their environments are not constantly changing or because of the amount of time required to perform a full backup. Therefore, incremental backups are performed on those days on which a full volume backup is not performed. Installations typically use products such as FDR or HSM, whose catalog identifies the datasets that have changed and therefore require a backup.
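One consequence of mixing full and incremental backups is that a restore needs more than one night's tapes. The sketch below assumes a simple scheme in which incrementals are applied on top of the most recent full volume backup; products such as FDR or HSM track this in their own catalogs, so the Backup record and restore_chain function here are hypothetical illustrations only.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Backup:
    volser: str      # DASD volume that was backed up
    taken_on: date   # date the backup ran
    kind: str        # "full" or "incremental" (assumed labels)
    tapes: tuple     # tape volsers holding this backup


def restore_chain(backups, volser):
    """Return the backups needed to rebuild one volume, in restore order:
    the latest full backup, followed by each later incremental."""
    mine = sorted((b for b in backups if b.volser == volser),
                  key=lambda b: b.taken_on)
    fulls = [b for b in mine if b.kind == "full"]
    if not fulls:
        # Nothing can recreate a missing full backup, so fail loudly.
        raise ValueError(f"no full volume backup on record for {volser}")
    last_full = fulls[-1]
    later = [b for b in mine
             if b.kind == "incremental" and b.taken_on > last_full.taken_on]
    return [last_full] + later
```

A missing tape anywhere in the returned chain means the volume cannot be brought fully up to date, which is why the tapes for the whole chain have to be available offsite.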
Both methods described are backups initiated by the operations area, which is not familiar with the activity that occurs within the applications themselves. This point is critical since files will be corrupted if a backup occurs while the files are open. This control issue applies to all of the backup strategies mentioned. It is the reason that the most effective backup strategy for application systems is to have the application development staff (i.e., those who know which files are open at specific times) define the backup process. This will ensure that the backups performed have integrity.
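The knowledge held by the application development staff can be expressed as a table of the times at which each file is open, against which a proposed backup window is checked. The sketch below is a hypothetical illustration: the dataset names are invented, and it assumes that no window crosses midnight.

```python
from datetime import time


def overlaps(start_a, end_a, start_b, end_b):
    """True when two same-day time ranges share any moment."""
    return start_a < end_b and start_b < end_a


def files_open_during(backup_start, backup_end, open_windows):
    """List the files that would be open at some point during the backup.

    open_windows maps a dataset name to the (opened, closed) times during
    which the application holds it open, as supplied by the development
    staff. Windows are assumed not to cross midnight.
    """
    return sorted(name for name, (opened, closed) in open_windows.items()
                  if overlaps(backup_start, backup_end, opened, closed))


# Hypothetical schedule supplied by the application development staff.
windows = {
    "PAYROLL.MASTER": (time(8, 0), time(18, 0)),
    "GL.LEDGER": (time(20, 0), time(23, 0)),
}
conflicts = files_open_during(time(19, 0), time(21, 0), windows)
```

A backup window is acceptable only when the returned list is empty.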
In the near future, IBM will release a new method for performing backups which provides integrity through the use of concurrent copying. IBM has placed into the control unit the ability to take a backup of a file while it is open and still maintain integrity.
Since the hardware used at the hot site is different from that at the home site, the disaster recovery vendor will send each of its customers a copy of the MVSCP GEN that is running at the hot site as part of an ongoing maintenance schedule. The MVSCP GEN defines all of an installation's devices, such as the console addresses that you IPL from, DASD, printers, terminals, and controllers.
Companies will review the MVSCP GEN to determine the addresses used by the hot site. If the hot site's hardware addresses are not contained in the company's MVSCP GEN, JCL errors will occur since the system will not recognize the unit names (e.g., tape drive) that are coded in the installation's jobs. Therefore, installations must blend the hot site's devices into the GEN under their own unit names, since the unit names that are hardcoded into a company's JCL cannot be changed. This is accomplished using the Eligible Device Table (EDT) within the MVSCP GEN, which allows an installation's unit names to reference multiple addresses, including hardware addresses that are defined at the home site and at the hot site. The EDT maps the unit names used by one's installation to the equipment addresses which the disaster recovery vendor has supplied.
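The role of the EDT can be pictured as a lookup from unit name to device addresses. The sketch below is not MVSCP syntax; it is a hypothetical model, with invented unit names and addresses, of how one table can cover both sites and of the check a company might make before a test to find unit names that would cause JCL errors.

```python
def build_edt(home_units, hot_site_units):
    """Merge two {unit name: set of device addresses} maps so that each
    unit name covers the devices at both the home site and the hot site."""
    edt = {}
    for site in (home_units, hot_site_units):
        for unit_name, addresses in site.items():
            edt.setdefault(unit_name, set()).update(addresses)
    return edt


def unresolved_units(jcl_unit_names, edt, available_addresses):
    """Unit names coded in JCL that map to no device present at the site.
    Jobs using these names would fail with JCL errors."""
    return sorted(name for name in jcl_unit_names
                  if not (edt.get(name, set()) & available_addresses))


# Hypothetical unit names and addresses.
home = {"TAPE": {"0480", "0481"}, "SYSDA": {"0150"}}
hot = {"TAPE": {"0580"}, "SYSDA": {"0A80", "0A81"}}
edt = build_edt(home, hot)
missing = unresolved_units(["TAPE", "SYSDA"], edt, {"0580", "0A80", "0A81"})
```

Here missing is empty because every unit name coded in the jobs resolves to at least one address supplied by the vendor.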
When restoring to a hot site which has either a floor system or an installation standalone IPL system, the MVSCP GEN would need to be rerun to define (i.e., GEN) the addresses of the hot site along with the EDT.
As previously mentioned, in order to run your installation's system in a different location without changing the operating environment, provisions must be made to allow the system to address the hardware located at the hot site. This includes the addressing of DASD. For example, catalogs are used to locate the DASD volumes on which datasets reside and to identify the volumes on which APF libraries are stored. Since the DASD at the hot site differs from the DASD at the home site, installations must alter the volume ID record of each DASD volume used at the hot site to the volume ID (i.e., volume serial) used by the installation at its home site. This technique is referred to as "clipping the pack". Typically, installations have JCL prepared which performs this function.
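Because many packs must be clipped, the control statements are usually generated rather than typed. The sketch below builds one statement per pack in the style of the ICKDSF REFORMAT command. The keyword layout shown (UNITADDRESS, VERIFY, VOLID) is an assumption that should be verified against the ICKDSF documentation for the installed release, and the addresses and volume serials are hypothetical.

```python
def clip_statements(assignments):
    """Build one relabel statement per hot site pack.

    assignments is a list of (unit_address, current_volser, home_volser)
    tuples: the vendor's device address, the volume serial the pack
    carries now, and the serial expected by the home site's catalogs.
    The statement format is an assumed ICKDSF REFORMAT layout.
    """
    lines = []
    for unit, current, wanted in assignments:
        for volser in (current, wanted):
            # Volume serials are one to six characters long.
            if not 1 <= len(volser) <= 6:
                raise ValueError(f"invalid volume serial: {volser!r}")
        lines.append(
            f" REFORMAT UNITADDRESS({unit}) VERIFY({current}) VOLID({wanted})")
    return lines


# Hypothetical mapping of vendor packs to home site volume serials.
statements = clip_statements([
    ("0A80", "HOT001", "PROD01"),
    ("0A81", "HOT002", "PROD02"),
])
```

Verifying the current serial before relabeling guards against clipping the wrong pack when the vendor's addresses change between tests.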
The preparation step which requires the most constant change is the JCL used to restore the system. The JCL which calls for the backup tapes changes each day, since the tape volsers for each night's backup change. Most installations have a product or have devised their own automated process for creating restore jobs.
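An automated process of this kind amounts to filling a job template with the current night's tape volsers. The sketch below is one hypothetical way to do so. The job card, dataset name, and DD names are invented, and the DFDSS (ADRDSSU) RESTORE statement is shown as an assumption to be checked against the installation's own restore procedures; an FDR shop would use a different template.

```python
def restore_job(job_name, dasd_volser, dasd_unit, backup_dsn, tape_volsers):
    """Return restore JCL for one volume as a single string.

    The template is illustrative only; accounting data, classes and
    DD names are assumptions, not any installation's standards.
    """
    if not tape_volsers:
        raise ValueError("at least one backup tape is required")
    vols = ",".join(tape_volsers)
    lines = [
        f"//{job_name} JOB (ACCT),'RESTORE {dasd_volser}',CLASS=A,MSGCLASS=X",
        "//RESTORE  EXEC PGM=ADRDSSU",
        "//SYSPRINT DD SYSOUT=*",
        f"//TAPE     DD DSN={backup_dsn},DISP=OLD,",
        f"// UNIT=TAPE,VOL=SER=({vols})",
        f"//DASD     DD UNIT={dasd_unit},VOL=SER={dasd_volser},DISP=OLD",
        "//SYSIN    DD *",
        "  RESTORE FULL INDDNAME(TAPE) OUTDDNAME(DASD) PURGE",
        "/*",
    ]
    for line in lines:
        # JCL statements must fit within columns 1 to 71.
        if len(line) > 71:
            raise ValueError(f"JCL line too long, needs continuation: {line}")
    return "\n".join(lines)


# Hypothetical values for one night's backup of one volume.
jcl = restore_job("RESTPROD", "PROD01", "3380",
                  "BACKUP.PROD01.FULL", ["T00001", "T00002"])
```

Regenerating the job after each night's backup, and shipping it with the tapes, keeps the restore JCL in step with the tape volsers.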
Other Preparation Requirements
The following items should also be considered when making preparations for restoring a system at a hot site:
special JES definitions (JES2PARM) required which reference EP lines and printers that are used by the hot site
special CONSOLxx PARMLIB member required to define the console addresses used by the hot site
special NCP GEN required for the contingency site since the telecommunication lines are mapped differently at the hot site
Other Offsite Contingency Planning Considerations
When developing a contingency strategy for restoring one's system at an offsite processing center, a tape management system is the best organizational tool to ensure that the proper tapes are sent and recalled to and from the offsite media storage facility.
When reviewing the process used for sending tapes to the offsite storage facility, provision should be made to ensure that the JCL and other support items required to restore a system at the hot site are also shipped to the offsite storage facility.
All good recovery plans include a plan for restoring the system back to one's home site when the disaster is over. This critical step is overlooked by most contingency plans. Therefore, careful analysis is required to ensure that the methods used to operate your system at the hot site are compatible with the operating requirements at the home site. For example, if your installation is restoring its system to hot site DASD which contains more space than the DASD at your home site, then the additional space will be used by the installation as datasets expand in size based on normal daily processing. When it is time to return to the home site, a full volume backup of the DASD at the hot site is performed. However, the DASD at the home site will not have the space available to perform the restore. The solution for such a situation would be to allocate a dummy dataset on the DASD at the hot site which fills the space created by the difference in the two sites' DASD capacity.
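The size of the dummy dataset is simply the difference between the two capacities. The sketch below shows the arithmetic, using cylinder counts that are hypothetical examples; the function name is invented for illustration.

```python
def filler_cylinders(hot_site_cylinders, home_site_cylinders):
    """Cylinders the dummy dataset must occupy on the hot site volume so
    that usable space never exceeds what the home site volume can hold."""
    return max(0, hot_site_cylinders - home_site_cylinders)


# A hypothetical 2655-cylinder hot site pack standing in for a
# 1770-cylinder home site pack needs an 885-cylinder dummy dataset.
filler = filler_cylinders(2655, 1770)
```

When the hot site volume is no larger than the home site volume the result is zero and no dummy dataset is needed.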
It should be noted that when a restore is performed at the hot site, the disaster recovery vendor's floor system has security installed, but it is bypassed. Security will not be in effect until the installation re-IPLs the system with its own configuration, which occurs after all of the restores have been performed.
Many third party vendor system software products used by an installation have controls which prevent the software from being used on a different CPU. Most of these vendors provide a facility to allow their software to be temporarily used on a different CPU by providing a vendor zap. Typically these zaps will only function for a set period of time. Vendors do not provide these facilities in advance except for the purpose of performing a contingency test. Therefore, steps to contact the vendor and apply the zap should be included in the contingency plan.
All contingency plans should contain the exact steps needed to restore the system. The most effective contingency plan identifies the steps which can be performed concurrently because they are not dependent on other tasks. This is important since the objective is to restore the system in the least amount of time possible.
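Identifying the steps which can run concurrently is a dependency ordering problem. The sketch below groups steps into waves using Python's standard graphlib module; the step names and their dependencies are hypothetical and would come from an installation's own plan.

```python
from graphlib import TopologicalSorter


def concurrent_waves(dependencies):
    """Group steps into waves; all steps in one wave can run concurrently.

    dependencies maps each step to the set of steps it must wait for.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if steps depend on each other
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical plan: each step lists the steps that must finish first.
plan = {
    "declare disaster": set(),
    "recall tapes": {"declare disaster"},
    "request vendor zaps": {"declare disaster"},
    "clip hot site packs": {"declare disaster"},
    "restore volumes": {"recall tapes", "clip hot site packs"},
    "re-IPL installation system": {"restore volumes"},
    "apply vendor zaps": {"request vendor zaps", "re-IPL installation system"},
}
waves = concurrent_waves(plan)
```

In this example the second wave contains three steps, so recalling tapes, contacting vendors, and clipping packs can be assigned to different staff at the same time.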
The information provided in this article is intended to provide a background on the critical processes and functions required to restore a system at an offsite processing center. The contingency plan should contain procedures for performing these critical functions. However, there are various methods that can be used to perform these functions, which should be considered when reviewing a contingency plan.
This article was written more than five years ago. Events may have changed since this article was written.
Copyright 2006, Audit Serve, Inc. All rights reserved. Reproduction, which includes links from other Web sites, is prohibited except by permission in writing.