IT Live Ltd
Resources » High Availability IT Live Ltd February 06, 2012
Building a Network with High Availability in mind

High Availability

Customers need to put key strategies in place to manage their IT systems. If followed, these strategies will save money, increase uptime, and result in happier, more productive users and system administrators.

This document introduces some key items, following industry-standard practice, for approaching your IT systems.

Spend Money…but not blindly

Quality costs money, and this is especially true for network and server equipment. More powerful computers cost less each year, whilst they grow in disk and memory capacity. This trend frequently creates the misconception that availability should somehow be cheaper and easier than in the past. This is NOT true: availability is a quality issue, not a commodity issue.

Server failures have a negative impact on the overall IT system and definitely have a cost impact on any company. This does not even take into account the stress placed on a system administrator faced with a problem that appears to have no resolution.

When planning your IT systems, make sure you acquire the correct equipment, software and skills to produce the required results.

Assume Nothing

High availability does not come bundled with systems. Achieving a production level of end-to-end systems availability requires direct effort at redundancy engineering, processes for management, testing, integration, application assessments and quality administration.

Nothing can simply be dropped into an environment and be expected to add quality or availability. Without the upfront engineering work, the opposite is actually true. The correct constraints and bounds in all processes and systems are required to achieve results.

When you achieve availability you add constraints. For any business to place these constraints on the system, there needs to be a good understanding of what the systems are, what they are used for and how they serve their purpose. Once this is understood, the common goal for availability can be reached.

Remove Single Points of Failure (SPOF)

A single point of failure is a single component that, should it fail, creates a certain degree of downtime. If one link in a chain breaks, the quality of the remaining links does not matter; the chain is broken. There are obvious SPOFs, such as servers, disks, network devices, cables, etc. Most of these components can be protected by redundancy.
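
The chain analogy can be put into numbers. The following is a minimal sketch, assuming that components fail independently of each other and using made-up availability figures; the function names are illustrative only. A chain of components is only up when every link is up, whereas a redundant group is only down when every copy is down.

```python
from math import prod

def series_availability(components):
    """Availability of a chain in which every component must be up."""
    return prod(components)

def redundant_availability(component, copies):
    """Availability of identical, independent copies in parallel:
    the group is down only if every copy is down at the same time."""
    return 1 - (1 - component) ** copies

# Hypothetical figures: a server, a disk and a switch, each 99% available.
chain = [0.99, 0.99, 0.99]
print(f"chain without redundancy: {series_availability(chain):.4f}")

# The same chain with every component duplicated.
protected = [redundant_availability(a, 2) for a in chain]
print(f"chain with redundant pairs: {series_availability(protected):.4f}")
```

With these assumed figures the unprotected chain is roughly 97% available, lower than any single component in it, while the duplicated chain comes out at roughly 99.97%. Real components rarely fail completely independently, so treat the result as an upper bound rather than a guarantee.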

There are other, equally dangerous, second-order SPOFs that also require consideration, e.g. the wide area network (WAN), reliance on external services, DNS, internet service providers (ISPs), etc.

In the quest for High Availability, all these SPOFs need to be identified, grouped and assessed to determine the probability of failure, the service impact of failure, the cost impact of failure and the cost involved in resolving them. Once these SPOFs are identified, they need to be dealt with accordingly.
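
One simple way to carry out such an assessment is to rank each SPOF by its expected yearly loss (probability of failure multiplied by the cost of a failure) and compare that with the cost of removing it. The sketch below is only an illustration: the `Spof` class, the inventory entries and all figures are hypothetical, and a real assessment would also weigh service impact that is hard to express in money.

```python
from dataclasses import dataclass

@dataclass
class Spof:
    name: str
    failure_probability: float  # assumed chance of a failure per year (0..1)
    outage_cost: float          # assumed cost of one failure
    fix_cost: float             # assumed cost of adding redundancy

    @property
    def expected_loss(self):
        """Expected yearly loss if the SPOF is left in place."""
        return self.failure_probability * self.outage_cost

def rank(spofs):
    """Order SPOFs by expected yearly loss, largest first."""
    return sorted(spofs, key=lambda s: s.expected_loss, reverse=True)

# Hypothetical inventory with made-up figures.
inventory = [
    Spof("core switch", 0.05, 80_000, 6_000),
    Spof("single ISP link", 0.30, 20_000, 4_000),
    Spof("file server disk", 0.10, 50_000, 1_500),
]

for s in rank(inventory):
    worth_fixing = s.expected_loss > s.fix_cost
    print(f"{s.name}: expected loss {s.expected_loss:,.0f}, "
          f"fix {s.fix_cost:,.0f}, worth fixing: {worth_fixing}")
```

In this made-up inventory the ISP link and the disk justify their fix cost within a year, while the core switch does not on cost alone.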

Maintain tight security

Security is a critical part of any system, as a breach in security can directly or indirectly cause downtime. Network breaches can lock users out, or delete, damage or alter files, which can seriously affect a business. Because some of these events can go unnoticed for extended periods of time, the results could be disastrous.

Breaches can come not just from outside, but also from inside. Care needs to be taken against corporate espionage, and the system needs to be protected against disgruntled employees and ex-employees.

  • Firewalls: A firewall is a gateway between the network and the rest of the world. It controls access to internal resources and is a critical part of any internet connected system.
  • Data Encryption: Data on tapes or on your network that is left unencrypted can be read by anyone with the means and desires to do so.
  • Strong Passwords: Make sure all passwords contain characters from at least 2 or 3 different character classes and are no fewer than 7 characters long. Passwords should be changed at regular intervals.
  • Protect Passwords: Don’t give passwords to users or vendors who don’t truly need them, whether it is an admin access account or normal user account.
  • Limit Administrative Access: Restrict it very closely. Restrict access to your critical servers.
  • Enable Auditing: Auditing has a performance impact on a system but can help you track down intruders and keep track of system calls made by every running application.
  • Secure Access to the Equipment: This should be done in such a manner as to control and deter unwanted access, without hindering the work of the administrators.
  • Secure Backup Tapes: Rebuilding and recovering data relies on the backup tapes being available, which makes them a very important part of any system. These tapes should therefore not be left lying in the open and should be stored at a secure offsite location, possibly at North Shore’s Plan-B company. These tapes also pose a security risk should they become available to third parties.
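
The password rule in the list above can be checked mechanically. This is a minimal sketch, not a complete password policy: the four character classes (lowercase, uppercase, digits, everything else) and the default of 3 required classes are assumptions, and the function names are illustrative.

```python
def character_classes(password):
    """Count how many of four classes appear: lower, upper, digit, other."""
    checks = [
        any(c.islower() for c in password),
        any(c.isupper() for c in password),
        any(c.isdigit() for c in password),
        any(not c.isalnum() for c in password),
    ]
    return sum(checks)

def meets_policy(password, min_length=7, min_classes=3):
    """True if the password is long enough and mixes enough classes."""
    return (len(password) >= min_length
            and character_classes(password) >= min_classes)

# Made-up sample passwords, for illustration only.
for candidate in ("password", "Ab1!", "Tape4Vault!"):
    print(candidate, meets_policy(candidate))
```

Passing `min_classes=2` gives the looser of the two thresholds mentioned in the list.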

Security has many aspects, including inbound access control, internal access control, user security, administrator account security, physical equipment access, theft, accidental damage and data security.

Consolidate Servers

Rather than running systems on many smaller servers, as is currently the case in many businesses, consolidating services onto fewer, larger servers with adequate capacity is advised in many production environments. This can significantly reduce system complexity and results in fewer servers to back up, fewer machines to maintain and fewer things that can potentially fail.

Consolidation is a powerful force for improving the simplicity and manageability of an environment, but it has the downside of putting more of your proverbial eggs in a single basket. The moral of the story is “Go ahead and put all your eggs in one basket, just make sure the basket is built out of titanium-reinforced concrete”, hence the requirement for quality equipment.

Today the IT industry has virtualization, which abstracts the server software from the hardware it runs on. The advantage is that the two components become separate entities, so failures in either can be handled without impacting the other.

Document Everything

The importance of good, solid documentation simply cannot be overstated. Documentation provides an audit trail of work that has been completed and gives future system administrators and contractors a guide for taking over or supporting existing systems. Good documentation is a fundamental tool for resolving system problems.

Establish Service Levels and/or Service Level Agreements

To better manage a disaster, many companies put written service level agreements in place with their service providers to define the levels of service they will provide.

Some of these areas include:

  • Availability levels
    • What percentage of the time are the systems actually up?
  • Hours of service
    • During what hours are the systems actually critical? And on which days?
  • Locations
    • Can all locations expect the same level of service?
    • Are all locations accessible to all required persons during the hours set forth?
  • Priorities
    • What if more than one system is down at the same time?
    • Which systems will get tended to first?
  • Escalation policies
    • What if a service level is not met?
    • Who gets called?
    • What are the timelines?
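
An availability percentage is easier to discuss once it is converted into the downtime it permits. The sketch below is a simple illustration; the targets shown and the business-hours window are assumed examples, not recommended service levels.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def allowed_downtime_hours(availability_percent, service_hours=HOURS_PER_YEAR):
    """Hours of downtime per year that a given availability target permits."""
    return service_hours * (1 - availability_percent / 100)

# Round-the-clock service at a few example targets.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_hours(target):.2f} hours/year")

# Assumed business-hours window: 10 hours a day, 5 days a week, 52 weeks.
business_hours = 10 * 5 * 52
print(f"99.9% of business hours -> "
      f"{allowed_downtime_hours(99.9, business_hours):.2f} hours/year")
```

For a round-the-clock service, 99% allows about 87.6 hours of downtime a year and 99.9% allows about 8.76 hours, which shows why the hours-of-service question matters as much as the percentage itself.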

Without planning and responsibility, there will be no means by which any of the systems in use can be measured. In order for a service provider, internal or other, to meet certain criteria for service, these providers require authority and responsibility for every component that brings data from the servers to the users. If even one of these components is outside of their control, you cannot be assured of meeting expected levels of service and if that component fails, who will fix it?

Test Everything

An integral part of disaster recovery (DR) plans is the testing of all components involved, e.g. systems, software, hardware, applications, modifications, etc.

These tests should be done at all levels: the unit level (what happens if a disk fails?), the subassembly or subsystem level (what if we unplug this disk array or pull out this network connection?), the complete system (what if this server goes offline?) and the application layer (what if this software fails?), as well as on a full end-to-end basis.

Tests need to be repeated on a regular basis, with well-documented results and adequate change controls. The only way to keep up with a dynamically changing environment is to test the systems supporting it.

Maintain Separate Environments

It is well advised to keep your production and development environments separate and independent. This applies not just to the servers but, if possible, to the networks and the users as well. Be prepared to operate different environments with varying levels of control that are safe, acceptable and adequate for the specific environment.

The following are some of the environments that need to be considered:

  • Production
    Production is the main backbone of the company and changes are only carried out with the correct change control and documentation, as this is a 24/7 part of the environment. If a change is not successful, rollback to the previous working state should be possible. This can only be done if everything has been documented and the changes are known.
  • Development
    Clearly this environment is for works in progress. Changes still need to be monitored fairly closely. This environment is there to support developers, and provides them with a platform to test software that might not be working as expected, without affecting the production environment.
  • Laboratory
    There will be an expected level of testing on new and future software, new hardware and generally new technologies. A Laboratory network is completely isolated from any other networks and provides a true playground in which new systems can be properly evaluated without affecting anything else.

Invest in Failure Isolation

Failure isolation means that one problem does not bubble over and affect something else. You are able to contain failures close to where they occur so that they can be identified and resolved before they spread.

Examine the history of the systems

To improve your system, you need to make changes to it based on its history. To manage their environment, customers need to put some form of system measurement in place.

The following are some of the key questions that need answering:

  • Why does your system go offline?
  • What are the common causes?
  • Are there patterns to the failures?

Customers must account for every incident that causes downtime regardless of its significance.
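
Even a very simple incident log makes the questions above answerable. The following is a minimal sketch using an entirely hypothetical log; the record layout (start time, cause, minutes of downtime) and the function names are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Hypothetical incident log: (start time, cause, minutes of downtime).
incidents = [
    ("2012-01-09 02:10", "backup job overran", 45),
    ("2012-01-16 02:05", "backup job overran", 50),
    ("2012-01-18 14:30", "disk failure", 180),
    ("2012-01-23 02:15", "backup job overran", 40),
]

def downtime_by_cause(log):
    """Total minutes of downtime attributed to each cause."""
    totals = Counter()
    for _, cause, minutes in log:
        totals[cause] += minutes
    return totals

def incidents_by_weekday(log):
    """Number of incidents starting on each day of the week."""
    counts = Counter()
    for start, _, _ in log:
        when = datetime.strptime(start, "%Y-%m-%d %H:%M")
        counts[DAYS[when.weekday()]] += 1
    return counts

print(downtime_by_cause(incidents).most_common())
print(incidents_by_weekday(incidents).most_common())
```

In this made-up log the disk failure caused the most downtime, but the recurring Monday-morning backup overrun is the pattern, and it would be invisible if minor incidents were not recorded.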

Copyright © 2004-2006 IT Live LTD. All Rights Reserved.  | Terms of Trade | Privacy Statement