
Many considerations go into designing a scalable robust application infrastructure. Those considerations vary quite a bit from application to application and organization to organization. In fact, agreeing on the goals and constraints of the proposed system is typically the most important task in ensuring an efficient, relevant architecture.

When considering a MokaFive deployment, the following goals are typical and will be used to drive the example design covered in this post:

  • Minimize WAN traffic
  • Minimize user wait times for initial deployment and updates
  • Meet a 4-hour SLA in the event of a server failure
  • Meet a 24-hour SLA in the event of a site failure
  • Eliminate single points of failure within the application
  • Support up to 2000 users

These goals will be met using the following system design components:

  • Dedicated database servers
  • Geographically distributed image store infrastructure
  • High availability configuration
  • Disaster recovery configuration

The resulting design classifies each data source within the MokaFive system based on the amount of data it typically carries. Because the policy and reporting data transferred between management servers and clients, as well as between management servers and database servers, is small, those systems are centralized, with multiple servers provided for redundancy only. The image stores, on the other hand, carry, replicate and deliver larger amounts of data, so they are designed with a distributed approach to minimize WAN traffic and delivery times in addition to providing disaster recovery and high availability.

Business continuity planning design

Before digging into the design, let me define the terms as I’m using them (these terms tend to be used to mean different things by different people):

  • Business continuity planning (BCP) – a process that produces a design by taking into account a variety of potential risks and identifying approaches to mitigate as many of them as possible. The BCP guidelines are typically provided by the business in the form of required uptime and allowed downtime during incidents for different systems and data sources.
  • Disaster recovery (DR) – a configuration created to meet BCP requirements that supports risk mitigation during a significant incident, typically involving the temporary or permanent deactivation of a data center or site.
  • High availability (HA) – a configuration created to meet BCP requirements that provides rapid service resumption in the event of a local outage such as a server or component failure.

In the case of a MokaFive system, the ability of the system to recover from a local server or component failure (HA) or a site failure (DR) depends on the configuration of each of the following components:

Database – MokaFive uses a Microsoft SQL Server database to store policy, client and configuration data, which is used to drive the implementation and management of clients and images.

Application server – all communication with the platform is managed by the application server. It is the primary contact point for clients, administration consoles and automation scripts.

Image stores – delivering the content of virtual images is performed by the image stores. Both primary and replica image stores are supported by MokaFive, with the former being a read/write copy used for authoring and staging and the latter a read-only copy typically used as a distribution point for clients.

The design of each of these components to support the hybrid centralized/distributed model is covered in the following sections.


Database design

Database redundancy for both HA and DR leverages capabilities built into the Microsoft SQL Server product. In order to keep costs down, this configuration is designed with the Standard edition of SQL Server in mind.

High availability is achieved using a two node database cluster. This configuration does increase cost due to the need for shared storage but ensures minimal downtime in the event of a SQL server or component failure.

Disaster recovery to a second data center is achieved using log shipping, which allows SQL Server to replay copied transaction logs on a standby database server. This choice avoids the need for the Enterprise edition of SQL Server, which is required to support asynchronous database mirroring, the other alternative for database redundancy across a WAN link.
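
To make the log shipping cycle concrete, here is a rough sketch of a single backup-and-restore pass driven from Python via pyodbc. The server names, share path and database name are placeholders I made up for illustration; in practice the built-in SQL Server log shipping jobs would normally run this on a schedule.

    import pyodbc

    # Placeholder connection strings and paths; adjust for the actual environment.
    PRIMARY = "Driver={SQL Server};Server=sql-primary;Trusted_Connection=yes"
    STANDBY = "Driver={SQL Server};Server=sql-dr;Trusted_Connection=yes"
    LOG_FILE = r"\\fileshare\logship\MokaFive.trn"   # UNC path visible to both servers
    DATABASE = "MokaFive"

    def ship_log():
        # BACKUP/RESTORE cannot run inside a transaction, so autocommit must be on.
        src = pyodbc.connect(PRIMARY, autocommit=True)
        src.execute(f"BACKUP LOG [{DATABASE}] TO DISK = '{LOG_FILE}'")
        src.close()

        dst = pyodbc.connect(STANDBY, autocommit=True)
        # NORECOVERY leaves the standby database ready to accept the next shipped log.
        dst.execute(f"RESTORE LOG [{DATABASE}] FROM DISK = '{LOG_FILE}' WITH NORECOVERY")
        dst.close()

    if __name__ == "__main__":
        ship_log()

The standby stays in a restoring state until a failover, at which point a final RESTORE WITH RECOVERY brings it online.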


Application server design

The application server component doesn’t store any data, so very little needs to be staged in advance to support failover, either locally within a data center or across data centers in a site failure scenario.

The installation media can be used to deploy the software on a warm server, which should be patched regularly and ready for the application deployment. The deployment does require manual intervention but is very simple to execute and should be configured to use the active database server and image store during installation.

Access to the application server by clients is provided using an alias DNS record (a CNAME), which is also used for the SSL certificate and configured within the MokaFive console. This configuration requires one simple additional step, manually modifying the DNS record, in order to complete the failover process. This action can also be scripted.
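
To illustrate what that scripted DNS change could look like, here is a minimal sketch using the dnspython library to repoint the CNAME alias at the standby application server through a dynamic DNS (RFC 2136) update. The zone, alias, TSIG key and server address are placeholders and will differ in any real environment.

    import dns.query
    import dns.rcode
    import dns.tsigkeyring
    import dns.update

    ZONE = "corp.example.com"                        # zone that holds the alias record
    ALIAS = "m5"                                     # clients connect to m5.corp.example.com
    STANDBY_TARGET = "m5-app02.corp.example.com."    # standby application server
    DNS_SERVER = "10.0.0.53"                         # primary DNS server for the zone

    # TSIG key authorized to perform dynamic updates against the zone (placeholder).
    keyring = dns.tsigkeyring.from_text({"failover-key": "c2VjcmV0LXBsYWNlaG9sZGVy"})

    update = dns.update.Update(ZONE, keyring=keyring)
    # Replace whatever the alias currently points at with the standby server.
    update.replace(ALIAS, 300, "CNAME", STANDBY_TARGET)

    response = dns.query.tcp(update, DNS_SERVER)
    print("DNS update result:", dns.rcode.to_text(response.rcode()))

Keeping the record’s TTL short (here 300 seconds) limits how long clients cache the old target after a failover.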

In order to make sure that clients and replicas are deployed using this alias rather than the server FQDN, we simply need to modify the server’s DNS name entry in the iConfig administration console under the General tab in the Network section. The value should match the alias stored in DNS and used in the SSL certificates protecting the system.

Image store design

Configuring redundancy for the image store is primarily an exercise in file replication. The image stores, both primary and replica, are just a set of files that need to be available to clients and to the application server. Two things must work together to ensure redundancy: availability of the primary image store, and the ability of replicas and the Creator application to access the required information from the correct location as needed.

Maintaining availability of the primary image store can be accomplished with any file replication tool. I typically use Microsoft’s Distributed File System Replication (DFSR) because it’s built into the server operating system I use and is efficient, secure and easy to configure. The latest version of MokaFive as of this writing (version 3.5) includes a new primary image store replication option that will likely negate the need for a separate replication tool going forward.

If, for any reason, the built-in replication mechanism isn’t suitable, DFSR or another replication tool should do the trick just fine. Make sure to select a tool that replicates only the changed portions of files, because the image store tends to contain very large files that change only a little bit at a time.
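
As a simple way to keep an eye on replication health, here is a hypothetical helper that walks the primary image store and a replica and reports files that are missing or differ in size. The UNC paths are placeholders; comparing hashes would be more thorough but much slower on multi-gigabyte image files.

    import os

    PRIMARY_STORE = r"\\m5-app01\imagestore"   # placeholder UNC paths
    REPLICA_STORE = r"\\m5-app02\imagestore"

    def out_of_sync(primary_root, replica_root):
        """Yield (relative path, reason) for files missing or different on the replica."""
        for dirpath, _dirs, files in os.walk(primary_root):
            for name in files:
                src = os.path.join(dirpath, name)
                rel = os.path.relpath(src, primary_root)
                dst = os.path.join(replica_root, rel)
                if not os.path.exists(dst):
                    yield rel, "missing on replica"
                elif os.path.getsize(src) != os.path.getsize(dst):
                    yield rel, "size mismatch"

    for rel, reason in out_of_sync(PRIMARY_STORE, REPLICA_STORE):
        print(f"{rel}: {reason}")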

Replicating the primary store to a second server within the same data center and a third server in the DR data center will create a topology that mirrors the database and application servers (in fact, the application server is often used for the primary image store).

Once the primary image store is redundant, we just need to make sure that the replicas and Creator can find their primary. This is achieved using the same alias-based mechanism that ensures access to the application servers. If the primary image store is hosted on the application server (my typical best practice), then no additional configuration is required. If the primary image store is on a dedicated server, it must be registered in the administration console using the alias name (in this case you will need a total of two aliases, one for the application server and one for the primary image store).

One big note: a lot of this configuration can be simplified by using a global, application-level load balancer, but since many organizations do not have one, this approach serves as a more general best practice that can be used anywhere.