Load Balancing and Fault Tolerance
- an efficient approach
Author | Version | Date | Remarks |
 | 0.5 | 18 May 2002 | First created |
1 Abstract
2 Concepts
3 Design Proposal
3.1 Generic approach
3.1.1 Structure of a Cluster Node
3.1.2 Communication protocol
3.2 Integration with JOnAS
3.2.1 Cluster topology
3.2.2 Using the cluster
4 Requirements
5 Roadmap
6 Open questions
1 Abstract
The purpose of this document is to present a solution for providing load balancing and fault tolerance in a server cluster. To that end, the document:
· Defines the concepts related to this topic
· Proposes a solution that is flexible and quick to implement
· Defines the requirements for the first release of the load balancing service for JOnAS
· Establishes the road map of the implementation
2 Concepts
Clustering - connecting two or more computers together in such a way that they behave like a single computer. Clustering is used for parallel processing, load balancing and fault tolerance.
Node - a member of a cluster.
Load Balancing - distributing processing and communications activity evenly across a computer network so that no single device is overwhelmed. Load balancing is especially important for networks where it is difficult to predict the number of requests that will be issued to a server. Busy application servers typically employ two or more servers in a load-balancing scheme. If one server starts to get swamped, requests are forwarded to another server with more capacity.
Fault Tolerance - the ability of a system to respond gracefully to an unexpected hardware or software failure.
Failover - a backup operation that automatically switches to a standby database, server or network if the primary system fails or is temporarily shut down for servicing. Failover automatically, and transparently to the user, redirects requests from the failed or down system to the backup system, which mimics the operations of the primary system.
LBFTS - abbreviation for "Load balancing and fault tolerance service", used in this document.
3 Design Proposal
3.1 Generic approach
I propose a solution for developing a service with support for load balancing and fault tolerance that can be easily integrated with various server applications. The solution addresses the following problems:
1. A client request will be dispatched to the server that can satisfy it in the shortest time.
2. All the nodes of the cluster will have the same state. This means that any change to the state of one node will be automatically reflected on the other nodes of the cluster.
3. Any server failure will be transparent to the client, because the client proxy will dispatch the request to another node of the cluster.
4. There will be no single point of failure in the system, so there will be no master node or anything similar.
5. The load balancing strategy, the communication protocol and the storage of the node states will be customizable (they can be implemented using various software toolkits, strategies and algorithms).
3.1.1 Structure of a Cluster Node
The following diagram describes a generic view of the Node component:

The components of the cluster node are described below.
3.1.1.1 Transport layer
This is the component that provides communication between the nodes of the cluster. The transport layer must provide reliable asynchronous communication and guarantee the order of the messages. In the first version of the LBFTS an implementation based on javagroups will be provided. The transport layer offers services for publishing asynchronous messages and for registering handlers for messages published by other nodes of the cluster.
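The two services of the transport layer (publishing and handler registration) could be expressed as a small Java contract. This is only a minimal sketch: the names Transport, MessageHandler and InMemoryTransport are assumptions made for illustration and are not part of javagroups or of any existing JOnAS API. The in-memory class merely stands in for a real network transport and delivers messages synchronously, in publish order.

```java
import java.io.Serializable;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Callback invoked for each published message (hypothetical name). */
interface MessageHandler {
    void onMessage(Serializable message);
}

/** Hypothetical transport contract: publish messages, register handlers. */
interface Transport {
    void publish(Serializable message);
    void registerHandler(MessageHandler handler);
}

/** In-process stand-in for illustration only; no network is involved. */
class InMemoryTransport implements Transport {
    private final List<MessageHandler> handlers = new CopyOnWriteArrayList<>();

    @Override
    public void registerHandler(MessageHandler handler) {
        handlers.add(handler);
    }

    @Override
    public void publish(Serializable message) {
        // Deliver to every registered handler, preserving publish order.
        for (MessageHandler handler : handlers) {
            handler.onMessage(message);
        }
    }
}
```

A javagroups-based implementation would sit behind the same Transport interface, so the rest of the LBFTS would not depend on the toolkit that is chosen.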
3.1.1.2 State manager
This component will manage the state of the node. The state manager will be responsible for storing and retrieving the objects that are subject to data replication. Every object that supports replication of its state must be assigned a unique id, which can be computed in various ways. The state will be a map with the object id as key and the object itself as value. For the first version the state manager will be a simple wrapper over the Hashtable class. In future versions more advanced state managers could be provided, to avoid memory overload or other problems (e.g. the state will be saved in some persistent storage, and part of it will be loaded in memory, in a cache object).
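A first-version state manager of this kind could look roughly as follows. This is a sketch under stated assumptions: the class name HashtableStateManager, the use of String ids and the getState/setState snapshot methods are illustrative choices, not a fixed design. Note that Hashtable rejects null keys and values, so ids and replicated objects must be non-null.

```java
import java.util.Hashtable;
import java.util.Map;

/** Hypothetical state manager: a thin wrapper over Hashtable keyed by object id. */
class HashtableStateManager {
    private final Map<String, Object> entries = new Hashtable<>();

    /** Stores or replaces the object registered under the given id. */
    void setEntry(String id, Object value) {
        entries.put(id, value);
    }

    /** Returns the object for the id, or null when the id is unknown. */
    Object getEntry(String id) {
        return entries.get(id);
    }

    void removeEntry(String id) {
        entries.remove(id);
    }

    /** Returns a snapshot copy, e.g. to answer a getState request. */
    Map<String, Object> getState() {
        return new Hashtable<>(entries);
    }

    /** Replaces the whole state, e.g. when a joining node receives it. */
    void setState(Map<String, Object> state) {
        entries.clear();
        entries.putAll(state);
    }
}
```

A later, cache-backed state manager could expose the same methods while keeping only part of the state in memory.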
3.1.1.3 Load factor evaluator
This component will evaluate the load factor of the server and will be used in the load balancing process to decide which server will handle a request. There can be various strategies for computing the load factor, so there will be various implementations of this component.
Question: Which load balancing strategy should be implemented for the first version?
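To make the discussion concrete, here is one possible shape for a pluggable evaluator. It is only an illustration, not a recommendation for the first version: the interface name LoadFactorEvaluator, the convention that a lower value means a less loaded node, and the strategy itself (active requests divided by a configured capacity) are all assumptions.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical contract for any load factor strategy. */
interface LoadFactorEvaluator {
    /** Lower values mean the node is less loaded (a convention assumed here). */
    double getLoadFactor();
}

/** Example strategy: requests currently in progress divided by a configured capacity. */
class ActiveRequestEvaluator implements LoadFactorEvaluator {
    private final AtomicInteger active = new AtomicInteger();
    private final int capacity;

    ActiveRequestEvaluator(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** To be called by the server when it starts handling a request. */
    void requestStarted() {
        active.incrementAndGet();
    }

    /** To be called by the server when a request completes. */
    void requestFinished() {
        active.decrementAndGet();
    }

    @Override
    public double getLoadFactor() {
        return (double) active.get() / capacity;
    }
}
```

Other strategies (CPU usage, response time, round robin) would implement the same interface and be selected through the configuration file.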
3.1.2 Communication protocol
There are three major scenarios that this solution must handle. I will describe the workflow for each of them, but first I will describe the initialization of a node.
3.1.2.1 Initialization of a node
Suppose that a new node is added to the system. This node doesn't know whether there are other nodes in the cluster. The initialization has the following steps:
1. Read a configuration file.
2. According to the information from the configuration file, the LBFTS instantiates the implementations of the transport protocol, the state manager and the load factor evaluator.
3. The LBFTS is registered as a message listener with the transport implementation and sends a getState request with a defined timeout.
4. When the setState callback is called, the state of the node is initialized with the received state.
5. If the setState callback is not called within the predefined timeout, the node will assume that it is the first in the system, so the initial state will be empty.
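Steps 4 and 5 amount to waiting for a callback with a timeout. A minimal sketch of that logic is shown below, assuming a hypothetical NodeInitializer class. Sending the getState request itself is left out, because it depends on the transport implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper covering steps 4 and 5 of the node initialization. */
class NodeInitializer {
    private final CompletableFuture<Map<String, Object>> pendingState = new CompletableFuture<>();

    /** Callback invoked when another node answers the getState request. */
    void setState(Map<String, Object> state) {
        pendingState.complete(new HashMap<>(state));
    }

    /** Waits for setState; returns an empty state when the timeout expires. */
    Map<String, Object> awaitInitialState(long timeoutMillis) {
        try {
            return pendingState.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // No answer in time: this node assumes it is the first in the cluster.
            return new HashMap<>();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for state", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("state transfer failed", e);
        }
    }
}
```

The timeout value would come from the configuration file read in step 1. A value that is too short makes a joining node start with an empty state even though other nodes exist, so it should be chosen with the network latency in mind.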
3.1.2.2 Load Balancing
The following approach can be used to achieve load balancing.
1. A server with a started LBFTS receives a request for a remote object.
2. The server publishes a GET_LOAD_FACTOR request.
3. All the nodes respond to the request.
4. By comparing the load factors, the server decides which node of the cluster will return the remote object.
5. The client receives the JNDI name of the remote object.
6. The client makes the request for an object with the new JNDI name.
7. The server returns the remote object.
Note: this is a solution for load balancing only in the lookup phase. A solution for dynamic load balancing is outside the scope of this document, even though the replication mechanism proposed here could support dynamic load balancing too.
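Step 4 reduces to picking the smallest load factor among the replies. The sketch below assumes that the replies have already been collected into a map from the JNDI name offered by each node to the load factor it reported, and that a lower factor means a less loaded node. The class name LoadBalancer and the JNDI names used in the usage note are hypothetical.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper for step 4: compare the load factors reported by the nodes. */
class LoadBalancer {
    /**
     * Returns the JNDI name whose node reported the smallest load factor,
     * or an empty Optional when no node replied.
     */
    static Optional<String> chooseNode(Map<String, Double> loadFactorByJndiName) {
        return loadFactorByJndiName.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }
}
```

For example, with replies of 0.75 for "node1/AccountBean" and 0.25 for "node2/AccountBean", the method selects "node2/AccountBean". When two nodes report the same factor, which one wins depends on the iteration order of the map, so a real implementation may want an explicit tie-breaking rule.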
3.1.2.3 State replication
State replication is an implicit requirement of the system, needed for fault tolerance. This feature must ensure that if an object is modified on one cluster node, the change is reflected on the other nodes too.
1. The node on which an object that is subject to state replication was changed calls the required method (setEntry or removeEntry) of the Node class (setEntry is also called when an object is created).
2. In the setEntry or removeEntry method of the Node class, the local state of the object is changed using the corresponding method of the State implementation class; after that, the node publishes a message of the corresponding type (SET_ENTRY or REMOVE_ENTRY).
3. The other nodes of the cluster receive the message, and their onMessage callback handler is called.
4. In the callback handler of the Node class, the related method of the State interface implementation is called (setEntry or removeEntry).
For the first release, the state of the object passed over the network will be the serialized form of the object. In the next release a more advanced technique must be used (something like a light serialization, or maybe XML messages). There is another possibility for state replication: the message specifies the object, the method (or methods) that were called and the parameters, and every copy of the object calls the specified methods using reflection. This can sometimes be a great optimization, but it can also perform poorly (if the method takes too much time). I think a mixed approach (light serialization and method call), configurable for each bean at deploy time, could be a successful solution.
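The four replication steps can be sketched as follows. Node, setEntry, removeEntry, onMessage, SET_ENTRY and REMOVE_ENTRY come from the description above; everything else is an assumption made for illustration, in particular the ReplicationMessage class, the String ids and the use of a Consumer as a stand-in for the transport layer.

```java
import java.io.Serializable;
import java.util.Hashtable;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical wire message carrying one state change. */
class ReplicationMessage implements Serializable {
    enum Type { SET_ENTRY, REMOVE_ENTRY }

    final Type type;
    final String id;
    final Serializable value; // null for REMOVE_ENTRY

    ReplicationMessage(Type type, String id, Serializable value) {
        this.type = type;
        this.id = id;
        this.value = value;
    }
}

class Node {
    private final Map<String, Serializable> state = new Hashtable<>();
    private final Consumer<ReplicationMessage> publisher; // stands in for the transport layer

    Node(Consumer<ReplicationMessage> publisher) {
        this.publisher = publisher;
    }

    /** Steps 1 and 2: change the local state, then publish SET_ENTRY. */
    void setEntry(String id, Serializable value) {
        state.put(id, value);
        publisher.accept(new ReplicationMessage(ReplicationMessage.Type.SET_ENTRY, id, value));
    }

    /** Steps 1 and 2: change the local state, then publish REMOVE_ENTRY. */
    void removeEntry(String id) {
        state.remove(id);
        publisher.accept(new ReplicationMessage(ReplicationMessage.Type.REMOVE_ENTRY, id, null));
    }

    /** Steps 3 and 4: apply a change received from another node without publishing it again. */
    void onMessage(ReplicationMessage message) {
        if (message.type == ReplicationMessage.Type.SET_ENTRY) {
            state.put(message.id, message.value);
        } else {
            state.remove(message.id);
        }
    }

    Serializable getEntry(String id) {
        return state.get(id);
    }
}
```

The receiving side deliberately does not publish again, otherwise each change would bounce between the nodes forever. The sketch also assumes the transport does not deliver a node's own messages back to it; if it does, applying them a second time is harmless because both operations are idempotent.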
3.1.2.4 Fault tolerance
Several strategies can be used to handle the fault tolerance scenario. I separate them into two categories and describe both approaches.
1. Achieving fault tolerance using multicast on the client side
This solution supposes that the client machine supports IP multicast. To ensure fault tolerance, the remote reference of the server object must use the following algorithm when calling a remote method (this algorithm assumes that the client proxy knows the id that uniquely identifies the object on the server side):
- The remote method is invoked.
- If a communication exception occurs, the client sends a multicast request asking for an object with the specified id.
- All the servers that can offer that object respond with a message that contains the JNDI name of the object and the server's load factor.
- The client decides which remote object should be used and calls the method again on the new remote object.
2. Achieving fault tolerance without multicast on the client side
The first solution has a drawback: it requires the client side to support a reliable multicast protocol. So we now propose a solution without this drawback. The idea is that, at the first lookup, the client retains the names of all the remote objects of the type it requested.
- When the lookup is made, the client receives a list of all the objects of the required type.
- When a communication error occurs, the client tries another object from the list.
The weak point of this solution is that, if a new node is integrated into the system, clients cannot use the new node for remote objects that already exist in the system, although the new node may hold the state of those objects.
Question: Which solution should be used for the first release?
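The second approach (no multicast on the client) can be sketched as a simple retry loop over the list of names received at lookup time. This is an illustration only: FailoverInvoker is a hypothetical name, the remote call is modelled as a plain function from a JNDI name to a result, and a RuntimeException stands in for the communication error that a real client proxy would catch.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical client-side helper for failover without multicast. */
class FailoverInvoker {
    /**
     * Tries each candidate JNDI name in order until one call succeeds.
     * Throws IllegalStateException when every candidate fails.
     */
    static <R> R invoke(List<String> jndiNames, Function<String, R> call) {
        RuntimeException last = null;
        for (String name : jndiNames) {
            try {
                return call.apply(name);
            } catch (RuntimeException e) {
                // Treated as a communication error: try the next object in the list.
                last = e;
            }
        }
        throw new IllegalStateException("no node could serve the request", last);
    }
}
```

The first approach would differ only in where the candidate list comes from: instead of a list retained since the first lookup, the client would build it from the multicast replies and could order it by the reported load factors.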
3.2 Integration with JOnAS
3.2.1 Cluster topology
To provide a server cluster, the machines on which the cluster nodes will run must respect the following rules:
- All the machines of the cluster must be in the same local area network.
- The machines must be reachable by IP multicast.
- All the machines must have the same version of the JOnAS server installed.
- All the JOnAS servers must have started an LBFTS.
3.2.2 Using the cluster
To use the cluster facilities for a given bean, the following requirements must be met:
- The bean with support for load balancing and fault tolerance must be deployed on all the nodes of the cluster.
- The deployment descriptor of the bean must explicitly specify that the bean is to be deployed with support for fault tolerance and load balancing.
- On the server side, the code for the bean must be generated in such a way that it replicates the bean's state whenever required. I propose that the replication message be sent after each transaction in which the bean is involved is committed.
- On the client side, the generated proxy must be clever enough to make lookups for beans using the algorithm proposed in the load balancing section. The generated code must also implement support for fault tolerance, using the rules proposed in the fault tolerance section.
Question 1: Can the code for the enterprise bean and the client proxy be generated to support load balancing and fault tolerance in the fashion proposed here?
Question 2: An experienced JOnAS developer must also be involved in this task. Who will help me here?
4 Requirements
I describe here a set of facilities required of the LBFTS; by implementing these facilities we can say that we have an application server with good support for load balancing and fault tolerance. This is not an exhaustive list of requirements, so any input from the community is welcome. I hope that the JOnAS community will specify the target version for each requirement.
No | Description | Target version | Remarks |
1 | Load balancing for session beans, strategy per bean | | Can be done in the first release |
2 | Support for various (and customizable) load balancing strategies | | Can be done in the first release |
3 | Fault tolerance for stateless session beans | | Can be done in the first release |
4 | Fault tolerance for stateful session beans | | Can be done in the first release |
5 | Customizable transport protocol between cluster nodes | | Can be done in the first release |
6 | Secure communication between cluster nodes | | This depends on the transport implementation. Support for it can be implemented in the first release. |
7 | Bean state replication after each transaction | | It would be great to have it in the first release. |
8 | Bean state replication after each method | | Can be done in the first release |
9 | Replication messages based on bean serialization | | Will be done in the first release. |
10 | Replication messages based on some "lighter" messages | | Maybe in a future version. |
11 | Fault tolerance for entity beans | | In a future version. Support for load balancing and fault tolerance for session beans can cover this requirement if the container implements the EJB 2.0 specification, because it is a recommended programming practice to expose only session beans remotely and to access entity beans via their local interfaces, only from the session beans. |
5 Roadmap
To achieve our goal, I propose the following roadmap:
No | Activity | Time estimation | Remarks |
1 | Discussion of this document in the JOnAS community | 3-5 days | The result of the discussion should be the final release of this document |
2 | Develop a very simple javagroups-based program, which will be tested by objectweb | 3-5 days | The main target of this first test is to have a proof of concept for a reliable, serverless transport protocol, which will be used by the LBFTS. |
3 | Develop the general framework for LBFTS, and a test application | 7-10 days | |
4 | Test the general framework for LBFTS | 3-5 days | |
5 | Integrate the LBFTS with JOnAS | ??? | This is a problematic task for me, because I'm not familiar with the JOnAS architecture |
6 | Test the entire system | ??? | After this, we will have a great piece of software ;) |
6 Open questions
1. How much time will the integration of the LBFTS with JOnAS require?