
From Vizuri's Experts

Configuring HornetQ for failover (1 of 2)

This is part one of a two-part blog post on configuring HornetQ for failover using replication. See Part 2 here.

As part of our JBoss consulting practice, we work with many customers that rely heavily on messaging technologies to conduct business, both internally and with third-party organizations. They expect their systems to deliver messages in a timely, secure, and reliable fashion, so that information is received and processed in the order in which it was sent, in a manner resilient to server failures.

Durable vs. Non-durable

With messaging technologies, whether resiliency is needed depends on the importance of the payload. If your business sends transient or non-essential information, it may be acceptable for a message to be lost because of a server failure or error condition. But what if the message contains critical information on which subsequent business processes rely? The answer is clustered, durable queues.

HornetQ Cluster Configuration

Ideally, a clustered messaging configuration should exhibit the following properties:

  • Messages should be distributed among the available systems
  • Messages should be durable, such that they survive a server crash
  • If one node goes down, its messages fail over seamlessly to another node
  • Delivery of messages follows the appropriate messaging semantics
    • Message queues result in a message being delivered once and only once
    • Message topics result in a message being delivered to each node
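The two delivery semantics in the list above can be illustrated with a small in-memory sketch. This is not HornetQ code: `ToyQueue` and `ToyTopic` are hypothetical names invented for this example, consumers are modelled as plain lists, and round-robin is just one simple way to distribute queue messages.

```python
from itertools import cycle


class ToyQueue:
    """Point-to-point: each message goes to exactly one consumer."""

    def __init__(self, consumers):
        # Round-robin is an assumption made for this sketch.
        self._next_consumer = cycle(consumers)

    def send(self, message):
        next(self._next_consumer).append(message)


class ToyTopic:
    """Publish-subscribe: each message goes to every subscriber."""

    def __init__(self, subscribers):
        self._subscribers = subscribers

    def publish(self, message):
        for subscriber in self._subscribers:
            subscriber.append(message)


# Three consumers, each modelled as a list of received messages.
queue_consumers = [[], [], []]
work_queue = ToyQueue(queue_consumers)
for order_id in ("order-1", "order-2", "order-3", "order-4"):
    work_queue.send(order_id)

topic_subscribers = [[], [], []]
price_topic = ToyTopic(topic_subscribers)
price_topic.publish("price-update")

print(queue_consumers)    # every order appears in exactly one list
print(topic_subscribers)  # the update appears in every list
```

The queue spreads four orders across three consumers without duplicating any of them, while the topic copies its single update to all three subscribers.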

A typical symptom of a misconfigured failover setup is that the other active HornetQ nodes do not pick up the messages held by a failed HornetQ node. Those messages remain stuck on the failed node until it is started back up.

The root cause of this symptom is that HornetQ stores in-flight messages in a series of files referred to as journals. For messages to continue processing after a node failure, these journals must be available to a backup HornetQ node that will pick up where the failed node left off. This can be achieved using either “shared storage” or “replication.”


HornetQ Failover Configuration

For failover, shared storage requires that the journals be located on a cluster file system mount (e.g. GFS), which is not a trivial requirement. GFS is typically recommended only when the technology is already adopted and in use in the data center, as it requires additional system administration expertise and IT infrastructure.

Replication, however, allows the active and backup HornetQ servers to maintain their own copies of the journals. The backup servers keep their copy in sync with the active servers through background network traffic. In the event that the active goes down, one of the backups takes over where things left off.
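The replication idea can be sketched with a toy model. This is a simplification under stated assumptions, not HornetQ's implementation: `ToyServer` and `ReplicatedPair` are hypothetical names, the journal is a plain list, and entries are copied to the backups synchronously rather than through background network traffic.

```python
class ToyServer:
    def __init__(self, name):
        self.name = name
        self.journal = []   # in-flight messages, in arrival order
        self.alive = True


class ReplicatedPair:
    """One active server plus backups that keep their own journal copies."""

    def __init__(self, active, backups):
        self.active = active
        self.backups = list(backups)

    def accept(self, message):
        # Record the message locally, then copy it to every backup.
        self.active.journal.append(message)
        for backup in self.backups:
            backup.journal.append(message)

    def fail_active(self):
        # The active goes down; the first backup (if any) takes over
        # using its own copy of the journal.
        self.active.alive = False
        self.active = self.backups.pop(0) if self.backups else None
        return self.active


replicated = ReplicatedPair(
    ToyServer("active"), [ToyServer("backup-1"), ToyServer("backup-2")]
)
replicated.accept("m1")
replicated.accept("m2")
new_active = replicated.fail_active()

# A broker with no backups has nobody to take over its journal.
standalone = ReplicatedPair(ToyServer("solo"), [])
standalone.accept("m3")
nobody = standalone.fail_active()
```

After the failure, `new_active` is `backup-1` and already holds both messages, whereas the standalone broker leaves `m3` stranded until it restarts.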

If a configuration is not set up with any backup HornetQ servers and stores its journals in a non-clustered location, the result is “stove-piped” message brokers with no failover behavior at all.

A note about HornetQ “servers” vs. JBoss “servers”: a JBoss server refers to one managed instance of JBoss used for actual application deployment (i.e. not a host or domain controller). One JBoss server can contain one or more HornetQ servers. HornetQ servers, in our context, are configured and contained within a JBoss instance and consist of the connectors and acceptors needed to receive and deliver messages.

Proposed Configuration

The proposed configuration is adapted from the “reference architecture” from the Red Hat team. This consists of the following setup:

  • A set of 3 JBoss servers is configured
  • Each JBoss server has 3 HornetQ servers, 1 active and 2 backups
  • Each active HornetQ server is tied to one of the backups on each of the other JBoss servers, so every instance is redundantly backed up leaving no single point of failure
  • The backup HornetQ servers are kept in sync with their respective active server using replication
  • The backup HornetQ servers do not have the connectors necessary to actually deliver the messages
  • When an active HornetQ server fails, one of the backups will be determined to be in charge for its group
  • Since it can’t actually process the messages, it will immediately hand them over to an active sibling that is capable of doing the work
  • A failed primary node that comes back online:
    • Will find the server currently in charge of the group and resynchronize its journals so the messages are not reprocessed
    • Will then resume control, and the other server will revert to being a backup
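The failover and failback sequence described in the list above can be walked through with a small state sketch. Everything here is illustrative: `BackupGroup` and its method names are hypothetical, the "election" simply picks the first backup, and the hand-off to an active sibling is modelled as moving messages between two lists.

```python
class BackupGroup:
    """Toy model of one backup group: a primary plus backups on other nodes."""

    def __init__(self, primary, backups):
        self.primary = primary
        self.backups = list(backups)
        self.in_charge = primary
        self.pending = []      # messages not yet processed
        self.handed_off = []   # messages passed to an active sibling

    def primary_fails(self):
        # One backup is determined to be in charge of the group.
        self.in_charge = self.backups[0]
        # It has no connectors, so it hands the messages to a sibling.
        self.handed_off.extend(self.pending)
        self.pending.clear()

    def primary_returns(self):
        # The primary resynchronizes with the server in charge, so
        # handed-off messages are not replayed, then resumes control.
        still_pending = list(self.pending)
        self.in_charge = self.primary
        return still_pending


# Group 1 from the matrix: primary on node A, backups on nodes B and C.
group = BackupGroup("node-a/primary", ["node-b/backup-2", "node-c/backup-1"])
group.pending.extend(["m1", "m2"])

group.primary_fails()
in_charge_during_outage = group.in_charge

to_replay = group.primary_returns()
```

During the outage a backup is in charge and both messages have been handed off; once the primary returns there is nothing left to replay and control reverts to it.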

This setup achieves the instant failover desired, at the cost of some additional configuration complexity. The original reference architecture creates 3 profiles to achieve the backup matrix (see table below). In certain applications this would vastly increase the complexity and fragility of application maintenance. For example, if one of the system properties, JNDI entries, or data source configurations were changed, they would have to be changed in 3 different places. This makes the potential for error fairly high.

We discovered, however, that through the use of parameter substitution we can achieve the same configuration using a single profile. All of the configuration settings will be examined in a subsequent section, but it is noteworthy that the only difference between the JBoss profiles is the backup group assignments, which produce the redundant, distributed arrangement:

HornetQ Server | JBoss Node A   | JBoss Node B   | JBoss Node C
---------------|----------------|----------------|---------------
Primary        | backup-group-1 | backup-group-2 | backup-group-3
Backup 1       | backup-group-2 | backup-group-3 | backup-group-1
Backup 2       | backup-group-3 | backup-group-1 | backup-group-2
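The assignments in the table follow a simple rotation, which is what makes a single parameterized profile possible. A minimal sketch of that rule, assuming nodes are indexed from 0 and roles are ordered primary, backup 1, backup 2 (the function name `backup_group_matrix` is invented for this example):

```python
def backup_group_matrix(node_count, backups_per_node=2):
    """Return {node_index: [primary_group, backup_1_group, ...]}."""
    if backups_per_node >= node_count:
        raise ValueError("a backup must not share a node with its primary")
    matrix = {}
    for node in range(node_count):
        # Role 0 is the primary; roles 1..n are the backups.
        matrix[node] = [
            f"backup-group-{(node + role) % node_count + 1}"
            for role in range(backups_per_node + 1)
        ]
    return matrix


three_nodes = backup_group_matrix(3)
for node, groups in three_nodes.items():
    print(f"JBoss Node {'ABC'[node]}: {groups}")
```

Each node's list reproduces one column of the table, and every group ends up with its primary and backups on different nodes.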

Maintaining separate profiles would become even more cumbersome if we were to, say, move to a 4-node cluster:

HornetQ Server | JBoss Node A   | JBoss Node B   | JBoss Node C   | JBoss Node D
---------------|----------------|----------------|----------------|---------------
Primary        | backup-group-1 | backup-group-2 | backup-group-3 | backup-group-4
Backup 1       | backup-group-2 | backup-group-3 | backup-group-4 | backup-group-1
Backup 2       | backup-group-3 | backup-group-4 | backup-group-1 | backup-group-2

This can easily be done with the system property approach, whereas the separate profiles would continue to grow in complexity.
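To give a feel for the system property approach, the sketch below derives the per-node values for a 4-node cluster and renders them as JVM `-D` arguments. The property names (`hq.primary.group`, `hq.backup1.group`, `hq.backup2.group`) are made up for illustration and are not the names used in the actual configuration; only the rotation of group assignments comes from the tables above.

```python
def node_properties(node_index, node_count, backups_per_node=2):
    """Build hypothetical per-node system properties for one shared profile."""
    roles = ["primary"] + [f"backup{i}" for i in range(1, backups_per_node + 1)]
    return {
        # Property names are assumptions made for this sketch.
        f"hq.{role}.group": f"backup-group-{(node_index + offset) % node_count + 1}"
        for offset, role in enumerate(roles)
    }


def as_jvm_args(properties):
    """Render properties as -Dkey=value arguments for a server start script."""
    return " ".join(f"-D{key}={value}" for key, value in properties.items())


# JBoss Node D is index 3 in a 4-node cluster.
node_d = node_properties(node_index=3, node_count=4)
node_d_args = as_jvm_args(node_d)
print(node_d_args)
```

With this arrangement, growing the cluster only changes the values each node is started with; the shared profile would reference the properties by name and stay the same.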

Coming up in Part 2

In Part 2 we'll talk about common scenarios in which "split-brain" issues can occur with a replication-based approach, dig into the configuration of HornetQ, and show how to manually walk through failover scenarios using a simple test application.


Jiehuan Li

A former Vizuri developer, Jiehuan Li brought more than fifteen years of experience in designing customized IT solutions to Vizuri’s projects.