Distributed systems work by sending information between otherwise independent applications. Traditionally, that communication is done by passing messages between the various nodes. This "message-centric" approach takes many forms, from simple direct transmissions to more complex message queue and transactional systems. All have a common premise: the unit of information exchange is the message itself. The infrastructure's role is to ensure that messages get to their intended recipients.
Recently, another paradigm is becoming popular. In this approach, the distributed infrastructure takes more responsibility; it offers to the distributed system a single version of "truth". The fundamental unit of communication is a data-object value; the infrastructure has done its job not when a message is delivered, but when all nodes have the correct understanding of that value. Because the focus is on the data itself, this is termed "data-centric" infrastructure.
While both types of middleware serve to connect distributed systems, the approaches are quite different. They result in different system capabilities, strengths, and weaknesses.
Table of Contents
- Message-centric infrastructure sends messages of unknown content
- Data centric infrastructure updates data with known content
- Simple thermometer example
- Sharing and managing state
Distributed systems must coordinate multiple independent, concurrent applications. The communication paradigm, how to get the right information from the right producers to the right consumers at the right time, is perhaps the first and most important design consideration. Because it impacts every application, this choice is difficult to change. And, it drives many key properties, including timeliness, scalability, reliability, availability, configuration, and evolution.
Distributed systems must also share and manage state. With multiple processors in the system, even seemingly trivial questions (like "what time is it?") have no crisp answer. In a complex system, every application both generates and uses distributed and local state. Like the communications paradigm, the strategy to understand and manage this state is a fundamental design decision.
There are many, many other considerations, including joining and leaving the network, discovery, error handling, and legacy integration. However, matching the communications paradigm and the state management approach to the system are foremost. This is where the difference between message and data centricity has the greatest impact. We will focus on these factors.
There are several types of message-centric middleware. These include direct point-to-point messaging, queuing, broker, and transactional systems. They support different interaction patterns such as request/reply, remote-method invocation, transactional exchange, queues, and publish subscribe. Current implementations include RabbitMQ, IBM Websphere MQ, and Tibco Rendevous. Popular standards include the Java Message Service (JMS) API, and the Advanced Message Queuing Protocol (AMQP) wire specification.
The important common property for this discussion is that these technologies consider the message itself to be the means of interaction. The infrastructure is unaware of the message contents. With a message-centric approach, developers write applications that send messages between participants.
Data centricity, by contrast, makes the data the means of interaction. A data centric infrastructure must define the data it manages. Then, the infrastructure imposes rules that determine how data is structured, when data is changed, and how it is accessed. There is a notion of a "global data space", where data values, in known structures, are exchanged. The infrastructure is aware of the contents. With a data-centric approach, developers write applications that read and update entries in this data space (Fig. 1).
Message-centric systems usually send "verbs", while data-centric systems update "nouns". For instance, consider a simple system that reads a thermometer and communicates temperature. With message-centric verbs, the thermometer would send messages like "update the temperature for sensor #1221 to 27". The middleware doesn't know this; there is no system notion of sensor ID or temperature. The applications must understand that these messages tell them to change their notion of temperature. The application developer must define functions and parameters and take the right actions.
With data-centric nouns, the sending application sets the value of the known global data object "temperature of sensor #1221" to "27". The middleware knows what that means and distributes the information to nodes that need it. Receiving applications read the "temperature" value from the object, likely upon notification that it has changed.
This difference may not be obvious until you consider what happens, for instance, when an application joins after missing 50 updates to the value of sensor #1221 and 30 to sensor #1222. The message centric system will send all 80 updates, and the receiver must then process each one. The data-centric middleware distributes the data space, and then simply allows the late joiner to read the latest value of each object, only two in this case. Both end up with the right value, but the network traffic and application processing is very different.
Distributed systems must also share and manage state between the various applications. Applications manage state in message-centric designs State is irrelevant to message-centric middleware. Since the middleware does not know what messages contain, the middleware can work only with messages themselves. The middleware can control behaviors of the stream, but not properties of the data contained in messages.
In the temperature sensor example, messages could take many forms. They may contain time-stamped temperature readings, changes in the temperature reading, extrapolations of the temperature reading, or anything else at all. The middleware has no notion of the meaning of messages. It can therefore make no judgment about the importance of any particular message in the stream, regardless of what might come before or after it. The middleware delivers all the indistinguishable messages equally. This implies that the application, as a whole, must manage "truth". Every application reads the messages and ascertains its own version of truth. The application code manages the distributed state.
Data-centric middleware disseminates and manages state. It doesn't send messages about the data; it sends the data itself. Consuming applications then pick up reflections of the global state. To do this, the middleware must know the data schemas, treat instances of data as known objects, and attach behaviors to the data, not to the flow.
Again consider the temperature sensor. The middleware knows the temperature is a floating-point number. It updates that number and then notifies all consumers to read the new value. It can even make judgments about the values, such as storing the last N values, only sending values that exceed a limit (e.g. to an alarm monitor application), or warning applications if the value has not been updated recently. The infrastructure itself manages the distributed state.
This means that the infrastructure is responsible for providing a consistent view of "truth". In a dynamic distributed system, it's not simple to have "absolute" truth, but the middleware can be crisp about exactly how reliable the information is, how old it is, how many past versions are saved, what delivery guarantees there are, what happens if a producer fails, etc. These "Quality of Service" (QoS) settings specify how to manage the state of each data item.
This contrast is not unique to distributed networking. Storage faces similar challenges. Before databases, storage systems were files with application-defined structure. That is simpler, but offered no basis for tools, consistency, or integration. The database arose as a data-centric infrastructure technology that provides one key benefit: a source of consistent truth.
Databases offer a clean conceptual interface to data: create, read, update, and delete (sometimes called CRUD). By enforcing structure and simple rules that control the data model, databases ensure consistency. By exposing the structure to all users, databases greatly ease system integration.
Like a database, data-centric middleware imposes known structure on the transmitted data. The infrastructure, and all its associated tools and services, can now access the data through CRUD-like operations. Clear rules govern access to the data, how data in the system changes, and when participants get updates. As in databases, this accessible source of consistent truth greatly eases system integration. Thus, the data-centric distributed "databus" does for data in motion what the structured database does for data at rest.
Returning to the verb/noun analogy, the key difference is that in a verb-based system, applications interact directly with each other. In a data-centric noun-based system, applications interact with the data model, not directly with each other. This reduces coupling. Verb-based systems need many commands and actions, and this set grows with the number of applications. Noun-based systems support only a few actions on the data, and that set does not grow with the number of applications.
Data-centricity and message-centricity are not black and white opposites. For instance, message-centric designs can use databases or other data-centric means to manage state. Data-centric designs can guarantee acknowledgement of each update message. Some systems even combine both approaches. Nonetheless, infrastructure that fits the problem simplifies application code greatly. Choosing the right paradigm deserves careful thought.
Let's explore some questions that make sense to ask.
Overall, a distributed application has a lot of state. If the infrastructure doesn't manage state, then every application must store and maintain its own state. Unmanaged state leads quickly to inconsistency. This is brittle; inconsistent systems break in many ways. With a data-centric approach, state is managed, redundant producer failover is natural, and multiple consumers can easily access consistent state. In many ways, a data-centric design is easier to make reliable and available.
However, delegating state control to the infrastructure has cost. The system needs a shared data model, so data schema and properties must be globally agreed (or later mediated, a topic beyond this treatment). Messaging systems can be much more lax in their upfront data definitions. Keeping state control in the applications, while an added burden, also gives application developers more design flexibility.
Database engines and tools can operate on the content of tables because of the well-defined data schemas. This "content aware" infrastructure can examine, match, and deliver the right data. By contrast, with just unstructured files, the infrastructure is "blind". The application code must implement and execute the logic to load and search.
Data-centric middleware is also content aware, but for moving data. Content-aware middleware lets new components learn schema and thus interoperate. It enables generic tools, reusable applications, selective delivery, and structured designs.
Content-aware integration can be illustrated through a simple example of Excel integration (Fig. 2). A single generic plugin, with no a priori knowledge of the data, can interpret the structured data types. Then, it can populate the spreadsheet cells with named fields from the global data space. So, without writing any code or configuring structures, all of Excel's math, graphics, and analytics, are immediately usable.
Message-centric approaches treat messages as sacred. They are best when the system design focuses on distributing "things" and "processing" them. Transactional behavior that must assure processing of every message may be more natural with messaging technology. If the information is naturally centralized, client-server messaging is a good fit. Load balancing that doles out single messages across many servers for processing may be easier with a queuing design. These techniques are quite mature with message-centric systems.
On the other hand, data-centric systems excel at scalable information distribution. Sending to many recipients (fanout) is efficient. Using data QoS to find and deliver the information needle in Terabytes of haystack is much more practical. Tight latency or real-time delivery constraints are easier to specify and satisfy. Complex systems with nodes that come and go can take good advantage of discovery and state management. And, because moving the state to the infrastructure simplifies and decouples applications, integrating many different functions is often easier.
Messaging systems are simpler if you just want to send data from point A to point B. Defining data types, specifying data interaction properties, and building the scaffolding required of a data-centric infrastructure may not be worth it for quick projects with limited lifetimes. If the system must be fielded in a few months, is limited to a small number of cooperating applications, is being developed by a cohesive team, and will likely be obsolete in a year, then the more straightforward messaging approach will likely be easier.
On the other hand, if a lasting architecture is a key benefit, consider data-centric infrastructure. It will require more upfront planning. The concepts are likely less familiar, and will therefore seem complex. But, benefits do accrue as the system incorporates more applications, scales, and evolves over time. Distributed teams, complex system integration, and long-term viability easily justify the added initial investment in data centricity.
Even small systems should start with an analysis of the data content and exchange requirements. Next, think through the design implications of each paradigm. What happens if I add this? Which order do things have to start? How fast does this module need data? When do peers find out this module failed? These answers will expose coupling.
Consider performance early. Key fanout, real-time, or load-balancing requirements may suggest an optimal architecture. Plan for evolution. In a distributed system, you can't update all your applications on all your nodes at the same time. So, in reality, you will have a mix at any given time. Although it is critical to start with an understanding of your data model and flow, it's also important to plan for change.
Data centric and message centric infrastructure are both proven communication technologies. Both can solve many problems. Choose the approach that reduces coupling, delivers performance, and supports your long-term architecture.