Application architects design distributed applications based largely on their computing resources and network infrastructure. The goal is to ensure that users have ready access to computing resources, and that those computing resources have access to application data. While the object-oriented development approach is useful for developing applications in general, a data-centric approach is better for designing and developing distributed applications.
This article introduces the data-centric approach, explaining how to design with data-centric principles and implement data-centric applications. You will learn how data-centric design and data-oriented implementation can enable more robust and scalable distributed systems that also are easier to maintain and enhance over time.
Editor’s Note: The authors, Dr. Gerardo Pardo-Castellote and Supreet Oberoi, are respectively the CTO and VP Engineering for Real-Time Innovations, Inc., a vendor of real-time information networking solutions. We have selected this article for publication because we believe it to have objective technical merit.
The Challenge of Building Distributed Applications
Consider a programmatic financial-trading system with subsystems running trading algorithms based on market data that they receive from data feeds such as Reuters. The data feeds provide information such as open market orders, daily highs and lows for a symbol, volume, number of trades, and closes, among other things. In addition, many of the subsystems produce data that is essential to the successful operation of some others, or that at the very least would significantly enhance their ability to produce quality results. For example, one of the subsystems can produce information such as long CCI index and short CCI index that is useful to other systems making trading decisions. These systems are connected through a standards-based network, so data communication between endpoints is common and continuous.
One way to ensure that these disparate systems leverage data from one another across a network is by examining the data requirements of each distributed end-point and drawing one-to-one interfaces between those end-points. Once the relationship is established, the data is passed between those two end-points, likely using a polling or producer/consumer architecture. However, there are significant challenges in building and maintaining distributed applications with this approach.
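To make the coupling concrete, here is a minimal Python sketch of a one-to-one interface with polling. The class and field names (`FeedQuote`, `MarketDataFeed`, `TradingSubsystem`, `poll_typical_price`) are hypothetical and chosen only for illustration; they do not come from any real trading system.

```python
from dataclasses import dataclass


@dataclass
class FeedQuote:
    """Record layout of one specific (hypothetical) market-data feed."""
    symbol: str
    high: float
    low: float
    close: float


class MarketDataFeed:
    """Hypothetical producer that holds the latest quote per symbol."""

    def __init__(self):
        self._latest = {}

    def update(self, quote):
        self._latest[quote.symbol] = quote

    def latest(self, symbol):
        return self._latest[symbol]


class TradingSubsystem:
    """Consumer wired directly to one producer and its record layout."""

    def __init__(self, feed):
        self.feed = feed  # the consumer must know exactly which producer to ask

    def poll_typical_price(self, symbol):
        quote = self.feed.latest(symbol)  # polling a known endpoint
        # This line depends on the exact field names of FeedQuote. A feed
        # upgrade that renames or restructures them breaks the consumer.
        return (quote.high + quote.low + quote.close) / 3


feed = MarketDataFeed()
feed.update(FeedQuote("ACME", high=12.0, low=10.0, close=11.0))
typical = TradingSubsystem(feed).poll_typical_price("ACME")
print(typical)
```

Every consumer built this way holds a reference to its producer and to the producer's data structure, which is the source of the maintenance problems described next.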
Complex Lifecycle Management
The application architect cannot assume that the network and application architecture is static and unchanging. If a market-data feed is swapped for an upgraded pipe, the algorithmic code that consumes it has to be adapted to recognize and work with the new market-data format. Further, trading subsystems could be co-dependent on one another; they may use data from one another to successively refine initial estimates, or to locate and successively target trading prices and volumes.
Designing these distributed systems is complex, and maintaining them throughout the system lifecycle can be a technical and logistical nightmare. Every upgrade and system modification will require extensive testing to ensure that changes did not introduce incompatibilities. Code changes will be more likely than not.
Coordinating between servers, and between servers and clients, is technically difficult. With direct one-to-one interfaces between endpoints, upgrading any part of the system requires code changes and exhaustive testing of the new configuration. Further, if new data is made available on the network, the other endpoints will require additional code even if they do not plan to use the data. All one-to-one connections will need code modification and extensive testing.
This characteristic makes scaling distributed applications very challenging. Because one-to-one data connections are fragile, and because each new endpoint can force changes to the underlying code of existing ones, expanding the application to a larger network with more endpoints is technically difficult.
The Data-Centric Approach
The one-to-one approach requires that the data producer and the data consumer each know of the other's existence and location, and that they use identical data structures to exchange a data item. This approach can be efficient under many circumstances, but it is often easier to begin the application design process from a data-centric perspective instead. Rather than looking at the specific data requirements of processing endpoints, it may make more sense to approach the design from the standpoint of what data is generated through acquisition or processing. If the data can be useful to any other process, it can be made available without knowing where or when it might be used.
The emerging alternative for asynchronous data transfers is publish-subscribe (or “pub-sub”). In this model, data sources, or producers, publish data to a known location on the network. This could be a memory-to-memory transfer, if high performance is required, or it could be a database or other persistent storage.
Processes that need that data can subscribe to it through the messaging service. When published data arrives at the shared location, a message goes out to the subscribers. Subscribers can then go to the shared location to obtain the data, and use it in their own processing.
A publish-subscribe model for data distribution enables the implementation of such a data-centric architecture across a large-scale network. For example, in the programmatic-trading system, a node can publish CCI index data to a known location on the network, and the other trading subsystems can subscribe to the CCI index data.
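The flow just described (publish to a known location, notify subscribers, let them fetch the data) can be sketched in a few lines of Python. This is a toy, in-process illustration under stated assumptions, not real networking middleware: `DataBus`, `TradingSubsystem`, the `"cci-index"` topic name, and the sample values are all hypothetical.

```python
from collections import defaultdict


class DataBus:
    """Toy in-process stand-in for publish-subscribe middleware (hypothetical)."""

    def __init__(self):
        self._store = {}                       # topic -> latest published sample
        self._subscribers = defaultdict(list)  # topic -> notification callbacks

    def subscribe(self, topic, on_update):
        self._subscribers[topic].append(on_update)

    def publish(self, topic, sample):
        self._store[topic] = sample            # write to the known location
        for notify in list(self._subscribers[topic]):
            notify(topic)                      # tell subscribers that data arrived

    def read(self, topic):
        return self._store[topic]


class TradingSubsystem:
    """Hypothetical consumer that knows the topic name, not the producer."""

    def __init__(self, name, bus):
        self.name = name
        self.bus = bus
        self.received = []

    def on_update(self, topic):
        # After being notified, fetch the sample from the shared location.
        self.received.append(self.bus.read(topic))


bus = DataBus()
momentum = TradingSubsystem("momentum", bus)
risk = TradingSubsystem("risk", bus)
bus.subscribe("cci-index", momentum.on_update)
bus.subscribe("cci-index", risk.on_update)

# The publisher names only the topic; it holds no reference to any subscriber.
sample = {"symbol": "ACME", "long_cci": 112.5, "short_cci": -48.0}
bus.publish("cci-index", sample)
print(momentum.received == risk.received == [sample])
```

Adding a third subscriber in this sketch takes one more `subscribe` call and no change to the publishing code.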
With the publish-subscribe approach, you can upgrade or add endpoints without having to change code or even exhaustively retest the resulting configuration. Certainly if new data is available on the network, other endpoints may require additional code in order to make use of that data, but in practice, this is significantly simpler than modifying and testing a large number of specific one-to-one connections.
A commonality of data formats isn’t necessary in a publish-subscribe model. Because the source and destination of a given data item are unknown to each other, any required data conversion occurs at the handoff. In fact, data consumers may well have different data format requirements, necessitating individual conversions based on those needs. The middleware that provides the publish-subscribe services can manage any required data conversions.
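One way to picture conversion at the handoff is a bus that applies a per-subscriber converter at delivery time. The following Python sketch assumes a hypothetical `ConvertingBus` class and two illustrative converters (`to_cents`, `to_csv_row`); real middleware would expose its own mechanism for this.

```python
from collections import defaultdict


def identity(sample):
    """Default converter: hand the sample over unchanged."""
    return sample


class ConvertingBus:
    """Toy bus that converts each sample per subscriber (hypothetical)."""

    def __init__(self):
        self._subscriptions = defaultdict(list)  # topic -> [(convert, handler)]

    def subscribe(self, topic, handler, convert=identity):
        self._subscriptions[topic].append((convert, handler))

    def publish(self, topic, sample):
        for convert, handler in list(self._subscriptions[topic]):
            handler(convert(sample))  # conversion happens at the handoff


def to_cents(sample):
    """Consumer-specific format: integer cents instead of a float price."""
    return {"symbol": sample["symbol"], "close_cents": round(sample["close"] * 100)}


def to_csv_row(sample):
    """Consumer-specific format: one comma-separated text row."""
    return f'{sample["symbol"]},{sample["close"]:.2f}'


bus = ConvertingBus()
raw, cents, rows = [], [], []
bus.subscribe("closes", raw.append)                       # wants the published form
bus.subscribe("closes", cents.append, convert=to_cents)   # wants integer cents
bus.subscribe("closes", rows.append, convert=to_csv_row)  # wants text rows

bus.publish("closes", {"symbol": "ACME", "close": 11.25})
print(raw, cents, rows)
```

The publisher emits one format, and each consumer receives the data in the form it asked for without either side knowing about the other.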
Designing with Data-Centric Principles
Data-centricity provides a guide for designing distributed applications in general. Many system architects of distributed applications today use procedural or object-oriented principles in creating the fundamental design, often using UML sequence, class, and state diagrams.
These design methodologies tend to treat transporting and consuming data as second-class citizens, focusing instead on the step-by-step processes by which endpoints make computations and produce actionable results. A data-oriented methodology, on the other hand, focuses on the flow of data through the application.
Taking a closer look at the alternative design/development methodologies only highlights the issues. Procedure-oriented approaches focus on the computational aspects of the application, and concentrate on the device endpoints of the network where almost all processing occurs. Data is structured in order to assist with the computation, rather than making the data available in the first place. Once the computational processes are designed, the problem of getting data to those processes remains. This makes data movement almost an afterthought, and problems of timeliness and data formats become more difficult to address. Ultimately, the focus on computation means that the endpoint devices must have functions and data that are known and accounted for by one another.
Object-oriented methodologies base designs on the definition of objects and their interactions. Because objects are defined by their computational processes (methods) as well as data, this approach appears to have a good balance between code and data. However, object-oriented methodologies presume that data is an inherent part of the object. The data formats are fixed within the object, and the methods act upon the data only within the context of the object. There is no inherent provision for exchanging data between objects, or for adapting data to the formats required for various processes.
The overriding characteristic of these common design methodologies is that they envision a distributed application as a set of processes, designed as procedures or objects. Data is assumed to be contained within each of the processes. Getting data to and from a particular process is a problem to be addressed after all of the processes are defined.
Alternatively, a data-centric methodology provides a much more natural and streamlined way of viewing and modeling many distributed applications. Such a methodology focuses on the data that is moving and transforming in the system, rather than the processes that are performing those actions.
In other words, the processes encapsulated in the endpoints become secondary. The flow of data defines the essential aspects of the application, and can be modeled with UML class and interaction diagrams. That is not to say that the computational processes are not important, but rather that they are not essential at a high level of design in understanding and mapping out the application. In real-time systems, getting the data to the process that requires it, when it requires it, is essential.
Implementing Data-Centric Applications
Data-centric applications can be implemented using the precepts of data-oriented programming. In general, the tenets of data-oriented programming include the following principles:
- Expose the data. Ensure that the data is visible throughout the entire system. Hiding the data makes it difficult for new processing endpoints to identify data needs and gain access to that data.
- Hide the code. Conversely, none of the computational endpoints has any reason to be cognizant of another’s code. By abstracting away from the code, data is free to be used by any process, no matter where it was generated. This provides for data to be shared across the distributed application, and for the application to be modified and enhanced during its lifecycle.
- Separate data and code into data-handling and data-processing components. Data handling is required because of differing data formats, persistence, and timeliness, and is likely to change during the application lifecycle. Conversely, data processing requirements are likely to remain much more stable. By separating the two, the application becomes easier to maintain and modify over time.
- Generate code from process interfaces. Interfaces define the data inputs and outputs of a given process. Having well-defined inputs and outputs makes it possible to understand and automate the implementation of the data-processing code.
- Loosely couple all code. With well-defined interfaces and computational processes abstracted away from one another, endpoints and their computations can be interchanged with little or no impact on the distributed application as a whole.
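As a rough illustration of the principles listed above, the Python sketch below declares the exchanged data explicitly, keeps format-specific data handling apart from data processing, and lets the processing function depend only on its declared input and output. All names (`PriceBar`, `TypicalPrice`, `parse_csv_bar`, `parse_dict_bar`, `typical_price`) and both input formats are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceBar:
    """Exposed data: the declared input of the process."""
    symbol: str
    high: float
    low: float
    close: float


@dataclass(frozen=True)
class TypicalPrice:
    """Exposed data: the declared output of the process."""
    symbol: str
    value: float


# Data handling: format-specific and expected to change over the lifecycle.
def parse_csv_bar(line):
    symbol, high, low, close = line.strip().split(",")
    return PriceBar(symbol, float(high), float(low), float(close))


def parse_dict_bar(record):
    return PriceBar(record["sym"], record["hi"], record["lo"], record["cl"])


# Data processing: depends only on the declared interface, so it stays
# stable when a feed format changes or a new handler is added.
def typical_price(bar):
    return TypicalPrice(bar.symbol, (bar.high + bar.low + bar.close) / 3)


from_csv = typical_price(parse_csv_bar("ACME,12.0,10.0,11.0"))
from_dict = typical_price(parse_dict_bar({"sym": "ACME", "hi": 12.0, "lo": 10.0, "cl": 11.0}))
print(from_csv == from_dict)
```

Supporting a third feed format in this sketch means writing one more parser; `typical_price` and any code that consumes `TypicalPrice` stay untouched.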
Table 1 summarizes these and other principles, and it offers a comparison with object-oriented development tenets in order to contrast the two different approaches. The data-oriented approach enforces attention on the data rather than on the processes that manipulate the data.
|Object-Oriented Programming Principles|Data-Oriented Programming Principles|
|Hide the data (encapsulation)|Expose the data (with MR format)|
|Expose methods (code)|Hide the code|
|Intermix data and code|Separate data and code|
|Mobile code|Must agree on data mapping, mapping system|
|API/object model|Messages are primary data model or schema|
|Combined processing, no restrictions|Strict separation of parser, validator, transformer, and logic|
|Changes: Read and change code|Changes: Change declarative data file|
|Tightly coupled|Loosely coupled|
Table 1. A Comparison of Data-Oriented and Object-Oriented Programming Principles
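The "change declarative data file" row of Table 1 can be illustrated with a field mapping that lives in data rather than in code. In this Python sketch the mappings are inline JSON strings standing in for a configuration file; the field names, `MAPPING_V1`, `MAPPING_V2`, and `apply_mapping` are assumptions made for the example.

```python
import json

# Declarative mappings: agreed field name -> field name used by a given feed.
# In practice these would be loaded from a file; inline JSON keeps the sketch
# self-contained.
MAPPING_V1 = json.loads('{"symbol": "sym", "close": "cl"}')
MAPPING_V2 = json.loads('{"symbol": "ticker", "close": "last_price"}')


def apply_mapping(record, mapping):
    """Rename feed-specific fields to the agreed names; ignore extra fields."""
    return {target: record[source] for target, source in mapping.items()}


old_feed = apply_mapping({"sym": "ACME", "cl": 11.0}, MAPPING_V1)
new_feed = apply_mapping({"ticker": "ACME", "last_price": 11.0, "venue": "X"}, MAPPING_V2)
print(old_feed == new_feed)
```

When the feed format changes, only the mapping is edited; `apply_mapping` and everything downstream of it remain as they were.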
The data-oriented approach to application design is effective in systems where multiple data sources are required for successful completion of the computing activity, but those data sources reside in separate nodes on a network in a net-centric application infrastructure. For network-centric distributed applications, applying a data-oriented programming model lets you focus on the movement of data through the network, an easier and more natural way of abstracting and implementing the solution.
Data as the Design and Implementation Focal Points
Data-centric design and data-oriented implementation can bring about a more robust and scalable distributed system, and one that is easier to maintain and enhance over time. For real-time distributed applications that are highly dependent upon the movement of data through the system, the advantages of using data as the design and implementation focal points can make the difference for a successful project.