The Data-Centric Approach
A traditional point-to-point approach requires that the data producer and data consumer know both the existence and location of one another, and that they use identical data structures to exchange data. This approach can be efficient under many circumstances, but it is often easier to begin the application design process from a data-centric perspective instead. Rather than looking at the specific data requirements of processing endpoints, it may make more sense to approach the design from the standpoint of what data is generated through acquisition or processing. If the data can be useful to any other process, it can be made available without knowing where or when it might be used.
An emerging mechanism for such asynchronous data transfers is publish-subscribe (or "pub-sub"). In this model, data sources, or producers, publish data to a known location on the network. This could be a memory-to-memory transfer, if high performance is required, or it could be a database or other persistent storage.
Processes that need that data can subscribe to it through the messaging service. When published data arrives at the shared location, a message goes out to the subscribers. Subscribers can then go to the shared location to obtain the data, and use it in their own processing.
A publish-subscribe model for data distribution enables the implementation of such a data-centric architecture across a large-scale network. Returning to the programmatic-trading example, a node can publish CCI index data to a known location on the network, and the other trading subsystems can subscribe to that data.
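The mechanics described above can be sketched in a few lines. This is a minimal in-process illustration, not any specific middleware API: the `Broker` class, the topic name `cci-index`, and the sample CCI values are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Any, Callable

# Minimal sketch of the publish-subscribe pattern: producers write to a
# known location, subscribers are notified, then read the data from there.
class Broker:
    def __init__(self) -> None:
        self._store: dict[str, Any] = {}  # the shared "known location"
        self._subscribers: dict[str, list[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[Any], None]) -> None:
        # A consumer registers interest in a topic, not in a producer.
        self._subscribers[topic].append(callback)

    def fetch(self, topic: str) -> Any:
        return self._store[topic]

    def publish(self, topic: str, data: Any) -> None:
        self._store[topic] = data                # write to the shared location
        for notify in self._subscribers[topic]:  # a message goes out to subscribers
            notify(self.fetch(topic))            # subscribers obtain the data

broker = Broker()
received = []
# A trading subsystem subscribes without knowing who produces the data.
broker.subscribe("cci-index", received.append)
# The producing node publishes; every subscriber is notified automatically.
broker.publish("cci-index", {"symbol": "CCI", "value": 102.5})
```

Note that the producer and consumer never reference one another; either side can be replaced or duplicated without touching the other's code.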
With the publish-subscribe approach, you can upgrade or add endpoints without having to change code, or even exhaustively test the resulting configuration. Certainly, if new data becomes available on the network, other endpoints may require additional code in order to make use of it, but in practice this is significantly simpler than modifying and testing a large number of specific one-to-one connections.
A commonality of data formats isn't necessary in a publish-subscribe model. Because the source and destination of a given data item are unknown to one another, any required data conversion occurs at the handoff. In fact, data consumers may well have different data format requirements, necessitating individual conversions based on those needs. The middleware that provides the publish-subscribe services can manage any required data conversions.
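Conversion at the handoff can be sketched as follows. Each subscriber registers its own converter alongside its callback, and the broker applies that converter at delivery time. The `ConvertingBroker` class and the choice of converters are illustrative assumptions; real middleware would supply this machinery.

```python
import json
from typing import Any, Callable

# Sketch of per-subscriber data conversion at the handoff point: the
# producer publishes one format, and each consumer receives its own.
class ConvertingBroker:
    def __init__(self) -> None:
        # Each entry pairs a consumer's converter with its callback.
        self._subscribers: list[tuple[Callable[[dict], Any], Callable[[Any], None]]] = []

    def subscribe(self, convert: Callable[[dict], Any],
                  callback: Callable[[Any], None]) -> None:
        self._subscribers.append((convert, callback))

    def publish(self, data: dict) -> None:
        # The middleware applies each subscriber's conversion at delivery.
        for convert, callback in self._subscribers:
            callback(convert(data))

broker = ConvertingBroker()
as_json, as_pairs = [], []
broker.subscribe(json.dumps, as_json.append)                    # this consumer wants JSON text
broker.subscribe(lambda d: sorted(d.items()), as_pairs.append)  # this one wants key/value pairs
broker.publish({"value": 102.5, "symbol": "CCI"})
```

The producer publishes a single representation and remains unaware that its two consumers receive the data in different formats.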
Designing with Data-Centric Principles
Data-centricity provides a guide for designing distributed applications in general. Many system architects of distributed applications today use procedural or object-oriented principles in creating the fundamental design, often using UML sequence, class, and state diagrams.
These design methodologies tend to treat transporting and consuming data as second-class citizens, focusing instead on the step-by-step processes by which endpoints make computations and produce actionable results. A data-oriented methodology, on the other hand, focuses on the flow of data through the application.
Taking a closer look at these alternative design methodologies only highlights the issues. Procedure-oriented approaches focus on the computational aspects of the application, concentrating on the device endpoints of the network where almost all processing occurs. Data is structured to assist with the computation, rather than to make it broadly available in the first place. Once the computational processes are designed, the problem of getting data to those processes remains; data movement becomes almost an afterthought, and problems of timeliness and data formats become more difficult to address. Ultimately, the focus on computation means that the endpoint devices must have functions and data that are known and accounted for by one another.
Object-oriented methodologies base designs on the definition of objects and their interactions. Because objects are defined by their computational processes (methods) as well as data, this approach appears to have a good balance between code and data. However, object-oriented methodologies presume that data is an inherent part of the object. The data formats are fixed within the object, and the methods act upon the data only within the context of the object. There is no inherent provision for exchanging data between objects, or for adapting data to the formats required by various processes.
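The limitation can be seen in a small sketch. The classes below are hypothetical, chosen only to illustrate the point: each object fixes its own internal data format, so moving data between them requires ad-hoc glue code that knows both formats.

```python
# Hypothetical classes illustrating data locked inside objects.
class CelsiusSensor:
    def __init__(self, celsius: float) -> None:
        self._celsius = celsius  # data format is fixed within the object

    def reading(self) -> float:
        return self._celsius

class FahrenheitDisplay:
    def __init__(self) -> None:
        self._fahrenheit = 0.0   # a different format, also fixed internally

    def show(self, fahrenheit: float) -> None:
        self._fahrenheit = fahrenheit

    def current(self) -> float:
        return self._fahrenheit

sensor = CelsiusSensor(20.0)
display = FahrenheitDisplay()
# Neither object provides for the exchange; the glue code must know
# both formats and perform the conversion itself.
display.show(sensor.reading() * 9 / 5 + 32)
```

Every new pairing of objects with different internal formats demands another piece of conversion code like this, which is exactly the coupling a data-centric design seeks to avoid.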
The overriding characteristic of these common design methodologies is that they envision a distributed application as a set of processes, designed as procedures or objects. Data is assumed to be contained within each of the processes. Getting data to and from a particular process is a problem to be addressed after all of the processes are defined.
Alternatively, a data-centric methodology provides a much more natural and streamlined way of viewing and modeling many distributed applications. Such a methodology focuses on the data that is moving and transforming in the system, rather than the processes that are performing those actions.
In other words, the processes encapsulated in the endpoints become secondary. The flow of data defines the essential aspects of the application, and can be modeled with UML class and interaction diagrams. That is not to say that the computational processes are not important, but rather that they are not essential at a high level of design in understanding and mapping out the application. In real-time systems, getting the data to the process that requires it, when it requires it, is essential.