It used to be, within most organizations or enterprises, that data existed in a box. That box may have been a file system containing a repository of word processing files and spreadsheets, or it may have been a dedicated SQL-based relational database management system (RDBMS) storing the relevant information as tables. Just as typically, the data within the box was not immediately in human-readable form; instead, applications were developed to provide both visual and programmatic interfaces between the data and its users.
Over time, as the number of data boxes proliferated, the applications had to talk with one another, which of course introduced another layer of interface, this one eventually gaining the moniker of Enterprise Application Integration, or EAI. Once an acronym exists, whole suites of commercial tools designed to satisfy its perceived need follow shortly on its heels, and EAI was no exception. EAI tools often tended to encourage vendor lock-in, as it is generally easier to integrate applications that are built upon the same foundation technology.
Significantly, while companies poured tens of millions of dollars down the EAI black hole, the actual ability to make use of the data in the boxes did not significantly improve, in part because the nature of that data was changing. As the Internet and its HTTP protocol came in time to overtake most other communication protocols, they provided a way of connecting first static documents, then dynamically generated documents (which were, in and of themselves, a form of EAI), then increasingly streams of data presented in formats such as XML, JSON, YAML and so forth.
The EAI industry addressed this by promoting the SOAP specification and a Service-Oriented Architecture (or SOA), which used the existing EAI methodology of integrating applications through messaging conduits. Save for the different form of the protocol, most SOA was not significantly different in its approach from the use of CORBA or, earlier, EDI (electronic data interchange). Not surprisingly, SOA became the champion for a new generation of proprietary software offerings and consulting specialists. Yet what was, at least to the industry, somewhat more surprising was that, in the end, most SOA solutions did not in fact offer significant savings in operation, did not reduce (and in many cases complicated) the underlying integration challenge, and often created systems that were only slightly less fragile than the box-to-box customization solutions that had become the norm, a finding echoed by Mark Madsen of Third Nature in his white paper The Role of Open Source in Data Integration.
While much of this was going on in the Enterprise Application Integration and related Business Intelligence (BI) space, there were changes in the way that system architects were beginning to think about the whole nature of data and databases. Syndication feeds represented a way of presenting dynamically changing document collections (such as blogs or news reports) without having to create an explicit dependency upon a given database or application suite.
The Rise of New Protocols
The rise of blogging software starting in 2003 made it possible to distribute such documents, which in turn led to the rise of new protocols such as the Atom Publishing Protocol (which used the Atom XML format to act as a conduit) for encapsulating both links to resources (blog posts initially, but more generalized data as time went by) and publishing metadata. Such systems in turn tracked the growth of the REST and RESTful Services memes, as mechanisms for accessing distributed resources in a formal database fashion.
Such solutions, which are increasingly being seen as mechanisms for transporting generalized blocks of data rather than just documents (which can be thought of as one form of data), are becoming collectively known either as Data as a Service (DaaS) or Data Integration (DI). DI systems emerged as an outgrowth of primarily open source tools and toolsets, because the DI approach is a logical successor to the syndication architecture used by RSS (as well as, in slightly modified form, to the Web REST APIs used by Twitter and Facebook, among many others). In all cases, the key is the use of a data set consisting of metadata about the collection of resources, along with individual entries with associated publishing metadata blocks, possible containment of the actual data in those blocks, and one or more links to the producer of that “blob” of data (whether static file, converted database entry, or XML resource).
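The feed structure described here (collection metadata, entries carrying publishing metadata, optional inline data, and links back to the producer) can be made concrete with a small sketch. It uses Python's standard xml.etree.ElementTree module and the Atom namespace; the feed itself, including its titles, identifiers and example.com URLs, is invented purely for illustration, and parse_feed is a hypothetical helper rather than part of any particular DI toolset.

```python
import xml.etree.ElementTree as ET

# Namespace prefix for Atom elements, in ElementTree's {uri}tag notation.
ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed: every title, id and URL below is made up for illustration.
FEED_XML = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Quarterly sales reports</title>
  <id>urn:example:feeds:sales</id>
  <updated>2010-03-01T12:00:00Z</updated>
  <entry>
    <title>Q4 2009</title>
    <id>urn:example:reports:2009-q4</id>
    <updated>2010-01-15T09:30:00Z</updated>
    <author><name>Finance</name></author>
    <link rel="alternate" href="http://example.com/reports/2009-q4.xml"/>
    <content type="text">Revenue summary for the fourth quarter.</content>
  </entry>
  <entry>
    <title>Q1 2010</title>
    <id>urn:example:reports:2010-q1</id>
    <updated>2010-03-01T12:00:00Z</updated>
    <author><name>Finance</name></author>
    <link rel="alternate" href="http://example.com/reports/2010-q1.xml"/>
  </entry>
</feed>
"""


def parse_feed(xml_text):
    """Split a feed into collection metadata and per-entry records."""
    root = ET.fromstring(xml_text)
    feed = {
        # Metadata about the collection as a whole.
        "title": root.findtext(ATOM + "title"),
        "id": root.findtext(ATOM + "id"),
        "updated": root.findtext(ATOM + "updated"),
        "entries": [],
    }
    for entry in root.findall(ATOM + "entry"):
        feed["entries"].append({
            # Publishing metadata for this entry.
            "title": entry.findtext(ATOM + "title"),
            "id": entry.findtext(ATOM + "id"),
            "updated": entry.findtext(ATOM + "updated"),
            "author": entry.findtext(ATOM + "author/" + ATOM + "name"),
            # Inline data is optional; None means the data is only linked.
            "content": entry.findtext(ATOM + "content"),
            # Links back to whatever produced the underlying data.
            "links": [link.get("href") for link in entry.findall(ATOM + "link")],
        })
    return feed


feed = parse_feed(FEED_XML)
for item in feed["entries"]:
    source = "inline" if item["content"] is not None else "linked only"
    print(item["title"], item["updated"], source, item["links"])
```

Note that the second entry carries no content element at all: the consumer learns that the resource exists, when it changed, and where to fetch it, without the feed having to hold the data itself.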
This was, ultimately, a completely vendorless solution, yet one that for a number of reasons spread quickly through the Internet as a preferred mechanism for data architecture. Open source solutions in general tend to thrive best in environments where comparatively little differentiation is possible, and where there is consequently little cost benefit in moving from one service provider to another that could in turn create the profit differential needed to satisfy investors and stockholders.
Yet the advantages of a Data Integration approach over Enterprise Application Integration are considerable. Most EAI systems assume some kind of underlying transactional operation, which in turn affects how the data is presented and passed, and in many cases they see the role of the messaging format (SOAP with potential enclosures) as simply that of a mediator between distributed method calls, to be converted back into binary objects at the other end. The DI/DaaS approach, on the other hand, assumes a CRUD (create, read, update, delete) orientation on collections of resources, and lets the consumer of the passed information actually perform the relevant processing. This in turn means that it does not become necessary for DI/DaaS systems to physically retain the information in local data stores, and as a consequence it significantly reduces the siloization that is such a hallmark of complex EAI (especially SOA) systems.
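A minimal sketch of that CRUD orientation follows. The ResourceCollection class is a hypothetical in-memory stand-in for a collection that would normally sit behind a RESTful endpoint; the HTTP verbs in the comments reflect a common convention rather than anything a given product mandates, and the order records are invented for illustration.

```python
import itertools


class ResourceCollection:
    """Hypothetical in-memory stand-in for a remotely exposed collection."""

    def __init__(self):
        self._items = {}
        self._ids = itertools.count(1)

    def create(self, data):
        # Conventionally POST /collection
        rid = next(self._ids)
        self._items[rid] = dict(data)
        return rid

    def read(self, rid):
        # Conventionally GET /collection/{id}; returns a copy of the record.
        return dict(self._items[rid])

    def update(self, rid, data):
        # Conventionally PUT /collection/{id}; replaces the whole record.
        if rid not in self._items:
            raise KeyError(rid)
        self._items[rid] = dict(data)

    def delete(self, rid):
        # Conventionally DELETE /collection/{id}
        del self._items[rid]

    def list(self):
        # Conventionally GET /collection; each record is tagged with its id.
        return [dict(item, id=rid) for rid, item in self._items.items()]


# The collection only stores and serves resources.
orders = ResourceCollection()
first = orders.create({"customer": "Acme", "amount": 120})
second = orders.create({"customer": "Globex", "amount": 80})
orders.update(first, {"customer": "Acme", "amount": 150})
orders.delete(second)

# The consumer, not the service, performs the relevant processing.
total = sum(order["amount"] for order in orders.list())
print(total)
```

The service exposes no business methods at all. Everything beyond storing and serving resources, here a simple total, happens on the consumer's side, which is the contrast with the method-call model drawn above.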
In the long term, the competitive advantage that open software has in this space, combined with the underlying difficulty in establishing the differentiating barriers and transactional gradients that are characteristic of large scale “integrated” systems, means that it is unlikely that commercial proprietary solutions will significantly challenge the existing open software market in the Data Integration space. That doesn’t mean that there isn’t some potential for this: static CRUD solutions often have significant limitations of their own, which suggests that dynamic CRUD solutions (RESTful services built around non-traditional data abstractors, such as the XQuery language, and data repositories) will likely provide a thin but sufficient layer to make DI/DaaS both powerful and cost effective.
Moreover, such solutions make the implicit assumption that data is pervasive: far from being locked up in boxes, the data space is constantly in motion over the Internet and associated intranets, and data integration is far more likely to involve reaching out and tapping into a data stream than building dedicated data boxes.