In today’s knowledge-driven industry, a huge amount of digital information is generated, collected, processed and maintained for future use. Managing this information is a key activity and requirement in any organization. The digital content includes business process and planning documents; product requirements, design and maintenance documents; user guides and support manuals; project management planning and execution reports; survey reports; customer or end user feedback and complaints; white papers; memos; archived emails; social media feeds; and so on. Reliably storing and managing this content, and making it available to authorized clients, both human and machine, is a challenging task.
In this article we propose an architecture for a digital content repository that helps one store and manage different types of digital content efficiently. The architecture meets the primary design goals of such solutions, including scalability, reliability, extensibility, document versioning, the ability to search stored content and user access control. It is based on open source big-data solutions available in the current market, which makes the proposed architecture highly scalable and cost effective, with minimal or no vendor lock-in with respect to new technology changes, software version upgrades or licensing.
Key Design Goals of Big Data Content Repository
Digital content is highly dynamic (the original content and its associated metadata keep changing due to various business requirements) and bulky in nature (the size of the content varies, ranging from a few kilobytes to terabytes and possibly more). One also needs to maintain a backup of the original content for disaster recovery and to handle unexpected content loss. To handle these requirements efficiently, the architecture and design of any digital content management solution should have the following key properties, which make it practically useful for serious business requirements.
- Extensible: With the ever changing business requirements, there is always a need to change the original content or the metadata associated with it. The content repository should provide support for allowing such changes without affecting the existing content structure.
- Reliable: Since many business activities and processes depend on the content repository, there must be no data loss of any kind, and the repository should be highly available and fault tolerant (it should be able to handle data corruption and recover completely).
- Scalable: The repository should be highly scalable with respect to both its storage capacity and the number of requests it can handle. Because business processes generate digital content continuously, the size of the stored content can grow rapidly, and the storage limit should not be a roadblock for any content repository. Similarly, the architecture should be capable of handling a varying number of user requests.
- Versioned: In business it is very common for the original content of a document to keep changing over time, yet the user should still be able to access all of the versions (changes) of the document. For example, project requirements documents or design documents change over a period of time. The content repository should support versioning of documents, similar to the versioning of program source code files in a version control system (e.g. CVS, SVN or Git).
- Controlled access: The read/write/delete operations on a document should be done by a user if and only if he/she has proper access permissions to do so. Only an authorized user should be allowed to access or modify the document or related metadata, which is very critical for any content repository in a production environment.
- Searchable: The repository should provide an interface through which users can search all or a subset of the documents and metadata stored in the repository using search keywords. Without this capability, it is very difficult to find content that contains particular words.
- Cost efficient: The software components used for building the repository should be cost effective, with minimal or no vendor lock-in with respect to new technology changes, software version upgrades or licensing.
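The Extensible and Versioned goals above can be illustrated with a small in-memory sketch. It is not the repository's implementation or the Lily API; the class name, method names and metadata fields are hypothetical and chosen only for illustration. Each save appends a new version, and each version carries its own free-form metadata, so a new metadata field can be introduced without touching older versions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class VersionedDocument:
    """Hypothetical in-memory record that keeps every saved version."""

    doc_id: str
    versions: List[Dict[str, Any]] = field(default_factory=list)

    def save(self, content: bytes, **metadata: Any) -> int:
        """Append a new version and return its 1-based version number."""
        self.versions.append({"content": content, "metadata": dict(metadata)})
        return len(self.versions)

    def read(self, version: int = 0) -> Dict[str, Any]:
        """Return the requested version; 0 means the latest version."""
        if not self.versions:
            raise LookupError(f"{self.doc_id} has no versions")
        if version == 0:
            return self.versions[-1]
        if not 1 <= version <= len(self.versions):
            raise LookupError(f"{self.doc_id} has no version {version}")
        return self.versions[version - 1]


# Illustrative usage: the second version adds a metadata field ("reviewer")
# that the first version does not have, and both versions stay readable.
doc = VersionedDocument("requirements-spec")
v1 = doc.save(b"Draft requirements", author="alice")
v2 = doc.save(b"Reviewed requirements", author="bob", reviewer="carol")
```

A production repository would persist versions durably and enforce access control; the sketch only shows the shape of the data.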
By considering the above mentioned design goals, the following section describes the high-level architecture of the proposed content repository.
Big Data Content Repository Architecture
The diagram in Figure 1 shows the high-level software architecture of the proposed content repository. It shows the software components used in building the repository, along with the flow of commands and data. Most of the components are open source software or libraries that run on Linux.
Figure 1: High-Level Software Architecture of the Content Repository
All end user requests for basic CRUD operations or search queries are handled by a web service module called the User Request Handler. The requests are filtered by the Access Control module, which allows only operations with proper user permissions to pass through and then makes the appropriate call, either to the document controller or to the search controller. The document controller module makes calls to the Lily repository for creating, reading, updating or deleting records. The Lily repository stores content either in HBase or in HDFS, based on the content size. The search controller acts as a proxy to Apache Solr and provides uniform access for making search queries to Solr. It hides all the Solr-specific complexities from other client services.
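The request flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual web service: the permission table stands in for the User Database, the two controller functions are stubs that do not call Lily or Solr, and all names and the "403 Forbidden" response string are hypothetical.

```python
from typing import Dict, Set

# Hypothetical permission table standing in for the User Database.
PERMISSIONS: Dict[str, Set[str]] = {
    "alice": {"create", "read", "update", "delete", "search"},
    "bob": {"read", "search"},
}


def document_controller(operation: str, payload: str) -> str:
    """Stub for the module that would call the Lily repository."""
    return f"document:{operation}:{payload}"


def search_controller(operation: str, payload: str) -> str:
    """Stub for the module that would send the query to Solr."""
    return f"search:{payload}"


def handle_request(user: str, operation: str, payload: str) -> str:
    """Filter a request by permission, then route it to a controller."""
    allowed = PERMISSIONS.get(user, set())
    if operation not in allowed:
        # Unknown users and unpermitted operations are blocked here,
        # before any controller is invoked.
        return "403 Forbidden"
    if operation == "search":
        return search_controller(operation, payload)
    return document_controller(operation, payload)
```

For example, `handle_request("bob", "delete", "spec.pdf")` is rejected by the access check, while `handle_request("bob", "search", "design")` is routed to the search controller stub.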
Following are descriptions of various components in Figure 1.
- User Request Handler: This is the only component in the Client Support System with which all the repository clients (human or machine) interact. It is a custom-written SOAP-based or RESTful (Representational State Transfer) web service that allows users to upload, download, delete, update and search documents. The service can be hosted on any open source web container, such as JBoss or Apache Tomcat.
- Access Control and User Database: To perform any operation in the content repository, a user should have the necessary permissions. This is very important for any serious business application, and the access control module serves this purpose. It filters all the user requests coming through the User Request Handler before delivering them to either the Document Controller or the Search Controller. Any request from a user without proper access permissions is blocked, and an appropriate error response status is sent back to the user. The User Database can be an existing organization-wide user access directory or a repository-specific user database that stores user login credentials and repository access permissions.
- Document Controller: It can be a Java-based module that uses the Lily Java client API (at present Lily provides a language-specific client API only for Java), or a REST client service written in any programming language that calls the Lily REST service. In either case it performs basic record-level create, read, update or delete operations.
- Lily Repository: Lily is a distributed and scalable content repository, based on the Apache open source big data platform called Hadoop, for storing, searching and retrieving content items, documents, or any binary objects. It is a highly distributed and cloud-scale server application that fuses HBase and Solr. The Lily repository is designed to be used by any kind of front-end applications using either the Java based Lily API or through Lily service REST interface.
The proposed architecture uses Lily’s inherent scalability, record versioning and reliability. Various features of the Lily repository, such as the Write Ahead Log, message queue and indexer modules, make it consistent, reliable and fault tolerant. The indexer module sends the record data to Solr for indexing, so that it can later be searched through the Solr search interface.
- HBase: HBase is an open source, NoSQL (non-relational), highly scalable and distributed database that runs on top of HDFS (Hadoop Distributed File System). It is written in Java. Its design is based on Google’s BigTable architecture.
- ZooKeeper: ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization and providing group services. Distributed applications rely on these kinds of services, with ZooKeeper acting as their coordinator.
- Hadoop: Apache Hadoop is a Java based software library and a framework that allows distributed processing of large data sets across clusters of multiple processing and storage nodes using a programming model called MapReduce.
- HDFS: HDFS is the Hadoop distributed file system used for storing large data files on cluster nodes. It is a highly reliable, fault tolerant and scalable file storage system.
- Apache Solr: Solr is an Apache Lucene based open source enterprise search platform with all the major search-related features, such as full-text and faceted search, hit highlighting, dynamic clustering, near real time and distributed indexing, load balancing, etc. It can support a rich set of document formats, such as Word, PDF, HTML, etc.
- Search Controller: This module takes the search query from the User Request Handler. It then translates the user request into an appropriate Solr search query, makes a search call to Solr, collects, filters and processes the search results, and sends the response back to the User Request Handler. It acts as a proxy between the User Request Handler and Apache Solr.
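The translation step performed by the Search Controller can be sketched as below. The sketch only builds a request URL for Solr's standard select handler using the common query parameters `q`, `fq`, `start`, `rows` and `wt`; it does not contact a server. The host, port, core name `documents` and field name `doc_type` are assumptions made for illustration, as are the function names.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed Solr location and core name; a real deployment will differ.
SOLR_SELECT_URL = "http://localhost:8983/solr/documents/select"


def quote_term(term: str) -> str:
    """Wrap a term in double quotes, escaping backslashes and quotes."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_solr_query(keywords: str, doc_type: str = "",
                     page: int = 0, page_size: int = 10) -> str:
    """Translate user search input into a Solr select URL."""
    terms = [quote_term(word) for word in keywords.split()]
    params = [
        # Require every keyword; match all documents if none were given.
        ("q", " AND ".join(terms) if terms else "*:*"),
        ("start", page * page_size),  # offset of the first result
        ("rows", page_size),          # number of results per page
        ("wt", "json"),               # ask for a JSON response
    ]
    if doc_type:
        # Filter query on a hypothetical "doc_type" field.
        params.append(("fq", f"doc_type:{quote_term(doc_type)}"))
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


# Illustrative usage: third page (page index 2) of PDF results, 5 per page.
url = build_solr_query("design review", doc_type="pdf", page=2, page_size=5)
params = parse_qs(urlsplit(url).query)
```

Keeping this translation in one module is what lets the rest of the system stay unaware of Solr's query syntax, as the component description above requires.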
The use of Hadoop, HDFS, ZooKeeper and HBase makes the proposed architecture highly reliable and scalable. Lily provides the extensibility and versioning of content and associated metadata. The use of Apache Solr makes all the content searchable. The access control module makes sure that no unauthorized access is possible to any content of the repository. Since all the software components used are enterprise class open source software, they are reliable, available free of cost and without any licensing limitations. The use of open source components avoids vendor lock-in and, if needed, one can change the underlying software at any time (e.g. one can move from JBoss to the Apache Tomcat Java container at no cost and with minimal or no change in the hosted web application code).
Cluster Setups for Big Data Content Repository
Figure 2 shows a possible topology of the hardware and software components of the content repository in a production environment. We require three clusters of multiple nodes. The first cluster, shown in Figure 2 as Cluster-1, is a cluster of Hadoop nodes. A Hadoop cluster requires one Hadoop master node (Node-1) and one or more slave nodes. The HDFS NameNode and MapReduce JobTracker services run on the Hadoop master node, and the HDFS DataNode and MapReduce TaskTracker services run on all the slaves. Similarly, the HBase master service runs on Node-1 and the HBase region server services run on the rest of the nodes. On all the HBase region servers, we can run Lily repository services, as recommended by the Lily repository documentation. We can configure any nodes from the cluster as ZooKeeper servers; the recommendation is to use an odd number of nodes (1, 3, 5, etc.).
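The recommendation to use an odd number of ZooKeeper servers follows from majority voting: an ensemble stays available only while more than half of its servers are running. The short sketch below works through that arithmetic; the function names are hypothetical and the code does not talk to ZooKeeper.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("an ensemble needs at least one server")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Number of servers that can fail while a majority survives."""
    return ensemble_size - quorum_size(ensemble_size)


# Print the quorum and failure tolerance for small ensembles.
for size in range(1, 7):
    print(size, quorum_size(size), tolerated_failures(size))
```

An ensemble of 4 servers tolerates one failure, the same as an ensemble of 3, and 6 servers tolerate two failures, the same as 5. Adding an even-numbered server therefore adds cost without improving fault tolerance.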
Figure 2: Topology of the Content Repository in Production
The second cluster, shown as Cluster-2, will be for Solr, where we need to set up one Solr master and the rest as slaves. The third cluster will be for hosting the web service application and setting up the document and search controllers. The number of nodes in this cluster will depend on the workload and the total number of users we want to support at a time. Apart from these, one may need load balancers, firewalls, web proxy servers and routers if the clusters are in separate networks or if we want to enable Internet connectivity. The installation and configuration details of the specific software components are beyond the scope of this article. Refer to the installation guides on the individual product web sites.