Use Big Data Technologies to Build a Content Repository Architecture

In today’s knowledge-driven industry, a huge amount of digital information is generated, collected, processed and maintained for future use. Managing it is a key activity and requirement in any organization. The digital content includes business process and planning documents; product requirements, design and maintenance documents; user guides and support manuals; project management planning and execution reports; survey reports; customer or end user feedback and complaints; white papers; memos; archived emails; social media feeds; and so on. Reliably storing and managing this content, and making it available to authorized clients, both human and machine, is a challenging task.

In this article we propose an architecture for a digital content repository that helps store and manage different types of digital content efficiently. The architecture meets the primary design goals of such solutions, including scalability, reliability, extensibility, document versioning, the ability to search stored content, and user access control. It is based on open source big data solutions, which makes the proposed architecture highly scalable and cost effective, with minimal or no vendor lock-in with respect to technology changes, software version upgrades or licensing.

Key Design Goals of Big Data Content Repository

Digital content is highly dynamic (the original content and its associated metadata keep changing due to business requirements) and bulky (the size of a content item varies from a few kilobytes to terabytes and possibly more). One also needs to maintain a backup of the original content for disaster recovery and to handle unexpected content loss. To handle these requirements efficiently, the architecture and design of any digital content management solution should have the following key properties.

  • Extensible: With ever changing business requirements, there is always a need to change the original content or the metadata associated with it. The content repository should allow such changes without affecting the existing content structure.
  • Reliable: Since many business activities and processes depend on the content repository, there must be no data loss of any kind, and the repository should be highly available and fault tolerant (it should be able to detect data corruption and recover fully from it).
  • Scalable: The repository should scale with respect to both storage capacity and the number of requests it can handle. Because business processes generate digital content continuously, the size of the stored content can grow rapidly, and the storage limit should not become a roadblock. Similarly, the architecture should be capable of handling a varying number of user requests.
  • Versioned: It is very common in business for the content of a document to keep changing over time, yet the user should still be able to access all of the versions (changes) of the document. For example, project requirements or design documents change over time. The content repository should support versioning of documents, similar to the versioning of program source files in a version control system (e.g. CVS, SVN or Git).
  • Controlled access: A user should be able to read, write or delete a document if and only if that user has the proper access permissions. Only an authorized user should be allowed to access or modify a document or its metadata, which is critical for any content repository in a production environment.
  • Searchable: The repository should provide an interface through which users can search all or a subset of the documents and metadata stored in the repository using keywords. Without this capability, it is very difficult to find specific content that contains particular words.
  • Cost efficient: The software components used for building the repository should be cost effective, with minimal or no vendor lock-in with respect to technology changes, software version upgrades or licensing.
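To make the "Versioned" goal concrete, here is a minimal in-memory sketch of append-only versioning. It is an illustration only and not how Lily or HBase implement versions; the class name `VersionedDocument` and its methods are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedDocument:
    """Hypothetical helper that keeps every saved version of a document."""

    doc_id: str
    versions: list = field(default_factory=list)

    def save(self, content: bytes, metadata: dict) -> int:
        # Append instead of overwriting, so older versions stay readable.
        self.versions.append((content, dict(metadata)))
        return len(self.versions)  # version numbers start at 1

    def read(self, version=None):
        # Return the latest version when no version number is given.
        if not self.versions:
            raise LookupError("no versions saved")
        index = len(self.versions) if version is None else version
        if not 1 <= index <= len(self.versions):
            raise LookupError(f"unknown version {version}")
        return self.versions[index - 1]
```

Each call to `save` returns a new version number, and `read(1)` still returns the first draft after later saves, which is the behavior the design goal asks for.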

By considering the above mentioned design goals, the following section describes the high-level architecture of the proposed content repository.

Big Data Content Repository Architecture

The diagram in Figure 1 shows the high-level software architecture of the proposed content repository. It shows the software components used in building the repository, along with the flow of commands and data. Most of the components are open source software or libraries that run on Linux.

Figure 1: High-Level Software Architecture of the Content Repository

All end user requests for basic CRUD operations or search queries are handled by a web service module called the User Request Handler. The requests are filtered by the Access Control module, which allows only operations with proper user permissions to pass through and then makes the appropriate call, either to the Document Controller or to the Search Controller. The Document Controller calls the Lily repository to create, read, update or delete records. The Lily repository stores content either in HBase or in HDFS, depending on the content size. The Search Controller acts as a proxy to Apache Solr and provides uniform access for making search queries. It hides all the Solr specific complexities from other client services.
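The request flow described above can be sketched as follows. This is a simplified illustration under stated assumptions: the `PERMISSIONS` table stands in for the User Database, and the names `Operation`, `is_allowed` and `handle_request` are hypothetical, not part of any component in Figure 1.

```python
from enum import Enum


class Operation(Enum):
    CREATE = "create"
    READ = "read"
    UPDATE = "update"
    DELETE = "delete"
    SEARCH = "search"


# Hypothetical stand-in for the User Database: user -> permitted operations.
PERMISSIONS = {
    "alice": {Operation.CREATE, Operation.READ, Operation.UPDATE, Operation.SEARCH},
    "bob": {Operation.READ, Operation.SEARCH},
}


def is_allowed(user, operation):
    # Access Control: unknown users have no permissions at all.
    return operation in PERMISSIONS.get(user, set())


def handle_request(user, operation):
    """Return (status, target), where target names the controller to call."""
    if not is_allowed(user, operation):
        return 403, None  # blocked before reaching any controller
    if operation is Operation.SEARCH:
        return 200, "search_controller"
    return 200, "document_controller"
```

In this sketch a denied request never reaches a controller, which mirrors the filtering role of the Access Control module; a real service would also authenticate the user and return a richer error response.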

Following are descriptions of various components in Figure 1.

  • User Request Handler: This is the only component in the Client Support System with which the repository clients (human or machine) interact. It is a custom written SOAP based or RESTful (Representational State Transfer) web service that allows users to upload, download, delete, update and search documents. The service can be hosted on any open source web container, such as JBoss or Apache Tomcat.
  • Access Control and User Database: To perform any operation in the content repository, a user must have the necessary permissions. The Access Control module enforces this. It filters all user requests coming through the User Request Handler before passing them to either the Document Controller or the Search Controller. Any request from a user without proper access permissions is blocked, and an appropriate error response status is sent back to the user. The User Database can be an existing organization-wide user access directory or a repository specific user database that stores user login credentials and repository access permissions.
  • Document Controller: It can be a Java based module that uses the Lily client API (at present Lily provides a client API only for Java) or a REST client written in any programming language that calls the Lily REST service. In either case it performs basic record level create, read, update and delete operations.
  • Lily Repository: Lily is a distributed and scalable content repository, based on the Apache open source big data platform Hadoop, for storing, searching and retrieving content items, documents or any binary objects. It is a highly distributed, cloud-scale server application that combines HBase and Solr. The Lily repository is designed to be used by any kind of front-end application, through either the Java based Lily API or the Lily REST interface.

    The proposed architecture relies on Lily’s built-in scalability, record versioning and reliability. Features of the Lily repository such as the write-ahead log, message queue and indexer modules make it consistent, reliable and fault tolerant. The indexer module sends record data to Solr for indexing, so that the records can later be searched through the Solr search interface.

  • HBase: HBase is an open source, non-relational (NoSQL), highly scalable and distributed database that runs on top of HDFS (Hadoop Distributed File System). It is written in Java, and its design is based on Google’s BigTable architecture.
  • ZooKeeper: ZooKeeper is a centralized service for maintaining configuration information, naming, distributed synchronization and group services. Distributed applications use these services, and ZooKeeper acts as their coordinator.
  • Hadoop: Apache Hadoop is a Java based software library and a framework that allows distributed processing of large data sets across clusters of multiple processing and storage nodes using a programming model called MapReduce.
  • HDFS: HDFS is the Hadoop distributed file system used for storing large data files on cluster nodes. It is a highly reliable, fault tolerant and scalable file storage system.
  • Apache Solr: Solr is an open source enterprise search platform based on Apache Lucene, with all the major search related features, such as full-text and faceted search, hit highlighting, dynamic clustering, near real time and distributed indexing, load balancing, etc. It can index a rich set of document formats, such as Word, PDF and HTML.
  • Search Controller: This module takes the search query from the User Request Handler. It translates the user request into an appropriate Solr search query, makes the search call to Solr, collects, filters and processes the search results, and sends the response back to the User Request Handler. It acts as a proxy between the User Request Handler and Apache Solr.
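As an illustration of the translation step performed by the Search Controller, the following sketch builds a Solr select URL from user keywords and optional field filters. It uses Solr's standard `q`, `fq`, `start`, `rows` and `wt` request parameters; the function name, the base URL and the field names are assumptions made for this example, and no escaping of Solr query syntax is attempted.

```python
from urllib.parse import urlencode


def build_solr_select_url(base_url, keywords, filters=None, start=0, rows=10):
    """Hypothetical helper: turn a user search request into a Solr select URL."""
    params = [
        ("q", " ".join(keywords)),  # main query built from the user's keywords
        ("start", start),           # offset of the first result (paging)
        ("rows", rows),             # number of results per page
        ("wt", "json"),             # ask Solr for a JSON response
    ]
    # Each filter becomes its own fq (filter query) parameter.
    for field_name, value in (filters or {}).items():
        params.append(("fq", f"{field_name}:{value}"))
    return f"{base_url}/select?{urlencode(params)}"


# Example call; the host, core name and field name are placeholders.
url = build_solr_select_url(
    "http://localhost:8983/solr/documents",
    ["design", "document"],
    filters={"type": "pdf"},
)
```

Keeping this translation in one place is what lets other services stay unaware of Solr specifics; a production controller would additionally escape special characters and restrict results to documents the user may read.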

The use of Hadoop, HDFS, ZooKeeper and HBase makes the proposed architecture highly reliable and scalable. Lily provides extensibility and versioning of content and associated metadata. Apache Solr makes all the content searchable. The Access Control module ensures that no unauthorized access is possible to any content in the repository. Since all the software components used are enterprise class open source software, they are reliable and available free of license cost. The use of open source components also avoids vendor lock-in: if needed, one can change the underlying software at any time (e.g. one can move from JBoss to the Apache Tomcat container at no cost and with minimal or no change to the hosted web application code).

Cluster Setups for Big Data Content Repository

Figure 2 shows a possible topology of the hardware and software components of the content repository in a production environment. We require three clusters of multiple nodes. The first cluster, shown in Figure 2 as Cluster-1, is a cluster of Hadoop nodes. A Hadoop cluster needs one Hadoop master node (Node-1) and one or more slave nodes. The HDFS NameNode and MapReduce JobTracker services run on the Hadoop master node, and the HDFS DataNode and MapReduce TaskTracker services run on all the slaves. Similarly, the HBase master service runs on Node-1, and HBase region server services run on the rest of the nodes. On all the HBase region servers we can run Lily repository services, as recommended by the Lily repository documentation. Any nodes in the cluster can be configured as ZooKeeper servers; the recommendation is to use an odd number of them (1, 3, 5, etc.).

Figure 2: Topology of the Content Repository in Production
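The recommendation to run an odd number of ZooKeeper servers follows from majority quorum arithmetic: an ensemble stays available only while more than half of its servers are up. The short sketch below works through the numbers; the function names are chosen for this example.

```python
def quorum_size(ensemble_size):
    # A strict majority of the ensemble must be reachable.
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    # Servers that can fail while a majority still remains.
    return ensemble_size - quorum_size(ensemble_size)


for n in (1, 2, 3, 4, 5, 6):
    print(f"{n} servers: quorum {quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

An ensemble of 4 servers tolerates one failure, the same as an ensemble of 3, and 6 servers tolerate two, the same as 5. Adding a server to reach an even count therefore adds cost without adding fault tolerance.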

The second cluster, shown as Cluster-2, is for Solr, where we set up one Solr master and the rest of the nodes as slaves. The third cluster hosts the web service application and the document and search controllers. The number of nodes in this cluster depends on the workload and the total number of concurrent users we want to support. Apart from these, one may need load balancers, firewalls, web proxy servers and routers if the clusters are in separate networks or if Internet connectivity is required. The installation and configuration details of the specific software components are beyond the scope of this article; refer to the installation guides on the individual product web sites.

devx-admin
