Implement Big Data Solutions for an Enterprise Content Repository

In “Use Big Data Technologies to Build a Content Repository Architecture,” we explained the key design goals and requirements for a reliable, scalable and extensible content repository. We also presented an architecture for such a repository built from custom and open source big data solutions, and showed how to deploy its software components on a cluster of nodes.

What we didn’t mention there was how to define the structure of the content and its associated metadata, how to upload them from a client application, where the repository stores them, and how they become available for search. This article covers those topics: how to configure the Lily repository to store custom-formatted content (in the form of Lily records), how to configure it to send stored data to Solr for indexing, and how the internal data structures and configuration fit together, all illustrated with an example use case.

As Figure 1 of the previous article shows, the key component that has to be implemented from scratch is the “Client Support System.” This article therefore also explains how to configure and implement the components of the client support system.

Big Data Content Repository Use Case

To describe the configuration and implementation concepts behind the proposed repository components, let’s begin with a practical use case. Say we want to store multiple types of documents or files, together with their associated metadata, in the repository. The metadata could be the title of the document, a list of author names, the dates of creation and last update, a description of the document, and keywords or tags. The following table summarizes these fields:

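Table-1: Document and associated metadata

|                     | Fields           | Data type |
|---------------------|------------------|-----------|
| Content             | a content file   | BLOB      |
| Associated Metadata | file name        | STRING    |
|                     | file type        | STRING    |
|                     | Title            | STRING    |
|                     | Description      | STRING    |
|                     | keywords or tags | ARRAY     |
|                     | Authors          | ARRAY     |
|                     | created on       | DATETIME  |
|                     | last modified on | DATETIME  |
|                     | last modified by | STRING    |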
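On the client side, the rows of Table-1 could be carried around in a simple value class. The following is a minimal sketch, not part of Lily: the class name `DocumentEntry`, its member names, and the sample values in `main` are assumptions chosen for illustration.

```java
import java.time.Instant;
import java.util.List;

/** Client-side holder for one document and its metadata, one member per row of Table-1. */
final class DocumentEntry {
    final byte[] content;          // BLOB: the content file
    final String fileName;         // STRING
    final String fileType;         // STRING
    final String title;            // STRING
    final String description;      // STRING
    final List<String> keywords;   // ARRAY of keywords or tags
    final List<String> authors;    // ARRAY of author names
    final Instant createdOn;       // date-time of creation
    final Instant lastModifiedOn;  // date-time of last update
    final String lastModifiedBy;   // STRING

    DocumentEntry(byte[] content, String fileName, String fileType, String title,
                  String description, List<String> keywords, List<String> authors,
                  Instant createdOn, Instant lastModifiedOn, String lastModifiedBy) {
        this.content = content.clone();          // defensive copy of the file bytes
        this.fileName = fileName;
        this.fileType = fileType;
        this.title = title;
        this.description = description;
        this.keywords = List.copyOf(keywords);   // immutable copies of the multi-value fields
        this.authors = List.copyOf(authors);
        this.createdOn = createdOn;
        this.lastModifiedOn = lastModifiedOn;
        this.lastModifiedBy = lastModifiedBy;
    }

    public static void main(String[] args) {
        // All values below are made-up sample data.
        DocumentEntry entry = new DocumentEntry(
                new byte[] {0x50, 0x4B}, "Article.doc", "application/msword",
                "A sample article", "Illustrative description",
                List.of("big data", "repository"), List.of("First Author", "Second Author"),
                Instant.parse("2013-03-25T09:27:36Z"), Instant.parse("2013-03-25T09:27:36Z"),
                "First Author");
        System.out.println(entry.title + " by " + entry.authors);
    }
}
```

The two ARRAY rows map naturally to lists, which matches the multi-value fields that the repository model supports.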

Following the proposed architecture, we store all user-provided information in the Lily repository. Lily’s repository model offers rich field and record types, a flexible schema, multi-value fields, versioning of records and fields, and support for different languages, which makes it well suited to storing structured as well as unstructured content. The following section explains this in greater detail.

The Lily Repository Configuration

Many traditional content repositories (e.g. Apache Jackrabbit) use a file system metaphor to structure their content; the Lily repository has no such hierarchy. This frees the user from having to think about a primary organization of the content, or to decide where in a hierarchy each new entity should be stored. The Lily repository is like one big collection of records, and the user can access any record using the record ID provided when the record was created. The unit of every CRUD operation in the Lily repository is a Lily record, and a Lily record is a collection of Lily fields. Hence, to store the document and its associated metadata in the Lily repository, we first have to create Lily field types and a record type.

Creating Schema in Lily Repository for Storing Documents

To store content in Lily, we first have to create a field type for each field of the content. Lily supports a Java API and a REST interface for all repository operations, but to keep the explanation simple we will use the import tool provided with the Lily repository, called “lily-import.” It takes as input a JSON file containing the schema, that is, the field types and record types for the fields listed in Table-1.

Create a file containing the schema JSON, e.g. document_schema.json in the user’s home folder, and run the following command in the Lily installation directory to create the schema for the document and its associated metadata.

$ bin/lily-import -s ~/document_schema.json

Example Code

Creating Records in Lily Repository

Once the schema has been created in the Lily repository, we can create records. Let’s use the cURL utility to POST the record contents through the REST interface exposed by the Lily repository. In our use case, creating a record with a blob field “file_content” takes two steps: (1) upload the document file as a blob and (2) create the “Document” record with a reference to the uploaded blob. A blob is uploaded by POSTing it to the /repository/blob REST resource of the Lily repository.

Upload the File as a Blob

To create a new blob, the POST request must specify the HTTP headers “Content-Type” and “Content-Length”; the latter is used to determine the blob storage location, either HDFS or HBase, based on the blob size. The ‘curl’ command automatically sends the “Content-Length” header based on the size of the file Article.doc. The example assumes that the Lily server is running on localhost on port 12060.

$ curl -XPOST localhost:12060/repository/blob --data-binary @/home/user/Article.doc \
    -H 'Content-Type: application/msword' -D -

Following is the response from Lily server for the above POST request.

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Date: Mon, 25 Mar 2013 9:27:36 GMT
Accept-Ranges: bytes
Server: Restlet-Framework/2.0snapshot
Content-Length: 91

{
  "value": "RwByAGUAYQB0ACAAUAByAGUAcwBlAG4Ad",
  "mimeType": "application/msword",
  "size": 5686365
}

The JSON object at the end of the response is what we will use as the value of the “file_content” blob field when creating the document record.
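The same upload could also be issued from Java. The following is a minimal sketch using the JDK’s `java.net.http` API (Java 11 or later); the class and method names are assumptions, and the regular-expression helper is only a convenience for this small response, not a general JSON parser.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of the blob upload step: builds the request and reads a value out of the reply. */
final class BlobUploadSketch {

    /** Builds (but does not send) the POST request for the /repository/blob resource. */
    static HttpRequest buildUploadRequest(String lilyBaseUrl, byte[] fileBytes, String mimeType) {
        return HttpRequest.newBuilder(URI.create(lilyBaseUrl + "/repository/blob"))
                .header("Content-Type", mimeType)
                // Content-Length is not set by hand: like curl, the JDK HTTP client
                // derives it from the size of the body.
                .POST(HttpRequest.BodyPublishers.ofByteArray(fileBytes))
                .build();
    }

    /** Returns the string value of one member of a flat JSON object, or null if absent. */
    static String stringMember(String json, String name) {
        Pattern p = Pattern.compile("\"" + Pattern.quote(name) + "\"\\s*:\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(json);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        HttpRequest request = buildUploadRequest(
                "http://localhost:12060", new byte[] {1, 2, 3}, "application/msword");
        System.out.println(request.method() + " " + request.uri());

        // Response body as shown in the article.
        String body = "{ \"value\": \"RwByAGUAYQB0ACAAUAByAGUAcwBlAG4Ad\", "
                + "\"mimeType\": \"application/msword\", \"size\": 5686365}";
        System.out.println(stringMember(body, "mimeType"));
    }
}
```

To actually send the request, pass it to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` while the Lily server is running; the returned body can then be kept verbatim for the record-creation step.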

Create the “Document” Record

Run the following command to create a record of record type “document_ns$Document” in the Lily repository:

Example Code
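A request body for this step could be assembled along the following lines. This is a minimal sketch under stated assumptions, not the original listing: the `action`/`record` envelope and the `namespaces` mapping follow the general shape of Lily’s JSON format as we understand it and should be checked against the Lily REST documentation; the namespace URI `org.example.document` and the field names `file_name` and `title` are hypothetical. Only the record type “document_ns$Document” and the blob field “file_content” come from the text above.

```java
/** Builds a JSON body for creating a "Document" record that references an uploaded blob. */
final class DocumentRecordSketch {

    /** Minimal JSON string quoting: escapes backslashes and double quotes only. */
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /**
     * @param blobJson the JSON object returned by the blob upload, passed on verbatim
     * @param fileName value for the (hypothetical) file_name field
     * @param title    value for the (hypothetical) title field
     */
    static String createRecordJson(String blobJson, String fileName, String title) {
        return "{\n"
                + "  \"action\": \"create\",\n"                      // assumed envelope
                + "  \"record\": {\n"
                + "    \"type\": \"document_ns$Document\",\n"
                + "    \"fields\": {\n"
                + "      \"document_ns$file_content\": " + blobJson + ",\n"
                + "      \"document_ns$file_name\": " + quote(fileName) + ",\n"
                + "      \"document_ns$title\": " + quote(title) + "\n"
                + "    },\n"
                // Hypothetical namespace URI mapped to the document_ns prefix.
                + "    \"namespaces\": { \"org.example.document\": \"document_ns\" }\n"
                + "  }\n"
                + "}";
    }

    public static void main(String[] args) {
        String blobJson = "{ \"value\": \"RwByAGUAYQB0ACAAUAByAGUAcwBlAG4Ad\", "
                + "\"mimeType\": \"application/msword\", \"size\": 5686365 }";
        System.out.println(createRecordJson(blobJson, "Article.doc", "A sample article"));
    }
}
```

The resulting string would be sent with the header `Content-Type: application/json`, for example as the `-d` argument of a `curl -XPOST` call to the record resource of the Lily REST interface.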

To update an existing record, we can use the PUT method, and to delete one, we can make an HTTP DELETE call to the Lily REST interface. The syntax of all these methods is described on the Lily documentation site.

Linking Lily with Apache Solr

In the architecture proposed in “Use Big Data Technologies to Build a Content Repository Architecture,” we use Apache Solr as the search engine to which all documents and their associated metadata are pushed for indexing. A module of the Lily repository called the “Indexer” is responsible for keeping the Solr index up to date whenever a record is created, updated or deleted. The job of the indexer is to map Lily records onto Solr documents based on its configuration, which determines the set of records, and the fields of those records, that need to be indexed. The indexer extracts the content of blob fields using the Tika library. To make this happen, we have to tell the Lily repository which Solr instance to use for indexing. This is done by running the following command from the Lily installation directory.

$ bin/lily-add-index -z localhost:2181 -c samples/dynamic_indexerconf/dynamic_indexerconf.xml \
    -n genericindex -s shard1:http://localhost:8983/solr

Here, the -z option specifies the host:port pair of the ZooKeeper service, -c points to the indexer configuration file, -n names the index, and -s maps the shard name shard1 to the URL of the running Solr server. The file dynamic_indexerconf.xml describes the fields and records we want Solr to index. The file mentioned here ships with the Lily installation and is configured to index every field and record stored in Lily. To query the Solr search engine, please refer to the Apache Solr tutorial documentation.
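Once records have been indexed, a search request against Solr is an HTTP GET on its select handler. The following is a minimal sketch of building such a URL in Java; it assumes a single-core Solr reachable at the base URL used above (a multi-core setup would include the core name in the path) and queries the default search field, since the index field names produced by the indexer configuration are not covered here.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Builds a query URL for Solr's /select handler. */
final class SolrQuerySketch {

    /**
     * @param solrBaseUrl e.g. "http://localhost:8983/solr"
     * @param query       the user's search text, URL-encoded by this method
     * @param rows        maximum number of hits to return
     */
    static URI selectUri(String solrBaseUrl, String query, int rows) {
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8);
        // q, rows and wt are standard Solr request parameters; wt=json asks for a JSON response.
        return URI.create(solrBaseUrl + "/select?q=" + q + "&rows=" + rows + "&wt=json");
    }

    public static void main(String[] args) {
        System.out.println(selectUri("http://localhost:8983/solr", "big data", 10));
    }
}
```

The URI can be fetched with any HTTP client, for example `curl` or the JDK’s `HttpClient`.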

RESTful Interfaces of the “User Request Handler”

All user requests are handled by the “User Request Handler” module of the client support system. It could be a RESTful or SOAP-based web service hosted on any of the popular Java web containers, e.g. JBoss or Apache Tomcat. Here we describe the “User Request Handler” as a RESTful web service implemented using Jersey, the open source JAX-RS (JSR 311) reference implementation in Java for building RESTful web services.

The following code snippet is the Java class representing the document resource, with GET, POST and DELETE methods for downloading, uploading and deleting a document and its associated metadata in the Lily repository. The detailed implementation is omitted for brevity, since it varies from case to case based on the underlying business requirements.

Example Code
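To make the contract of such a resource concrete without pulling in Jersey, the sketch below swaps in the JDK’s built-in `com.sun.net.httpserver.HttpServer` and an in-memory map in place of the Lily repository. It is a stand-in for illustration only: the class name, the `documentId` header name, the status codes, and the use of a random UUID as the record key are assumptions, and metadata handling is left out. In a Jersey implementation the same three operations would be JAX-RS annotated methods on the resource class.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Dependency-free stand-in for the document resource (JDK HttpServer, not Jersey). */
final class DocumentEndpointSketch {
    static final String BASE_PATH = "/repository/document";

    // In-memory stand-in for the Lily repository; a real handler would call Lily here.
    private final Map<String, byte[]> store = new ConcurrentHashMap<>();

    /** "/repository/document/article_doc" -> "article_doc"; null when no id is present. */
    static String documentIdFromPath(String path) {
        if (path == null || !path.startsWith(BASE_PATH + "/")) {
            return null;
        }
        String id = path.substring(BASE_PATH.length() + 1);
        return id.isEmpty() ? null : id;
    }

    /** GET: returns the stored bytes, or null if the id is unknown. */
    byte[] downloadDocument(String documentId) {
        return documentId == null ? null : store.get(documentId);
    }

    /** POST: stores the bytes under a newly generated unique key and returns that key. */
    String uploadDocument(byte[] content) {
        String id = UUID.randomUUID().toString();
        store.put(id, content);
        return id;
    }

    /** DELETE: removes the document; returns false if the id is unknown. */
    boolean deleteDocument(String documentId) {
        return documentId != null && store.remove(documentId) != null;
    }

    /** Dispatches one HTTP exchange to the three operations above. */
    void handle(HttpExchange ex) throws IOException {
        int status;
        byte[] body = new byte[0];
        switch (ex.getRequestMethod()) {
            case "GET":
                byte[] doc = downloadDocument(documentIdFromPath(ex.getRequestURI().getPath()));
                status = doc == null ? 404 : 200;
                body = doc == null ? body : doc;
                break;
            case "POST":
                String id = uploadDocument(ex.getRequestBody().readAllBytes());
                body = id.getBytes(StandardCharsets.UTF_8);
                status = 201;
                break;
            case "DELETE":
                // "documentId" is a hypothetical header name.
                status = deleteDocument(ex.getRequestHeaders().getFirst("documentId")) ? 204 : 404;
                break;
            default:
                status = 405;
        }
        // A length of -1 tells HttpServer that no response body follows.
        ex.sendResponseHeaders(status, body.length == 0 ? -1 : body.length);
        if (body.length > 0) {
            ex.getResponseBody().write(body);
        }
        ex.close();
    }

    /** Starts the server; pass port 0 to let the operating system pick a free port. */
    HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("localhost", port), 0);
        server.createContext(BASE_PATH, this::handle);
        server.start();
        return server;
    }
}
```

Calling `new DocumentEndpointSketch().start(8080)` serves the three operations under http://localhost:8080/repository/document until `stop(0)` is called on the returned server.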

Making a GET Call

To download a document from the repository, make a GET call to the service. Assuming that the WAR file is deployed on the web container under the name “repository.war” on host localhost and port 8080, the GET URL will be:

http://localhost:8080/repository/document/article_doc

Here, the record is identified by passing the documentId (in our case, article_doc) as the path parameter to the method. The downloadDocument() method of the DocumentResource class could send the requested document in the HTTP response body and the associated metadata in the respective response headers.

Making a POST Call

To upload a document into the repository, the client application needs to make a POST call to the repository service at the following URL:

http://localhost:8080/repository/document

The document itself arrives in the request body and is accessible through the InputStream parameter; the associated metadata is passed as header values with the POST call. The uploadDocument() method uses this information to create a new Lily record of record type “document_ns$Document,” with the document as a BLOB and the metadata as the other fields of the record. The method needs to assign a unique key to each Lily record and can pass it back to the client program via the Response object. Following is an example Java code snippet that makes the POST call to the above URL of the repository service.

Example Code
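A client for this POST call might look roughly like the following. It is a minimal sketch using the JDK’s `java.net.http` API (Java 11 or later) rather than any particular client library; the class name and the metadata header names (`title`, `authors`) are hypothetical, since the service is free to define its own.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.LinkedHashMap;
import java.util.Map;

/** Builds the upload request: document bytes in the body, metadata in headers. */
final class UploadClientSketch {

    static HttpRequest buildUpload(String serviceUrl, byte[] document, Map<String, String> metadata) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(serviceUrl))
                .header("Content-Type", "application/octet-stream")
                .POST(HttpRequest.BodyPublishers.ofByteArray(document));
        // Each metadata entry becomes one request header (names are hypothetical).
        metadata.forEach(builder::header);
        return builder.build();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("title", "A sample article");
        metadata.put("authors", "First Author,Second Author");

        HttpRequest request = buildUpload(
                "http://localhost:8080/repository/document", new byte[] {1, 2, 3}, metadata);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
    }
}
```

Passing the request to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` performs the call once the service is running; the response body or a response header would then carry the key assigned to the new record.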

Making a DELETE Call

The uploaded documents and metadata can be deleted by making a DELETE call on the following URL:

http://localhost:8080/repository/document

The client needs to send the document ID in an HTTP header, and the method replies with the operation status in the Response. Following is an example Java code snippet that makes the DELETE call to the above URL of the repository service.

Example Code
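A client for this DELETE call might look roughly like the following. It is a minimal sketch using the JDK’s `java.net.http` API (Java 11 or later); the class name and the header name `documentId` are assumptions, as the text only states that the ID travels in an HTTP header.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Builds and sends the DELETE request for one document. */
final class DeleteClientSketch {

    /** Builds the request; the document ID is carried in a (hypothetically named) header. */
    static HttpRequest buildDelete(String serviceUrl, String documentId) {
        return HttpRequest.newBuilder(URI.create(serviceUrl))
                .header("documentId", documentId)
                .DELETE()
                .build();
    }

    /** Sends the request and returns the HTTP status code reported by the service. */
    static int send(HttpRequest request) throws IOException, InterruptedException {
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }

    public static void main(String[] args) {
        HttpRequest request = buildDelete("http://localhost:8080/repository/document", "article_doc");
        System.out.println(request.method() + " " + request.uri());
        // send(request) would perform the call once the service is running.
    }
}
```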

Summary

This article provided a brief outline of how to configure the components of the content repository architecture proposed in “Use Big Data Technologies to Build a Content Repository Architecture.” The details of how to implement each module vary from designer to designer and also depend on the technical competence of the development team. What we have tried to provide here is one possible approach, using one set of technology components.

In this article we explained how to configure the Lily repository to store custom record types with different fields, and how to configure Lily to push record content to Apache Solr for indexing. We also provided the signatures of the RESTful interfaces for uploading, downloading and deleting documents with their associated metadata, along with Java-based clients for the POST and DELETE methods.

References

  1. Lily data repository documentation
  2. Tika library
  3. Jersey
  4. Apache Solr tutorial documentation