Cluster Setups for Big Data Content Repository
Figure 2 shows a possible topology of the various hardware and software components of the content repository in a production environment. Essentially, we require three clusters of multiple nodes. The first cluster, shown in Figure 2 as Cluster-1, is a cluster of Hadoop nodes. In the Hadoop cluster, we need to set up one Hadoop master node (Node-1) and one or more slave nodes. On the Hadoop master node, the HDFS Name-Node and MapReduce Job-Tracker services will run; on all the slaves, the HDFS Data-Node and MapReduce Task-Tracker services will run. Similarly, the HBase master service will run on Node-1, and the HBase Region services will run on the rest of the nodes. On all the HBase region servers, we can run Lily Repository services, as recommended by the Lily repository documentation. Any nodes from the cluster can be configured as ZooKeeper servers, with the recommendation that the total be an odd number (1, 3, 5, and so on) so that the ensemble can always form a majority quorum.
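To make the topology concrete, the following is a minimal configuration sketch for such a cluster. The hostnames (node1, node2, node3) and data paths are assumptions for illustration; a three-node ZooKeeper ensemble is listed in zoo.cfg, and hbase-site.xml points HBase at that same quorum.

```
# zoo.cfg (on each ZooKeeper server) -- hostnames are hypothetical
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=node1:2888:3888
server.2=node2:2888:3888
server.3=node3:2888:3888
```

```
<!-- hbase-site.xml (on HBase master and region servers) -->
<configuration>
  <!-- HDFS Name-Node runs on the Hadoop master (node1 here, an assumption) -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://node1:9000/hbase</value>
  </property>
  <!-- Fully distributed mode, as required for a multi-node production cluster -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <!-- The odd-numbered ZooKeeper ensemble configured above -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>node1,node2,node3</value>
  </property>
</configuration>
```

The exact property names and ports may vary between versions; consult the HBase and ZooKeeper documentation for the release you deploy.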
Figure 2: Topology of the Content Repository in Production
The second cluster, shown as Cluster-2, is for Solr, where we need to set up one Solr master and the rest as slaves. The third cluster is for hosting the web service application and for setting up the document and search controllers. The number of nodes in this cluster depends on the workload and the total number of concurrent users we want to support. Apart from these, one may need load balancers, firewalls, web proxy servers and routers, in case the clusters are in separate networks or we want to enable Internet connectivity. The installation and configuration details of the individual software components are beyond the scope of this article; refer to the installation guides on the respective product web sites.
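As a rough illustration of the Solr master/slave arrangement, the fragment below sketches the legacy Solr replication handler in solrconfig.xml. The master hostname (solr-master) and polling interval are assumptions; each slave polls the master's replication handler for committed index updates.

```
<!-- solrconfig.xml on the master: publish the index after commits -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- solrconfig.xml on each slave: poll the master periodically -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://solr-master:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```

A load balancer in front of the slaves can then distribute search traffic, while all index updates are sent to the master only.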