Approaches to Indexing Multiple Log File Types in Solr and Setting Up a Multi-Node, Multi-Core SolrCloud

Introduction

Apache Solr is a widely used open source search platform built on Apache Lucene indexing. It stores indexed data and is a highly scalable, capable search solution for the enterprise. This article outlines single-core and multi-core approaches to indexing and querying multiple log file types in Solr. Solr indexes the log files generated by the servers and allows the logs to be searched for troubleshooting. It can also scale out to a multi-node cluster that works in a distributed, fault-tolerant manner. These capabilities are collectively called SolrCloud. Solr uses ZooKeeper to coordinate the distributed setup.

Approaches to Indexing Multiple Log File Types

Single Solr schema to index disparate log file types

In the first approach, all log types share a single index and schema definition. In a Solr setup, each core is associated with one Solr schema definition and configuration. Figure 1 shows the high-level architecture of a single Solr index that holds different log file types. Here we use it to index web and application server logs; the logs are indexed according to the schema definition.


Figure 1. Single Solr schema and index for multiple log types

We define the fields of the log file documents that need to be indexed in the schema.xml file. As the log files are indexed, Solr generates the index files, which reside in the data folder of the Solr core.

Suppose these log files have different sets of fields. For example:

Web log fields: date, time, time-taken, cs-method, cs-uri, sc-status, etc.

App log fields: date, time taken, server-name, server-ip, site-name, cs-method, etc.

There are two ways we can generate the schema file for indexing:

  • If the fields are the same in web and app server logs, we can define the field names and types directly in the schema file.
  • If the fields differ between web and app server logs, we define them as dynamic fields in the schema file.
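The dynamic-field option can be sketched in Java. The snippet is only an illustration: it assumes the schema declares a dynamic field pattern such as `*_string` (the suffix used by field names like `csmethod_string` in the sample program in this article), and the class `LogFieldNames` and its naming rule are hypothetical helpers, not part of Solr.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class LogFieldNames {

    /** Turns a raw log column name into a dynamic field name, e.g. "cs-method" becomes "csmethod_string". */
    public static String toDynamicField(String logField, String suffix) {
        // Assumed naming rule: lowercase, drop everything except letters and digits, append the suffix.
        String base = logField.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
        return base + "_" + suffix;
    }

    /** Pairs column names with the values of one split log line, in order. */
    public static Map<String, String> toDocumentFields(String[] columns, String[] values, String suffix) {
        Map<String, String> doc = new LinkedHashMap<>();
        for (int i = 0; i < columns.length && i < values.length; i++) {
            doc.put(toDynamicField(columns[i], suffix), values[i].trim());
        }
        return doc;
    }

    public static void main(String[] args) {
        String[] columns = {"time-taken", "cs-method", "sc-status"};
        String[] values = {"15", "GET", "200"};
        // Prints {timetaken_string=15, csmethod_string=GET, scstatus_string=200}
        System.out.println(toDocumentFields(columns, values, "string"));
    }
}
```

Because the names are derived from the log columns, a new column in either log type needs no schema change as long as it matches the dynamic field pattern.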

Solr Schema Definition

In the Solr schema definition file, define the required common and dynamic fields and types. Define which field should be used as the unique key and then define how to index and search the fields using Solr query.

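schema.xml

    <!-- field and dynamicField definitions were lost from this listing -->
    <uniqueKey>uid</uniqueKey>
    <defaultSearchField>everything</defaultSearchField>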

Consider a scenario in which different kinds of log files must be indexed. We can either generate a separate schema (and therefore a separate index) for each log file type, or merge the fields into a single index. A single index requires an identifier field that marks each document as a web log or an app log.
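As a rough sketch of the single-index option, the following Java snippet tags each record with an identifier before it is indexed. The field names `uid` and `logtype` and the value `weblogs` match the sample program in this article; the value `applogs`, the `servername_string` field and the class `LogTypeTagger` are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LogTypeTagger {

    /** Returns a copy of the record with the unique key and the log type identifier added first. */
    public static Map<String, String> tag(String logtype, String uid, Map<String, String> fields) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("uid", uid);          // unique key of the document
        doc.put("logtype", logtype);  // identifier that separates web logs from app logs
        doc.putAll(fields);
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> webRecord = new LinkedHashMap<>();
        webRecord.put("csmethod_string", "GET");

        Map<String, String> appRecord = new LinkedHashMap<>();
        appRecord.put("servername_string", "app01"); // hypothetical field and value

        System.out.println(tag("weblogs", "web1", webRecord));
        System.out.println(tag("applogs", "app1", appRecord));
    }
}
```

Prefixing the unique key with the log type keeps the keys of the two log types from colliding inside the shared index.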

Sample program to generate an index for different log file types

Set up, configure, and start Solr as described in the Solr documentation. Create the logsearch folder under the Solr folder and generate the schema file for indexing log data. Then start Solr and open the Solr admin UI in a browser.

We have created a sample Java program to index the different kinds of log data. This client program generates the index, and the generated index files are located in the Solr data index location.

package com.apachesolr.infy.client;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collection;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.impl.XMLResponseParser;
import org.apache.solr.common.SolrInputDocument;

public class GenerateSolrIndex {

    /**
     * Gets a Solr connection for the logsearch core.
     */
    public static HttpSolrServer getSolrConnection() {
        // Configure a server object with the actual Solr values.
        HttpSolrServer solrServer = new HttpSolrServer("http://localhost:8983/solr/logsearch");
        solrServer.setParser(new XMLResponseParser());
        return solrServer;
    }

    /**
     * Reads up to 10,000 lines of the web log and converts each
     * six-column line into a Solr document.
     */
    public Collection<SolrInputDocument> addWebLogData() throws IOException {
        File file = new File("D:\\Solarsetups\\samplelogs\\Weblog.log");
        Collection<SolrInputDocument> inputDocuments = new ArrayList<SolrInputDocument>();
        String logtype = "weblogs";
        int i = 0;
        try (BufferedReader bufferedReader =
                new BufferedReader(new InputStreamReader(new FileInputStream(file)))) {
            for (String line; (line = bufferedReader.readLine()) != null && i < 10000;) {
                i++;
                // The log columns are assumed to be tab-separated.
                String[] fields = line.split("\t");
                if (fields.length == 6) {
                    SolrInputDocument inputDocument = new SolrInputDocument();
                    inputDocument.addField("uid", "web" + i);
                    inputDocument.addField("logtype", logtype);
                    inputDocument.addField("weblog_date_tc", fields[0].trim());
                    inputDocument.addField("weblog_time_tc", fields[1].trim());
                    inputDocument.addField("timetaken_string", fields[2].trim());
                    inputDocument.addField("csmethod_string", fields[3].trim());
                    inputDocument.addField("csuri_string_tc", fields[4].trim());
                    inputDocument.addField("scstatus_string", fields[5].trim());
                    inputDocuments.add(inputDocument);
                }
            }
        }
        return inputDocuments;
    }

    /**
     * Indexes the web log data so that it can be tested from the Solr UI.
     */
    public static void main(String[] args) {
        GenerateSolrIndex generateSolrIndex = new GenerateSolrIndex();
        try {
            HttpSolrServer solrServer = getSolrConnection();
            solrServer.add(generateSolrIndex.addWebLogData());
            // Commit so that the added documents become searchable.
            solrServer.commit();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Solr Query

Solr provides an admin UI to test and debug queries and to set the request parameters used when searching indexed data. In the q box we can pass a fielded query such as logtype:weblogs, or pass just the string weblogs or Applogs.

Search indexed data by log file type

Weblogs: Returns only the indexed web server log data
Applogs: Returns only the indexed app server log data
*:*: Returns both app and web log data
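The same searches can be issued over HTTP. The sketch below only builds the request URL; it assumes the logsearch core URL used by the sample program and Solr's standard `/select` handler with the `q` and `wt` parameters. The class name `LogQueryUrl` is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class LogQueryUrl {

    /** Builds a select URL for the given core and query string, URL-encoding the query. */
    public static String selectUrl(String coreUrl, String query) {
        try {
            return coreUrl + "/select?q=" + URLEncoder.encode(query, "UTF-8") + "&wt=json";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String core = "http://localhost:8983/solr/logsearch";
        System.out.println(selectUrl(core, "logtype:weblogs")); // web server logs only
        System.out.println(selectUrl(core, "*:*"));             // both log types
    }
}
```

Opening either URL in a browser, or fetching it with any HTTP client, should return the matching documents once the index has been committed.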

Multiple Solr schemas to index disparate log file types

This is the second approach to building a Solr index. Each log file type has a separate index and schema definition. In a Solr setup, each core is associated with one Solr schema definition and configuration. Figure 2 shows the high-level architecture of multiple Solr indexes, one per log file type. Here we use it to index web and app server logs; the logs are indexed according to their respective schema definitions.


Figure 2. Each log type has separate Solr schema and index

Setting up Multi-Node, Multi-Core Solr Cloud

Why we need SolrCloud

In a typical enterprise scenario, millions of documents may need to be indexed and index sizes may grow to hundreds of gigabytes. In this situation, the indexes need to be distributed across servers and replicated so that the setup can handle failover scenarios. A distributed structure also helps to share the load of search queries across the servers. In the context of SolrCloud we come across terms like index, Solr core and collection.

A single-core, single-instance Solr setup is associated with a single schema, as defined in the schema.xml file of Solr. We define the fields of the documents that need to be indexed in the schema.xml file. As the documents are indexed, Solr generates the index, which resides in the designated data folder of the Solr instance.

To meet quality of service requirements such as high availability, with hundreds of users querying hundreds of gigabytes of indexed files, we need to scale the solution to work in a distributed manner. In such a scenario the indexes are distributed across multiple Solr instances, which may run on a single physical server or on multiple servers. This capability of Solr is called SolrCloud.

Multiple Solr instances are managed by ZooKeeper servers. Solr comes packaged with an embedded ZooKeeper server. The user also has the option to use an external ZooKeeper.


Figure 3. Single core Solr instance

When do you need a multi-core Solr setup?

In an organization, different kinds of documents may need to be indexed for different purposes. For example, a news portal might need to index the articles that the organization publishes as well as its web server logs so that the log files can be searched. The two kinds of documents have completely different sets of fields to index. Even so, we do not need separate Solr instances; two separate cores running as part of a single Solr instance are enough. Each core is associated with a single index. For example, the LogCore maintains the index of the application log files and the ArticleCore maintains the index of the articles the news portal publishes.

A collection in the Solr context is one logical index that may be distributed across multiple Solr cores. In our example, if we scale the multi-core Solr setup to multiple Solr instances, we will have two collections: an article collection and a log collection.


Figure 4. Multi-core Solr instance

How to create a multi-core, multi-instance Solr set up

Here we take the above example and create two cores in our Solr instance: 'articlecore', which stores the index of the articles, and 'logcore', which contains the index of the logs.

The easiest way to create a multi-core Solr setup is to modify the multi-core example that ships with the Solr distribution. For example, if we unzip the Solr distribution at the location Solr-4.1.0, we then go to the /Solr-4.1.0/example/multicore folder. It contains the solr.xml file, which needs to be modified to handle the cores we require.

By default, the multi-core example defines two cores, core0 and core1. For our example, let's rename the 'core0' folder to 'articlecore' and the 'core1' folder to 'logcore'.

Then we need to modify the solr.xml file so that its core entries refer to 'articlecore' and 'logcore' instead of 'core0' and 'core1'.
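As a rough guide to what the modified file may look like, the Java sketch below prints the core entries for a list of core names. It assumes the legacy solr.xml layout used by the Solr 4.x multicore example, in which a `<cores>` element holds one `<core>` entry with `name` and `instanceDir` attributes per core; the solr.xml shipped with your distribution may carry additional attributes, so treat the output as a starting point. The class `MulticoreSolrXml` is a hypothetical helper.

```java
public class MulticoreSolrXml {

    /** Renders a minimal legacy-style solr.xml with one core entry per name. */
    public static String render(String... coreNames) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n");
        xml.append("<solr persistent=\"false\">\n");
        xml.append("  <cores adminPath=\"/admin/cores\">\n");
        for (String name : coreNames) {
            // Each core lives in a folder of the same name under the Solr home.
            xml.append("    <core name=\"").append(name)
               .append("\" instanceDir=\"").append(name).append("\" />\n");
        }
        xml.append("  </cores>\n");
        xml.append("</solr>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        System.out.print(render("articlecore", "logcore"));
    }
}
```

Generating the file from a list of names is optional, but it helps keep the configuration identical when the same cores must be defined on more than one server.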

So now we have one of the instances ready. Our objective is to build a multi-instance, multi-core setup, so we need a similar configuration in another Solr setup. If we are setting up the multi-node Solr on two servers named 'searchbox1' and 'searchbox2', then the configuration described above needs to be completed on both boxes.

As mentioned earlier, we need a ZooKeeper server to take care of the cluster setup. Solr comes with its own embedded ZooKeeper. A single ZooKeeper can manage both of the Solr instances, or we can use an ensemble of ZooKeeper servers to manage them. For our example we will use a single ZooKeeper instance that runs on 'searchbox1'.

From the example directory, we can run the following command to start the Solr instance on 'searchbox1'.

java -Dsolr.solr.home=multicore  -Dbootstrap_conf=true -DzkRun=searchbox1:9983 -DzkHost=searchbox1:9983  -DnumShards=2 -jar start.jar

Details of Arguments:

-Dsolr.solr.home=multicore: This argument tells Solr to use the multicore folder under example as the Solr home. By default, the solr folder under example is used as the Solr home. With this setting, the solr.xml changes we made take effect, so the two cores 'articlecore' and 'logcore' are started along with the Solr instance.

-Dbootstrap_conf=true: This argument tells Solr to upload the Solr configurations to the ZooKeeper server.

-DzkRun=searchbox1:9983: This argument tells Solr to start the embedded ZooKeeper on port number 9983.

-DzkHost=searchbox1:9983: This argument tells Solr that the ZooKeeper that will manage the Solr instance is running on 'searchbox1' on port number 9983.

-DnumShards=2: This argument tells Solr that the index will be divided into 2 shards, one per Solr instance in this setup.
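To make the idea of splitting one logical index into shards more concrete, here is a deliberately simplified Java sketch that assigns a document to a shard from its unique key. It uses plain hash-modulo routing, which is an assumption chosen for brevity and not Solr's actual router; SolrCloud itself hashes the unique key and gives each shard a range of hash values. The class `ShardRoutingSketch` is hypothetical.

```java
public class ShardRoutingSketch {

    /** Maps a unique key to a shard index in the range 0 to numShards - 1. */
    public static int shardFor(String uid, int numShards) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(uid.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int numShards = 2; // mirrors -DnumShards=2
        for (String uid : new String[] {"web1", "web2", "web3", "web4"}) {
            System.out.println(uid + " -> shard" + (shardFor(uid, numShards) + 1));
        }
    }
}
```

The point of the sketch is that the same key always lands on the same shard, so any node can work out where a document lives without consulting the others.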

Once the Solr instance is up on 'searchbox1', we need to go to the example folder on 'searchbox2' and run the command:

java -DzkHost=searchbox1:9983 -Dsolr.solr.home=multicore -jar start.jar

Details of Arguments:

-DzkHost=searchbox1:9983: This argument tells this instance of Solr to use the ZooKeeper running on 'searchbox1'.

Once this command is run, 'searchbox2' joins with 'searchbox1' to create a cluster. The 'articlecore' in 'searchbox1' and 'searchbox2' will be one logical index or collection. The same is the case with 'logcore'.

Kalpana C is a Technology Analyst with the ILCLOUD at Infosys Labs. She has a decade of experience in Java/J2EE, Big Data related frameworks and technologies.

Priyadarshi Sahoo is a Technology Lead at Infosys Ltd. He has more than 8 years of experience in Java/J2EE related technologies.
