Java Application Instrumentation with Log4J

pplication metrics, such as performance metrics, are key to understanding and improving application efficiency. Profiling and monitoring tools yield valuable information on CPU and resource usage, including OS and network metrics, JVM metrics, component- or method-level metrics, application server metrics, and HTTP (Web Server) request metrics. Such metrics can be used to detect bottlenecks and forewarn of production issues.

Find out how J2EE profiling and monitoring tools really work, and then create your own using log interception and log4j. No expensive tools or vendor-specific solutions are needed, just lightweight statistic derivation over your most basic application-level information source—your logs.

Approaches to Metric Collection
You can extract application-centric metrics, including performance metrics, in many ways. However, no single approach provides a total end-to-end solution that covers all the aspects of an enterprise application. J2EE-centric tools typically leverage one or more of the following approaches:

  1. Log/trace scrapping
    • Web server logs that conform to well known formats such as the Common
    • Log Format (CLF) can be processed for information on HTTP operations.
    • Verbose GC output from the JVM can be analyzed for GC bottlenecks.
  2. JVM hooks
    • Profilers plug into the JVM itself using the Profiler Interface (JVMPI) (deprecated in 1.5) and Tool Interface (JVMTI) introduced in Java 1.5.
    • Java Management Extensions (JMX) to publish statistics and notify of state changes and error conditions
    • JDBC call interception to measure the efficiency and performance of SQL operations
  3. External services
    • Operating system services such as ps, top, netstat, vmstat, and sar for resource information, and telnet, rsh, and ftp for executing commands to extract such information remotely
    • SNMP
  4. Homegrown
    • Code interleaved with homegrown statistic collection and derivation functions, and typically reported via logs

The code instrumentation approach is commonly used to insert code that measures memory, threads, and timing at a very fine-grained level, usually the method level. Code instrumentation is done by either post-processing class files or substituting code on the fly inside the JVM using JVMPI/JVMTI. Once instrumented, metrics can be logged, cached, persisted, or in an agent architecture, forwarded onto to a separate server to manage. The advantage of an agent architecture is that metrics can be captured and collated across multiple sources or tiers.

Production System Concerns
Code instrumentation sounds nice in theory but it’s intrusive. You may not want to mess with your byte codes. Any type of code injection or modification—no matter how benign—can potentially change the way your application behaves. Log scraping is less intrusive, but verbose traces and logs come at a cost. Any process that increases I/O (Disk or network) or takes resources away from your application, either directly or indirectly, will impact performance.

Other issues with profiling and monitoring in production environments include:

  • Production platform outside your control
    • You may not have the option of deploying profiling tools onto your live systems.
    • Services such as rsh/telnet may be disabled or secured.
  • Tools intrusiveness
    • All approaches are intrusive to one extent or another—it is just a matter of degree. Tools can impact application performance. Measurement may affect the application itself or the actual metrics. Approaches such as web log scraping are external and therefore less intrusive but cannot tell you what’s happening deeper inside your application.

If you built statistics collection from the ground up [e.g., via some homegrown mechanism or perhaps Aspects (AOP)], then you can leverage it in production. The extensibility and intrusiveness of such approaches depends on the design. Hard-coded approaches can be inflexible, and unless the design allows for some sort of fine-grained metric configuration and control, then new metrics may require a code change.

Log Interception
Information related to application and performance metrics is sometimes already present in your application logs—you just need to be able to tap into them. Log interception is a variant of log scraping. Whereas most tools process log files after the fact, log4j enables you to intercept and process log messages on the fly without incurring the cost of I/O. Log4j runs in the same process space as the application, so a log interceptor is somewhat similar to an agent. It can access the JVM for information on memory, threads, and timings. The issue with log interception is data interpretation: how to turn log messages into useful metrics.

Unlike Web server logs, application logs are application and component specific. Although log4j facilitates message formatting, text generated from an application is typically unstructured and difficult to process. Given that you know which text is significant, one way to interpret such data is using regular expressions. Regular expressions are ideally suited for matching patterns in free-form text; log messages can be filtered on the fly for text that matches metric-related inputs. Matches can then be collated with JVM information and values scraped from the message to derive application metrics.

Log interception is simple, flexible, and lightweight. The I/O overheads are low and the actual message processing very efficient as regular expressions can be pre-compiled. It is a production-friendly approach that can be plugged into mature applications without having to modify source or byte-codes.

Implementing a Log Interceptor Using Log4
In order to step through an actual log interceptor implementation, let’s first establish some key abstractions (see Figure 1):

  • Reading
    A measurement of time, memory, and scrapped log value at a point in time
  • Metric
    A data-transfer object for one or two readings (A metric triggers a change to a statistic value.)
  • Filter
    A filter to be applied to every log message (It defines a match expression for detecting significant text and an optional scrape pattern for scraping a numeric value from a message.)
  • Statistic
    A user-defined statistic of interest (It specifies one or two filters and a calculator, for example, number of users logged onto the system or method duration.)
  • Calculator
    A component for deriving statistic results from metrics (For example, Count or Average)
  • MetricCollector
    The central component of the log interception framework, responsible for detecting matches, turning matches into readings and metrics, and deriving statistics
 
Figure 1. Class Diagram of Key Abstractions

Statistics may be classified as single or dual. Single statistics can be derived from a single match in a log (e.g., counts and rates) and therefore specify a single filter. Dual statistics can be derived from two related matches in a log (e.g., deltas, durations, memory consumption, and running counts) and specify two filters. Calculations such as minimum, maximum, and average can be applied to either type. The unit of measurement can be over whatever data the JVM or the log provides, including:

  • Time (e.g., Duration)
  • Memory (e.g., Memory consumption)
  • Occurrences: number of times a match occurs (e.g., count and running count)
  • A user defined value scraped from the log message itself
  • Combination of the above (e.g., rate: occurrences over time)

To understand how a statistic may be derived, take a simple example: number of users logged onto a system. Imagine that every time a user logs in or out a log message is recorded in the log of the form “login {user}” or “logout {user}”. To derive the statistic, each log message is filtered for occurrences of either pattern. If a login match is found, then the number of users is incremented. If a logout pattern is found, then the number of users is decremented.

Statistic Configuration
The simplest way to set up pattern matches is to adopt a configuration approach where statistics and patterns are configured in a property file. Regular expressions can be loaded and compiled when the log interceptor is created. (Note that if your logs are precisely structured, then you can use String functions in lieu of regular expressions.)

Define a statistic ‘UserCount’ to represent the number of users currently logged on to the system by setting up two filters, one for each pattern:

statistic.UserCount.description=Number of users logged onto the systemstatistic.UserCount.calculator=runningcountstatistic.UserCount.first.match=login {.*statistic.UserCount.second.match=logout .{*

Matches either pattern will be sent to the runningcount calculator to process.

Extending Log4j, Creating an Appender
To plug in your own log-processing component, create an Appender class that extends the log4j adapter class AppenderSkeleton. Log messages configured to go to the appender will be passed in via the append() method. To intercept log messages, implement the append() method and redirect messages into the metric collection mechanism. Be careful not to consume your own logs—ignore all messages that come from your own components:

public class Stat4jAppender extends AppenderSkeleton {	protected static String c;	public Stat4jAppender() {		c = Util.getCategory();	}	protected void append(LoggingEvent logEvent) {		// don't scrap our own logs		if (logEvent.categoryName.equals(c))			return;		// direct log to metric capture mechanism										  
MetricCollector.getInstance().applyLog(logEvent.getRenderedMessage()); }..}

To hook into log4j, configure an appender (stat4j) in your log4j configuration file (Click here to access the complete source for stat4j):

log4j.appender.stat4j=au.com.stat4j.log4j.Stat4jAppender

Configure the stat4j appender for all log categories that you want intercepted. To avoid the I/O associated with other appenders, set the additivity flag to false. This will guarantee that target log messages are sent to the log interceptor and not passed to any other appenders:

log4j.logger.test=DEBUG,stat4jlog4j.additivity.test=false

At this point you have set up a basic log interceptor using the log4j Appender mechanism. All logs logged to the test category will be sent to the stat4jAppender, which will forward messages into the metric collector class.

Detecting Matches and Generating Readings
Create a main processing class (MetricCollector) for managing log filtering, and reading generation and statistics derivation. Use a singleton pattern to ensure one instance per JVM. That way, data can be centralized and correlated across threads:

public class MetricCollector {public void applyLog(String log) {		// Iterate over filters and see if		// a match has occurred		for (int i = 0; i < filters.length; ++i) {			if (filters[i].isMatch(log)) {				generateMetrics(filters[i], log);			} //fi		} //rof	}..}

In order to be efficient, ensure that statistic filter patterns are compiled and cached. Duplicates should be consolidated. This is so that messages are processed once for each distinct pattern. If a match is detected then a reading can be generated for each statistic to which the pattern applies.

The statistic calculator, in conjunction with the statistic type (single or dual), will determine whether a single or a dual match is sufficient to trigger a change to a statistic value. A dual statistic always defines two filters (regular expressions). If a single match is sufficient, a metric can be constructed immediately and forwarded on to the appropriate calculator to process. If a dual reading is required, the first reading will be cached and only when the second filter matches will the second reading be created and a metric constructed and forwarded. This approach requires that readings for dual statistics be cached until the second reading is generated. The advantage of creating metrics across multiple readings is that changes in time, memory, or scraped values are encapsulated inside the metric. Imagine the MetricCollector as part of a larger agent-based architecture. Metrics can be collated at point of capture and then forwarded on to a centralized server. By deferring metric creation until the second reading, the amount of traffic between the agent (MetricCollector) and the server is reduced.

Thread-global vs. Thread-Local
For a dual statistic (duration, for example), you need to specify whether the measurement applies within a given thread (i.e., method call duration) or across all threads (e.g., user session duration). A typical enterprise application has multiple concurrent threads that may be calling the same method and logging the same messages. In order to generate an accurate metric for a method call duration, the start and end readings (matching start and end logs from the method itself) must be correlated within a particular thread. For low-level statistics, you cannot mix and match across threads.

Because the log interceptor runs in the same process space as the application, the thread that generates a log message passes into the interceptor. This means that the interceptor can leverage java ThreadLocal to cache all readings made for a particular thread. In a thread-pooled environment, this works quite well as the number of threads—and therefore the number of caches—is gated.

In the MetricCollector, back up your reading cache with ThreadLocal. If the statistic is collected locally, use the ThreadLocal cache. Otherwise, default to the global cache:

public class MetricCollector {...	// Reading caches - one for within threads	// and the other global	protected ThreadLocal localReadingCache = new ThreadLocal() {		protected synchronized Object initialValue() {			return new HashMap();		}	};	protected Map readingCache;..		protected Map getReadingCache(Statistic statistic) {		if (statistic.threadLocal()) {			// look for match locally			return (Map) localReadingCache.get();		} else {			return readingCache;		}	}..

Sometimes, pairing up readings can be difficult because the correlation is context sensitive. For example, session durations can be measured only if readings can be correlated across logins and logouts for a particular user. In order to support this type of semantics, you have to be able to tag a reading dynamically with a value from the log message (i.e., user name) that triggered the reading.

Unfortunately, log interception can’t tell you if a log hasn’t been generated. For a dual statistic, it is quite difficult to detect when the first filter matches but the second filter fails (as the thread was consumed by a rogue process). Only by monitoring how long a reading has been cached and setting an upper reading lifetime limit can such situations be found and alerted.

Log Scraping
Reading measurements can be generated using System.currentTimeInMillis() for time and Runtime.getRuntime().freeMemory() for memory. However, sometimes the value that you want measured is in the log message itself. You can use regular expressions to scrap the value from text. If a statistics filter specifies a scrap expression, then it will be used to extract a numeric reading value. For example, to scrap the numeric from the message “Time to process post=5.0”, use the scrap regular expression “.*post=([0-9].*)+”. The reading will contain a value of 5.0:

public Double scrapUserDefinedValue(String text, String pattern) {		// if the text is empty or null just return no reading		if ((text == null)			|| (text.length() == 0)			|| (pattern == null)			|| (pattern.length() == 0)) {			return null;		}		// get pre-compiled pattern		Pattern p  = getPattern(pattern);				// Create a matcher		Matcher m = p.matcher(text);				try {			if (!m.matches())				return null;							// Get first match group			String str = m.group(1);			// Use the formatter to parse out the value . The formatter will			// take account of the country specific formatting of the number			// whereas Double.parse will not.						DecimalFormat df = (DecimalFormat) format.get();			Number value = df.parse(str);			return new Double(value.doubleValue());		} catch (Exception ex) {			return null;		}

The UserCount statistic is slightly unusual. It is a dual statistic, but a filter match on either pattern (login or logout) will elicit a change to the number of users. So a metric is constructed and passed on to the runningcount calculator for every match.

Statistic Derivation
Once the MetricCollector has created a metric, it is immediately passed to a calculator to process. Calculators are discrete components that derive and maintain statistic results. Calculators are responsible for managing their own data. Create one calculator instance per statistic.

Define a common calculator interface. The interface formalizes the collaborations between calculators and MetricCollector, constrains calculator behavior, and allows new types of calculators to be added transparently. The Calculator interface should at a minimum provide a way to apply metrics and to access the results. Only so many different types of calculators are possible, so once you have the basic set established (i.e., count, minimum, maximum, and average) you just reuse across statistics:

Create a base calculator interface:

public interface Calculator {	..	public String getName();	public void applyMetric(Metric metric);	public void reset();	public double getResult();	public long	getTimestamp();}

The UserCount statistic uses a runningcount calculator. Create a RunningCount calculator class that implements the calculator interface. The MetricCollector will load it when it starts up. Inside, it maintains a runningcount. The count will be incremented on receipt of a login event and decremented on receipt of a logout event:

public class RunningCount extends Count {..	public void applyMetric(Metric metric) {		setTimestamp(metric.getTimestamp());		if (metric.getFirstReading() != null) {			count = count + 1;		}else {			count = count -1;		}	}	}

Statistic Reporting
You now have all the building blocks in place to intercept logs, generate metrics, and derive statistic results. You can easily add new statistics and calculator types via configuration. All that remains is to run your application and do something useful with your statistical data.

Statistic results can be stored in the DBMS for later analysis, logged, or forwarded on to other systems to manage or visualize. Dumping results to logs isn’t particularly useful for real-time monitoring, as it requires someone or something to watch the logs for events of interest. For production systems, the real payoff is alerting: the ability to alert operators in real time of issues or events that indicate production problems.

Alerting
The simplest form of alerting is threshold-based alerting. Alerts are generated when statistical values equal or exceed predefined limits. For example, when the UserCount >= 100, or when memory is low, or when the error rate exceeds a certain threshold. To extend the log interceptor so that it supports alerting, pre-configure your alerts. An Alert must specify a statistic and an alert threshold rule. Extend the MetricCollector to evaluate each alert when a metric is generated. If an alert rule fires, then a notification can be generated and sent to a special log4j alerts category. It’s then a simple matter to redirect all alert notifications (using existing log4j appenders) to the appropriate channel: stdout, file, email, NT Event Log, JMS, or some other target:

protected void calculateStatistic(Calculator calculator, Metric metric) {		calculator.applyMetric(metric);			Alert a = (Alert) alerts.get(metric.getStatisticName());				if (a != null) {			a.evaluateAlert(calculator.getResult());		}		}

Configure a UserCount alert for when the number of users exceeds 100:

alert.MaxUserLimit.description=Maximum Number of users reachedalert.MaxUserLimit.statistic=UserCountalert.MaxUserLimit.warn= >100alert.MaxUserLimit.category=alerts

To send alerts to the NT event logger, add the appropriate log4j configuration for the alerts category.

Rates—Error Rates
To illustrate another statistic, define an ErrorRate statistic for measuring the number of error occurrences over a period of time (e.g., error rate per second). The filter trigger for the statistic is any log message that contains the text “Exception”. Use the Rate statistic calculator:

statistic.ErrorRate.description=Errors per secondstatistic.ErrorRate.calculator=Ratestatistic.ErrorRate.first.match=.*Exception.*

Create a Rate calculator that implements the Calculator interface:

public class Rate extends CalculatorAdapter {		public static final long DEFAULT_PERIOD = 60000; // 1 minute (in millis)	protected long starttimestamp;	protected double	rate;	protected long 		period;	protected long		occurrences;	...		public void applyMetric(Metric metric) {						occurances = occurances + 1;		if (metric.isSingle()) {						double elapsedTime = metric.getReading(Unit.TIME) - starttimestamp;			double divisor = elapsedTime / period;			rate = occurrences / divisor;			setTimestamp(metric.getTimestamp());		}	}		public double getResult() {		return rate;	}}

Whenever an exception is detected in a log message, the Rate calculator will recalculate the ErrorRate. If you want to detect significant rate changes, define an error rate alert. For example, alert the operator via an email whenever the error rate exceeds 10 errors per minute:

alert.ErrorRateIncrease.description=ErrorRate Increasealert.ErrorRateIncrease.statistic=ErrorRatealert.ErrorRateIncrease.warn= >=5alert.ErrorRateIncrease.critical= >=10alert.ErrorRateIncrease.category=alerts

Fine-grained Control Over Logs
Log interception is just one approach to application instrumentation. The advantage of log interception is that it is real-time, simple, pure Java—and relatively lightweight. With so few classes or extra processing, it is a production-friendly approach. By leveraging log4j you have fine-grained control over which logs get processed and at which level. Measurement can be turned on and off with one line of configuration.

From the single JVM monitoring solution shown here, you can scale up for enterprise/distributed architectures. You can capture metrics from multiple JVMs and channel them to a centralized server or push them to a UI via RMI or JMS.

The downside is that log interception will only ever be as good as the data that’s logged—garbage in, garbage out. However, for applications with information-rich logs, log interception is a powerful yet simple way to detect bottlenecks and forewarn of production issues.

Share the Post:
Share on facebook
Share on twitter
Share on linkedin

Related Posts